TextExtract(1)Tika Basic
1. Introduction
Tika supports many different file formats, including audio, video, image and text files.
The Tika bundle includes tika-app, a single jar that works both as a GUI and as a command-line tool.
Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
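To make the MIME type part more concrete, here is a minimal sketch of the general idea behind content-based type detection: look at the first few bytes of a file and compare them with known signatures. This is not Tika's implementation, which uses a much larger signature database and also considers file names. The class name MagicSniffer and its methods are made up for this illustration, and it only knows two signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: guesses a MIME type from leading bytes.
class MagicSniffer {

    // PDF files start with the ASCII text "%PDF-"
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP containers (also used by docx, xlsx, jar) start with 'P' 'K' 0x03 0x04
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // True when data begins with exactly the bytes in prefix
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Falls back to the generic binary type when no signature matches
    static String guessType(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessType(sample)); // prints application/pdf
    }
}
```

In real code I would let Tika do the detection instead of maintaining signatures by hand.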
There are 3 files:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true
Running tika-app from the command line or double-clicking the jar both work.
> java -jar tika-app-1.10.jar --gui
Then we can choose files and switch the view to see the different contents extracted from them.
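Besides --gui, tika-app can also be used without the GUI. The sketch below builds such a command line from Java. The flags --text, --metadata, --language and --detect are what I remember from the tika-app 1.x usage text, so please check them with --help for your version. The class name TikaAppCommand and the mode names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles a tika-app command line, it does not run it.
class TikaAppCommand {

    // Maps a mode name (invented for this sketch) to a tika-app flag.
    // The flags are assumed from the tika-app 1.x usage text.
    static String flagFor(String mode) {
        switch (mode) {
            case "text":
                return "--text";
            case "metadata":
                return "--metadata";
            case "language":
                return "--language";
            case "detect":
                return "--detect";
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    // Builds: java -jar <jarPath> <flag> <file>
    static List<String> build(String jarPath, String mode, String file) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add(flagFor(mode));
        cmd.add(file);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("tika-app-1.10.jar", "text", "/opt/data/resume/3-resume.pdf");
        // prints: java -jar tika-app-1.10.jar --text /opt/data/resume/3-resume.pdf
        System.out.println(String.join(" ", cmd));
    }
}
```

To actually execute the command, the list can be passed to new ProcessBuilder(cmd).inheritIO().start().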
2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file.
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the given file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // Parse the file once: the handler collects the body text and
        // metadata collects the document properties.
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }

        // fetch the meta
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify language from the text collected by the handler;
        // parsing a second time into the same handler would duplicate the text
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("Language name :" + object.getLanguage());
    }
}
References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8
books
Tika in Action.pdf
http://m.yiibai.com/tika/tika_content_extraction.html