HTML Parsing with htmlparser
htmlparser
Home page: http://sourceforge.net/projects/htmlparser/
Download: http://sourceforge.net/project/showfiles.php?group_id=24399
File: HTMLParser-2.0-SNAPSHOT-bin.zip
cpdetector
Home page: http://cpdetector.sourceforge.net/
Download: http://sourceforge.net/project/showfiles.php?group_id=114421
File: cpdetector_eclipse_project_1.0.7.zip
After unpacking the archive, run the Ant packaging target below. build.xml may need a few small adjustments to match your environment.
ant jar.htmlentitydecoder
This produces the JAR:
cpdetector_1.0.7.jar
HTML utility function 1: auto-detect the encoding of the HTML content at a URL
/**
 * Auto-detects the character encoding of a page.
 *
 * @param url the address of the page to inspect
 * @return the detected charset name, or the platform default if detection fails
 */
public static String autoDetectCharset(String url) {
    CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
    detector.add(new ParsingDetector(false));
    detector.add(JChardetFacade.getInstance());
    detector.add(ASCIIDetector.getInstance());
    detector.add(UnicodeDetector.getInstance());
    Charset charset = null;
    try {
        // Building the URL inside the try block avoids passing null to the detector.
        charset = detector.detectCodepage(new URL(url));
    } catch (MalformedURLException e) {
        log.error(e);
    } catch (IOException e) {
        log.error(e);
    }
    if (charset == null) {
        charset = Charset.defaultCharset();
    }
    return charset.name();
}
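For readers without cpdetector at hand, the idea of trying several detection strategies and falling back to a default can be sketched with the JDK alone. The class below is not cpdetector's algorithm. It is a simplified stand-in that checks for a UTF-8 byte-order mark, then for a charset declared in a <meta> tag, and finally falls back to the platform default, mirroring the null check above. The names CharsetSniffer and sniff are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of cpdetector: a simplified detection chain.
final class CharsetSniffer {
    // Matches charset=... inside a <meta> tag, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] head) {
        // Step 1: a UTF-8 byte-order mark is an unambiguous signal.
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Step 2: decode as ISO-8859-1 (every byte maps to one char) and
        // look for a charset declared in the markup.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the default.
            }
        }
        // Step 3: nothing usable found, so use the platform default.
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page).name());
    }
}
```

A real detector such as cpdetector also inspects byte statistics, which this sketch does not attempt, so it only helps for pages that declare their encoding.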
HTML utility function 2: read the HTML text from a URL
/**
 * Reads the HTML content at a URL.
 *
 * @param url the address to read
 * @param charset the name of the charset used to decode the content
 * @return the content, or whatever was read before a failure (possibly empty)
 */
public static String readURL(String url, String charset) {
    /* Initial capacity of the StringBuffer */
    int TRANSFER_SIZE = 4096;
    /* Line separator of the current platform */
    String lineSep = System.getProperty("line.separator");
    StringBuffer temp = new StringBuffer(TRANSFER_SIZE);
    BufferedReader reader = null;
    try {
        URL source = new URL(url);
        reader = new BufferedReader(new InputStreamReader(source.openStream(), charset));
        String line;
        while ((line = reader.readLine()) != null) {
            temp.append(line);
            temp.append(lineSep);
        }
    } catch (IOException e) {
        // Also covers MalformedURLException and UnsupportedEncodingException.
        log.error(e);
    } finally {
        // Closing the reader also closes the underlying stream.
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                log.error(e);
            }
        }
    }
    return temp.toString();
}
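On Java 8 or later the same read loop can be written with try-with-resources, which closes the reader even when an exception is thrown part-way through. The sketch below is an alternative, not the article's code: it takes an InputStream rather than a URL so it can be tried without network or file access, it rethrows failures as UncheckedIOException instead of logging them, and the names StreamText and readAll are invented here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the read loop with automatic cleanup.
final class StreamText {
    static String readAll(InputStream in, String charset) {
        StringBuilder sb = new StringBuilder(4096);
        // The reader, and with it the wrapped stream, is closed when the block exits.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is terminated with the platform line separator.
                sb.append(line).append(System.lineSeparator());
            }
        } catch (IOException e) {
            // Surface the failure to the caller instead of returning partial text.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] html = "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(html), "UTF-8"));
    }
}
```

With a URL as the source, source.openStream() could be passed as the first argument.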
HTML utility function 3: parse the HTML and extract its tags
public static NodeList getFormNodeList(String url) {
    // Detect the charset once and use it for both reading and parsing.
    // (readURL takes two arguments, so the original readURL(url) call did not compile.)
    String charset = autoDetectCharset(url);
    Parser parser = Parser.createParser(readURL(url, charset), charset);
    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
    factory.registerTag(new ScclSelectBizCodesTag());
    factory.registerTag(new InputTag());
    factory.registerTag(new TextareaTag());
    parser.setNodeFactory(factory);
    NodeFilter formFilter = new PostFormFilter();
    NodeList nodeList = null;
    try {
        nodeList = parser.extractAllNodesThatMatch(formFilter);
    } catch (ParserException e) {
        log.error(e);
    }
    return nodeList;
}
HTML utility function 4: read the tag attributes and build a PageField POJO for each tag
public static List<PageField> getPageFields(String url) {
    List<PageField> list = null;
    NodeList nodeList = getFormNodeList(url);
    if (nodeList != null && nodeList.size() > 0) {
        // nodeList is not empty, so start building the result
        list = new ArrayList<PageField>(nodeList.size());
        for (int i = 0; i < nodeList.size(); i++) {
            TagNode node = (TagNode) nodeList.elementAt(i);
            if (node instanceof InputTag) {
                InputTag input = (InputTag) node;
                PageField t = new PageField(input.getAttribute("name"),
                        PageField.TAG_TYPE_INPUT, input.getAttribute("type"));
                list.add(t);
            } else if (node instanceof ScclSelectBizCodesTag) {
                ScclSelectBizCodesTag scclSelectBizCodesTag = (ScclSelectBizCodesTag) node;
                PageField t = new PageField(scclSelectBizCodesTag.getAttribute("id"),
                        PageField.TAG_TYPE_SELECT, null);
                list.add(t);
            } else if (node instanceof TextareaTag) {
                TextareaTag textArea = (TextareaTag) node;
                PageField t = new PageField(textArea.getAttribute("name"),
                        PageField.TAG_TYPE_TEXTAREA, null);
                list.add(t);
            }
        }
    }
    return list;
}
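The article never shows the PageField class itself. Judging only from the constructor calls and constants used above, it could look roughly like the sketch below. The field names, the String type of the TAG_TYPE_* constants, and their values are guesses, not the author's code.

```java
// Guessed shape of PageField, inferred from how getPageFields uses it.
final class PageField {
    // Assumed constant values; the original values are not shown in the article.
    static final String TAG_TYPE_INPUT = "input";
    static final String TAG_TYPE_SELECT = "select";
    static final String TAG_TYPE_TEXTAREA = "textarea";

    private final String name;      // value of the tag's name (or id) attribute
    private final String tagType;   // one of the TAG_TYPE_* constants
    private final String inputType; // the type attribute of an <input>, otherwise null

    PageField(String name, String tagType, String inputType) {
        this.name = name;
        this.tagType = tagType;
        this.inputType = inputType;
    }

    String getName() {
        return name;
    }

    String getTagType() {
        return tagType;
    }

    String getInputType() {
        return inputType;
    }

    @Override
    public String toString() {
        return "PageField[name=" + name + ", tagType=" + tagType
                + ", inputType=" + inputType + "]";
    }

    public static void main(String[] args) {
        // Mirrors the call made for an <input type="text" name="orderNo"> tag.
        PageField field = new PageField("orderNo", TAG_TYPE_INPUT, "text");
        System.out.println(field);
    }
}
```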
Adding the custom tag <sccl:selectBizCodes>
public class ScclSelectBizCodesTag extends TagNode {
    private static final long serialVersionUID = -6352090777443844707L;
    private static final String[] ids = new String[] { "sccl:selectBizCodes" };
    public String[] getIds() {
        return (ids);
    }
    public String[] getEnders() {
        return (ids);
    }
    public String getCategory() {
        return super.getAttribute("category");
    }
    public String getId() {
        return super.getAttribute("id");
    }
    public String getSelected() {
        return super.getAttribute("selected");
    }
}
Selecting the tags to visit with a filter
public class PostFormFilter implements NodeFilter {
    private static final long serialVersionUID = 8162322553987269165L;
    public boolean accept(Node node) {
        if (node instanceof InputTag) {
            return true;
        }
        if (node instanceof ScclSelectBizCodesTag) {
            return true;
        }
        if (node instanceof TextareaTag) {
            return true;
        }
        return false;
    }
}
Test
public static void main(String[] args)
        throws org.htmlparser.util.ParserException, IOException {
    String url = "file:///E:\\work\\html\\editOrder.jsp";
    List<PageField> list = getPageFields(url);
    list.get(0);
}
The code above handles <input>, <textarea>, and custom select-style tags such as:
<sccl:selectBizCodes category="worksheet" id="worksheetCode" selected="cl" onChange="go();" html="style='test';"/>
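Problem 1
After copying cpdetector_1.0.7.jar into the project, also copy chardet.jar from the ext directory into lib. Otherwise the call
detector.add(JChardetFacade.getInstance());
fails because this class cannot be found:
nsICharsetDetectionObserver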
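A quick way to confirm that chardet.jar is visible is to try loading the missing class by name before wiring up the detector. The sketch below uses only Class.forName. The fully qualified name org.mozilla.intl.chardet.nsICharsetDetectionObserver is an assumption about jchardet's package, so replace it with whatever name your stack trace reports. ClasspathCheck and isPresent are names made up for this example.

```java
// Hypothetical diagnostic helper: reports whether a class can be loaded.
final class ClasspathCheck {
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class, or something it depends on, is not on the classpath.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed fully qualified name; adjust to match your error message.
        String needed = "org.mozilla.intl.chardet.nsICharsetDetectionObserver";
        System.out.println(needed + " present: " + isPresent(needed));
    }
}
```

If the check prints false, chardet.jar is not on the runtime classpath.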
Problem 2
Copy the following htmlparser JARs into the project:
htmlparser.jar
htmllexer.jar