Design and Implementation of a Web Information Collection System

Original title: Web信息采集系统设计与实现
Author: 周林云 (Zhou Linyun)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Web information collection; Web information extraction; DOM; Jsoup; multithreading
Degree year: 2013
Advisor: 胡晓鹏 (Hu Xiaopeng)
Discipline code: 081202
Degree-granting institution: 西南交通大学 (Southwest Jiaotong University)
Thesis submission date: 2013-07-01

Abstract

With the rapid development and spread of mobile devices, people have become accustomed to getting information of interest through reading applications installed on those devices. As a result, platform vendors (including content providers) must build technical platforms to support this business model. Content for such a platform can be obtained in two ways: manual editing, or automatic collection from information sources by a program. This thesis designs a Web information collection solution for the latter.
     The thesis first introduces the research background and the current state of research, the technologies relevant to information extraction, and the working principles of information collection, and analyzes the structure of web pages. It then analyzes the system's functions and target users, models the system with use case diagrams and use case specifications, and analyzes the non-functional requirements. Next, it presents the overall system design and the database design, followed by the detailed design and implementation. Finally, the system is tested to verify the effectiveness of the solution. The main work is as follows:
     1. The thesis studies how to quickly locate target information in an HTML document. Extraction rules are designed from HTML tags and attributes together with DOM path expressions, and are generated automatically through a visual interface with simple human-computer interaction. On this basis, a practical solution for removing noise from the main text is designed.
     2. The system consists of two parts: a collection configuration subsystem and a collection subsystem. The configuration subsystem passes configured collection tasks to the collection subsystem through a Socket mechanism and thereby controls the starting and stopping of tasks, so that users obtain collection results without having to concern themselves with how collection runs.
     3. Based on the collection tasks the user has configured, the collection subsystem uses multithreading, database connection pooling, a dynamic collection strategy, and multi-page merging to collect, extract, de-noise, and de-duplicate information from the target sites on a schedule, keeping site-specific information regularly updated.
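
The three points above can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis works with Jsoup on real HTML, whereas this sketch substitutes the JDK's built-in XPath API on well-formed XHTML so that it runs without external dependencies. The class name `CollectionSketch`, the methods `extract` and `collect`, the sample pages, and the rule `//div[@class='article']/h1` are all hypothetical. The sketch shows an extraction rule expressed as a DOM path expression, pages processed on a thread pool, and duplicate results dropped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; not taken from the thesis.
class CollectionSketch {

    // Applies one extraction rule (a DOM path expression, written as XPath here)
    // to a well-formed XHTML page and returns the trimmed text of each matched node.
    static List<String> extract(String xhtml, String pathRule) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(pathRule, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    // Extracts from all pages on a fixed-size thread pool, then de-duplicates
    // the results in submission order on the calling thread.
    static List<String> collect(List<String> pages, String pathRule, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String page : pages) {
                Callable<List<String>> task = () -> extract(page, pathRule);
                pending.add(pool.submit(task));
            }
            Set<String> unique = new LinkedHashSet<>();
            for (Future<List<String>> result : pending) {
                unique.addAll(result.get());
            }
            return new ArrayList<>(unique);
        } catch (Exception e) {
            throw new IllegalStateException("collection failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up sample pages: the "ad" block stands in for page noise.
        String pageA = "<html><body><div class=\"article\"><h1>Title A</h1></div>"
                + "<div class=\"ad\">Buy now</div></body></html>";
        String pageB = "<html><body><div class=\"article\"><h1>Title B</h1></div></body></html>";
        List<String> titles = collect(
                Arrays.asList(pageA, pageB, pageA), "//div[@class='article']/h1", 2);
        System.out.println(titles); // prints [Title A, Title B]
    }
}
```

Because the rule selects only the headline inside the `article` block, the advertisement block is never extracted, which is the simplest form of rule-based de-noising. A real collector would additionally need a lenient HTML parser such as Jsoup, a timer for scheduled runs, and persistent storage for the de-duplication set.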

