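Research on a Personalized Web Crawler Based on a Rule Engine

English title: Research Personalized Web Crawler Based on Rules Engine
Author: Zhao Sijia (赵思佳)
Degree level: Master's
Discipline: Computer Technology
Keywords: search engine (搜索引擎); focused crawler (主题爬虫); rule engine (规则引擎); vertical search (垂直搜索)
Degree year: 2010
Advisors: Wu Min (吴敏); Fang Sheng (方胜)
Discipline code: 081202
Degree-granting institution: Central South University (中南大学)
Thesis submission date: 2010-09-01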

Abstract

The Internet has become a necessity of everyday life: people routinely look up information online for both work and daily life, and search engines play a very important role in that process.
General-purpose search engines, led by Google, help users find information on the Internet, but their results are only the URLs of the pages where the information resides. This works well for static pages, but dynamic pages are increasingly common, and what users need is the structured information embedded in unstructured web pages, such as ticketing, real estate, and product information from different websites. At present such information can be obtained through the focused crawler of a vertical search engine. Vertical search engines generally extract it using one of two strategies: either the focused crawler first fetches pages, which are then analyzed and extracted, or the focused crawler performs extraction while fetching. The former covers a broader range of pages, but analysis is slow, many irrelevant pages are fetched, and efficiency is low. The latter is now generally used: it is more precise, fetches more accurately, and extracts page information faster.
Under either strategy, information extraction is highly targeted, yet current focused crawlers commonly suffer from inflexible configuration and insufficient user involvement. By studying search engine and rule engine technology, this thesis proposes using a rule engine to build the configuration mechanism of the search engine, with the aim of realizing a focused crawler that each user can configure individually.
The thesis designs the crawling process of the personalized focused crawler as three parts: a rule editor module, a rule engine module, and a crawler fetching module. The rule editor module first defines the rule base needed for crawling. During execution of a crawl task, the fact data and the rule base are both submitted to the rule engine module. Finally, the rule engine module directs the operation of the crawler fetching module according to the rules.
To simplify rule base setup, the crawler fetching module is divided into five sub-tasks: pre-fetch processing, fetch processing, content extraction processing, write and index processing, and post-processing. For each sub-task, the corresponding commonly used algorithms are converted into a form that the rule engine can process, so that users can flexibly adjust how the crawler works by editing rule base files. User control is then added to the whole personalized focused crawler, so that each user can configure their own crawler without affecting other users and can also share the rule bases they have set up.
Replacing the traditional configuration mode in this way improves configuration flexibility and lowers the difficulty of use. Finally, an example is used to demonstrate the feasibility of the approach.
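
The three-module design and the five sub-tasks described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the names `Rule`, `RuleBase`, `RuleEngine`, `crawl_page`, the stage identifiers, and the sample ticket rules are all hypothetical. The "engine" here is a plain condition/action loop rather than a real rule engine product, and no network fetch is performed (the page HTML is passed in directly).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# The five sub-tasks named in the abstract, in execution order (identifiers are assumed).
STAGES = ["pre_fetch", "fetch", "extract", "write_index", "post_process"]

Facts = Dict[str, object]


@dataclass
class Rule:
    """One condition/action pair bound to a crawl stage (hypothetical structure)."""
    stage: str
    name: str
    condition: Callable[[Facts], bool]
    action: Callable[[Facts], None]


@dataclass
class RuleBase:
    """A per-user collection of rules, standing in for the rule editor's output."""
    owner: str
    rules: List[Rule] = field(default_factory=list)

    def for_stage(self, stage: str) -> List[Rule]:
        return [r for r in self.rules if r.stage == stage]


class RuleEngine:
    """Fires every rule of a stage whose condition holds for the current facts."""

    def run_stage(self, stage: str, facts: Facts, rule_base: RuleBase) -> List[str]:
        fired = []
        for rule in rule_base.for_stage(stage):
            if rule.condition(facts):
                rule.action(facts)
                fired.append(rule.name)
        return fired


def crawl_page(url: str, html: str, rule_base: RuleBase,
               engine: Optional[RuleEngine] = None) -> Facts:
    """Push one page through the five stages, letting the rules steer each stage."""
    engine = engine or RuleEngine()
    facts: Facts = {"url": url, "html": html, "skip": False, "record": {}, "log": []}
    for stage in STAGES:
        # A page rejected by an earlier rule skips everything except post-processing.
        if facts["skip"] and stage != "post_process":
            continue
        facts["log"].extend(engine.run_stage(stage, facts, rule_base))
    return facts


def make_ticket_rules(owner: str) -> RuleBase:
    """Sample rule base for one user: only ticket pages, extract a price field."""

    def block(facts: Facts) -> None:
        facts["skip"] = True

    def extract_price(facts: Facts) -> None:
        match = re.search(r"price:\s*(\d+)", facts["html"])
        if match:
            facts["record"]["price"] = int(match.group(1))

    def mark_done(facts: Facts) -> None:
        facts["done"] = True

    return RuleBase(owner, [
        Rule("pre_fetch", "skip-non-ticket-urls",
             lambda f: "/ticket/" not in f["url"], block),
        Rule("extract", "extract-price",
             lambda f: "price:" in f["html"], extract_price),
        Rule("post_process", "mark-done", lambda f: True, mark_done),
    ])


if __name__ == "__main__":
    rules = make_ticket_rules("alice")
    print(crawl_page("http://example.com/ticket/1", "price: 120", rules)["record"])
```

Because each `RuleBase` carries its own `owner` and is passed explicitly to `crawl_page`, changing one user's rules does not affect a crawl run with another user's rule base, and sharing a rule base amounts to handing the same object (or file) to another user. How the thesis actually stores and shares rule bases is not stated in the abstract.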

