Customizable Focused Web Crawler

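English title: Customizable Focused Crawler
Author: Zou Hailiang (邹海亮)
Thesis level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 聚焦爬虫 ; 垂直搜索 ; 主题定制 ; Ajax页面解析
English keywords: Focused Crawler ; Vertical Search Engine ; Topic Customization ; Ajax Parser
Degree year: 2009
Advisor: Sun Li (孙莉)
Discipline code: 081202
Degree-granting institution: Donghua University (东华大学)
Thesis submission date: 2008-12-01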

Abstract

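On the Internet, users' information needs usually target a particular domain and a specific topic, and in these respects the recall and precision of traditional search engines are unsatisfactory. Topic-oriented vertical search engines aim to provide search services with accurate classification, comprehensive data, and timely updates, which gives them a distinct advantage in meeting users' individual needs.
     Behind every high-performing search engine stands a powerful web crawler, whose performance directly affects the engine's recall and precision. A focused crawler extends a traditional crawler by computing the topic relevance of web pages and evaluating the topic relevance of links. Focused crawling is a current research hotspot, but because human-language concepts are fuzzy and polysemous and web information resources are semi-structured, it still faces recognized difficulties in topic judgment and evaluation, natural language understanding, and tunneling.
     This thesis proposes a Customizable Focused Crawler (CFC). Its main contributions are:
     (1) A topic customization algorithm is studied and implemented. Through interaction between the user and the computer, the user's topic information is described with a vector space model, so that the computer can better understand and represent the user's interests.
     (2) Parsing of Ajax pages is implemented. Web 2.0 has become a mainstream Internet technology, and more and more pages use Ajax. For such pages, much of the text visible in the browser does not appear in the HTML source file, so parsing Ajax pages is bound to improve the crawler's recall. This thesis mainly handles Ajax operations that occur in the page-load function.
     (3) For tunneling, a simple and effective tolerance algorithm is proposed. Imitating human behavior, the algorithm does not immediately discard a page or link that is irrelevant to the topic; instead, depending on the tolerance threshold, it tentatively keeps the currently irrelevant link.
     (4) A search strategy based on link value is studied and implemented. It uses evaluation methods based on both link structure and content, weighing a link's topicality and authority together to decide its rank in the queue.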
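     The tolerance idea in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The toy graph `LINKS`, the scores in `RELEVANCE`, the cutoff `RELEVANT_AT`, and the function name `crawl_with_tolerance` are all hypothetical. The tolerance threshold is assumed here to mean the number of consecutive off-topic pages the crawler will follow before abandoning a path.

```python
from collections import deque

# Hypothetical toy web: page -> outgoing links (an assumption, not from the thesis).
LINKS = {
    "seed": ["a", "x"],
    "a": ["b"],
    "x": ["y"],
    "y": ["z"],
    "z": ["target"],
    "b": [],
    "target": [],
}
# Hypothetical topic-relevance scores in [0, 1].
RELEVANCE = {
    "seed": 0.9, "a": 0.8, "b": 0.7,
    "x": 0.1, "y": 0.2, "z": 0.1, "target": 0.9,
}
RELEVANT_AT = 0.5  # assumed cutoff between on-topic and off-topic pages


def crawl_with_tolerance(seed, tolerance):
    """Breadth-first crawl that follows at most `tolerance` consecutive off-topic pages."""
    visited = set()
    collected = []
    queue = deque([(seed, 0)])  # (page, off-topic pages in a row before this page)
    while queue:
        page, misses = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        if RELEVANCE[page] >= RELEVANT_AT:
            collected.append(page)
            misses = 0  # an on-topic page resets the tolerance budget
        else:
            misses += 1
            if misses > tolerance:
                continue  # tolerance exhausted: do not expand this path further
        for link in LINKS[page]:
            queue.append((link, misses))
    return collected
```

     In this toy graph the on-topic page `target` sits behind three off-topic pages, so a crawl with a tolerance of 3 reaches it while a crawl with a tolerance of 0 or 2 does not. A real crawler would also need to decide what to do when a page is first reached through a path that has already used up more of the budget than a later path would; this sketch simply keeps the first visit.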
On the Internet, users' information needs are usually aimed at a particular field and a specific subject, and in these respects the recall and precision of traditional search engines are unsatisfactory. A subject-oriented vertical search engine aims to provide a search service with precise classification, comprehensive data, and timely updates, which gives it a particular advantage in satisfying individual needs.
     Behind every powerful search engine there is a powerful crawler, whose performance determines how well the engine satisfies users in terms of recall and precision. Building on a traditional crawler, a focused crawler evaluates the topic relevance of web page content and URLs. Focused crawling is a current research focus, but the ambiguity and polysemy of human language and the semi-structured nature of network information resources block further progress, leaving many difficulties in topic judgment and evaluation, natural language understanding, and tunneling.
     This paper presents a Customizable Focused Crawler (CFC). Its main contents are:
     (1) Study and implementation of a customization algorithm. Based on communication between the user and the computer, a topic model is built with the vector space model, which expresses the user's interest more explicitly and lets the computer understand it better.
     (2) Implementation of an Ajax interpreter. Web 2.0 has become a mainstream technology, and more and more pages use Ajax. For such pages, rich information seen in the browser cannot be found in the HTML source file, so the Ajax interpreter is bound to improve recall. This paper handles Ajax operations that appear in the page-load function.
     (3) For tunneling, this paper presents a simple and effective algorithm called tolerance. Imitating human behavior, the algorithm does not immediately abandon a page or link that is unrelated to the topic; depending on the size of the tolerance threshold, it tentatively keeps the link as if it were relevant.
     (4) Implementation of a search strategy based on link value. The method uses both link-structure-based and content-based evaluation, considering the topic relevance and the authority of links together so that more valuable links receive higher priority.
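     The link-value strategy described above can be sketched as a priority queue in Python. This is only an illustration under stated assumptions. The abstract says that topic relevance and authority are considered together but does not give the formula, so the weighted sum, the weight `alpha`, and the class name `LinkFrontier` are hypothetical. The scores are assumed to be computed elsewhere and to lie in [0, 1].

```python
import heapq
import itertools


class LinkFrontier:
    """Priority queue of URLs ordered by a combined link value (highest first)."""

    def __init__(self, alpha=0.7):
        # alpha is a hypothetical weight: the share of the value taken from topic relevance.
        self.alpha = alpha
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal values pop in insertion order

    def link_value(self, topic_score, authority_score):
        # Assumed combination rule: a weighted sum of the two scores.
        return self.alpha * topic_score + (1 - self.alpha) * authority_score

    def push(self, url, topic_score, authority_score):
        value = self.link_value(topic_score, authority_score)
        # heapq is a min-heap, so the negated value is stored to pop the best link first.
        heapq.heappush(self._heap, (-value, next(self._counter), url))

    def pop(self):
        neg_value, _, url = heapq.heappop(self._heap)
        return url, -neg_value

    def __len__(self):
        return len(self._heap)


# Usage with made-up scores: an authoritative hub can outrank a more topical page.
frontier = LinkFrontier(alpha=0.5)
frontier.push("http://example.com/on-topic", topic_score=0.9, authority_score=0.2)
frontier.push("http://example.com/hub", topic_score=0.4, authority_score=0.9)
frontier.push("http://example.com/off-topic", topic_score=0.1, authority_score=0.1)
crawl_order = [frontier.pop()[0] for _ in range(len(frontier))]
```

     With `alpha=0.5` the hub page is crawled first, then the on-topic page, then the off-topic page. Raising `alpha` shifts the ordering toward topic relevance.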
