Research on Web Crawlers in Search Engines (搜索引擎中网络爬虫的研究)

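English title: Research on the Crawler of Search Engine
Author: 龚勇 (Gong Yong)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 搜索引擎; 主题爬虫; Context Graph; 特征选择
English keywords: Search engine; Focused crawler; Context Graph; Feature selection
Degree year: 2010
Supervisor: 刘东飞 (Liu Dongfei)
Discipline code: 081203
Degree-granting institution: Wuhan University of Technology (武汉理工大学)
Thesis submission date: 2010-04-01
Defense committee chair: 方安平 (Fang Anping)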

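Abstract

Search engines apply information retrieval technology to the Internet and let people obtain online resources more effectively. As the Internet has grown, however, traditional general-purpose search engines have become less and less able to meet the increasing demand for information retrieval services, and topic-oriented (focused) search engines have emerged in recent years. This thesis studies and discusses the techniques behind the focused crawler, a key component of a focused search engine.
A web crawler fetches pages from the Internet. A general crawler starts from a set of seed links and aims to retrieve every page on the Internet. A focused crawler aims to retrieve only pages relevant to a specific topic, so in addition to the basic functions of a general crawler it must analyze page content and links in order to guide and predict its crawl path. The crawling strategy a focused crawler uses to traverse the Internet directly determines its efficiency. This thesis studies and improves the focused crawling algorithm based on the Context Graph. The main work is as follows:
(1) The technical principles and workflow of general and focused web crawlers in search engines are studied, with emphasis on focused crawling strategies. The commonly used link-analysis-based and content-analysis-based strategies are analyzed and compared.
(2) Traditional focused crawling algorithms do not handle the "tunneling" problem well. This thesis describes in detail a focused crawling algorithm based on the Context Graph. By predicting the layer that a newly fetched page occupies in the Context Graph, the algorithm steers the crawler along the path most likely to reach target pages and so handles tunneling better.
(3) The Context Graph crawling algorithm is improved with a feature selection method based on word-frequency difference and a modified TF-IDF formula. A word's category weight is added as an adjustment to TF-IDF in order to raise the quality of feature selection and evaluation.
(4) A prototype focused crawler is implemented, and the algorithms are analyzed and compared experimentally. The experiments confirm that the improved algorithm yields more accurate features and weights for the document set and thereby improves the performance of the focused crawler.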
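
The layer-guided crawl order described in item (2) can be sketched as follows. This is a minimal Python illustration, not the thesis's implementation: `LayeredFrontier`, `crawl`, `fetch`, and `predict_layer` are hypothetical names, and the layer classifier is assumed to be supplied by the caller. Links found on a page inherit that page's predicted layer, and the crawler always dequeues from the lowest layer available, so a page that looks close to a target pulls its outlinks ahead of the rest of the queue.

```python
import heapq
from itertools import count


class LayeredFrontier:
    """Crawl frontier ordered by predicted Context Graph layer (0 = target)."""

    def __init__(self, num_layers):
        self.other = num_layers      # bucket for pages that fit no layer
        self._heap = []
        self._seen = set()
        self._tick = count()         # tie-breaker: FIFO within one layer

    def push(self, url, parent_layer):
        """Queue a link under the layer predicted for the page it came from."""
        if url in self._seen:
            return
        self._seen.add(url)
        layer = min(parent_layer, self.other)
        heapq.heappush(self._heap, (layer, next(self._tick), url))

    def pop(self):
        """Return the queued URL with the lowest layer, oldest first."""
        layer, _, url = heapq.heappop(self._heap)
        return url, layer

    def __len__(self):
        return len(self._heap)


def crawl(seeds, fetch, predict_layer, num_layers, max_pages):
    """Visit up to max_pages pages, preferring links from low-layer pages.

    fetch(url) -> (text, links) and predict_layer(text) -> int are
    assumed helpers; any downloader and classifier could stand in.
    """
    frontier = LayeredFrontier(num_layers)
    for url in seeds:
        frontier.push(url, 0)
    visited = []
    while len(frontier) and len(visited) < max_pages:
        url, _ = frontier.pop()
        text, links = fetch(url)
        layer = predict_layer(text)
        visited.append((url, layer))
        for link in links:
            frontier.push(link, layer)
    return visited
```

With a toy classifier that maps page text to a layer, links taken from a page predicted at layer 1 are crawled before links queued from layer-2 pages, which is how the crawler can follow a promising path instead of exhausting unrelated links first.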

English Abstract

As an application of information retrieval technology in the Internet era, the search engine lets people obtain network resources more effectively. With the development of the Internet, however, the traditional (general-purpose) search engine can no longer satisfy the growing demand for information retrieval services. This thesis studies and discusses the techniques behind the focused crawler, which plays an important role in a focused search engine.
A web crawler downloads web pages from the Internet. Starting from a set of seed links, a general web crawler tries to reach every page on the Internet. A focused crawler aims to retrieve pages related to a given topic; beyond the basic functions of a general crawler, it must analyze the links and content of web pages to guide and predict its crawling path. The crawling strategy it uses has a significant impact on its efficiency. This thesis studies and improves the focused crawling algorithm based on the Context Graph. The main work is as follows:
(1) The technical principles and workflow of general and focused crawlers are studied, with a careful analysis of focused crawling strategies. The strengths and weaknesses of the link-analysis-based and content-analysis-based strategies commonly used by focused crawlers are presented and compared.
(2) Because traditional focused crawling algorithms handle the "tunneling" problem poorly, a crawling algorithm based on the Context Graph is described in detail. By predicting the layer of each web page in the context graph, the algorithm advances along the most promising path to the target documents, crawls fewer irrelevant pages, finds target documents sooner, and so mitigates the tunneling problem.
(3) To improve the quality of feature selection and evaluation in the Context Graph algorithm, a feature selection method based on word-frequency difference is used together with a modified TF-IDF formula that incorporates a word's category weight.
(4) A prototype focused crawler is implemented. The experimental results show that the improved algorithm raises the quality of feature selection and the performance of the focused crawler.
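
Item (3) combines feature selection by word-frequency difference with a TF-IDF variant that includes a category weight. The abstract does not give either formula, so the sketch below is only one plausible reading: the frequency-difference score, the smoothed IDF, and the multiplicative `category_weight` factor are all assumptions, as are the function names. Documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter


def term_frequency_difference(target_docs, other_docs):
    """Score each target term by its relative frequency in the target
    documents minus its relative frequency in the other documents.
    (Assumed form of the "word-frequency difference" criterion.)"""
    def rel_freq(docs):
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values()) or 1
        return {t: c / total for t, c in counts.items()}

    tgt, oth = rel_freq(target_docs), rel_freq(other_docs)
    return {t: f - oth.get(t, 0.0) for t, f in tgt.items()}


def select_features(target_docs, other_docs, k):
    """Keep the k terms with the largest frequency difference."""
    scores = term_frequency_difference(target_docs, other_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def weighted_tfidf(doc, corpus, category_weight):
    """tf * idf * category weight for every term in doc.

    category_weight maps term -> multiplier (default 1.0); treating it
    as a plain multiplier is an assumption, not the thesis's formula."""
    n = len(corpus)
    weights = {}
    for term, c in Counter(doc).items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n + 1) / (df + 1)) + 1.0   # smoothed idf
        weights[term] = (c / len(doc)) * idf * category_weight.get(term, 1.0)
    return weights
```

Under these assumptions, a term that is frequent in the topic documents but rare elsewhere is both more likely to be selected as a feature and, given a category weight above 1.0, scored higher than an equally frequent term that is spread evenly across categories.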
