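过滤型网络爬虫的研究与设计 (Research and Design of a Filtering Web Crawler)

English title: Research and Design of Filtrating Web Crawler
Author: Chen Fen (陈奋)
Degree level: Master's
Discipline: Systems Engineering
Keywords: Web Crawler; Pattern Matching; Classification Methods
Degree year: 2007
Advisor: Wu Shunxiang (吴顺祥)
Discipline code: 081103
Degree-granting institution: Xiamen University (厦门大学)
Thesis submission date: 2007-06-01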

Abstract

A web crawler is a system that automatically retrieves web pages from the Internet. It downloads pages from the World Wide Web on behalf of a search engine and is one of the engine's essential components. The crawler of a general-purpose search engine typically starts from a few seed URLs and crawls the whole Web. The crawler of a domain-specific search engine, in addition to the basic functions of a general-purpose crawler, can also assess links and page content, and is therefore called a focused web crawler. A focused web crawler does not aim for broad coverage; its goal is to fetch pages relevant to one specific topic and so prepare data for topic-oriented user queries. Focused crawling has become an active research topic in search engine technology and plays an important role in domain-specific search.
This thesis studies web crawler technology from a different angle on focusing, namely filtering, and calls this type of crawler a "filtering web crawler" ("filtrating web crawler" in the author's English title). The thesis first introduces the role of web crawlers and the current state of crawler technology. It then studies filtering web crawlers from two aspects. (1) Link filtering: it introduces the concept of a link group, divides link groups into single-pattern and multi-pattern link groups according to website type, and, after analyzing traditional link filtering algorithms, proposes a link filtering algorithm based on rule matching. (2) Content filtering, studied from three angles: (a) a method for identifying the type of a website from the characteristics of its content; (b) a feature-word selection algorithm for web page text based on tag weights, used to build a vector space model (VSM) of the page text, whose similarity to a predefined topic vector is then computed, yielding a topic filtering algorithm based on the vector space model; (c) building on an analysis of the classification process for unstructured data, a topic-category filtering algorithm based on a naive Bayes classifier. Finally, the thesis designs and implements a filtering web crawler system and describes in detail its overall design flow, its architecture, and several of its key modules and key technologies.
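
The abstract gives no implementation details, so the following is only a minimal sketch of how the content filter in (2)(b) could work, not the author's algorithm. Terms found inside more prominent HTML tags receive a larger weight, the top-weighted terms form the page vector, and the page is kept when its cosine similarity to a predefined topic vector reaches a threshold. The tag weights, the `top_k` cutoff, the threshold, the ASCII-only tokenizer, and every function and class name below are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}  # text inside these is never indexed


class WeightedTermParser(HTMLParser):
    """Accumulates a weight per term, based on the tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # only tracked tags are pushed here
        self.term_weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS or tag in SKIPPED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # Use the heaviest enclosing tag, or the default for plain text.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags),
                     default=DEFAULT_WEIGHT)
        # Assumed tokenizer: lowercase ASCII words only. Chinese text
        # would need a word segmenter in place of this regex.
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.term_weights[term] += weight


def page_vector(html, top_k=20):
    """Return the top_k weighted terms of a page as a sparse vector."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return dict(parser.term_weights.most_common(top_k))


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def passes_topic_filter(html, topic_vector, threshold=0.3):
    """Keep the page only if it is similar enough to the topic vector."""
    return cosine_similarity(page_vector(html), topic_vector) >= threshold
```

A crawler built this way would call `passes_topic_filter` on each downloaded page and discard pages that fail. Taking the heaviest enclosing tag, rather than summing the weights of all enclosing tags, is one of several plausible choices; the thesis may combine tag weights differently.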

