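Research and Implementation for Web Spider Based on Web Data Mining

Chinese title: 基于WEB挖掘的网络蜘蛛的研究与实现
Author: Zhan Jingjing (詹晶晶)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: web mining; network spider; search engine
Degree year: 2007
Advisor: Ni Ziwei (倪子伟)
Discipline code: 081202
Degree-granting institution: Xiamen University
Thesis submission date: 2007-05-01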

Abstract

Search engines are a shortcut to retrieving information from the WWW quickly and effectively, and web spider technology is the key component of a search engine. Focusing on the research area of Web information mining and following the overall requirements of a search engine framework, this thesis implements a web spider that roams the Internet and stores the fetched page data in a local database, laying a foundation for a later web search engine.
     The thesis first surveys the classification and components of search engines to gain a preliminary understanding of how they operate internally, and then analyzes in detail the functions a web spider must provide and the search strategies it can use.
     The main research content is as follows:
     First, the working principle of search engines is analyzed, and the first step of a search engine's work, fetching web pages from the Internet, is implemented. Second, the technologies used are described and analyzed, in particular the HTTP protocol, regular expressions, multithreading, and ADO.NET. Building on existing web spider techniques, the spider system is analyzed and designed; it adopts a breadth-first search strategy combined with multithreading, and implements algorithms for crawling pages on both internal and external networks and for analyzing page content.
     The contributions are twofold. First, regular expressions are applied to web content extraction so that the URLs in a page can be extracted quickly and effectively; the fetched page data is then compressed with Zlib and stored in the local database. Second, in the page-reading module, a special strategy for handling faulty URLs is adopted to speed up page retrieval: the server's response time determines whether the function returns the HTTP page, and a URL that times out is placed in an error queue to await the error-handling thread. This lets the spider adapt to network conditions and process URLs on fast-responding servers first, which raises its overall speed.
     Experiments on the campus network, together with reading back the page data stored in the database, verify the feasibility of the spider and show that the system has met its intended goals.
     Finally, the work is summarized and a brief outlook on future work is given.
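
The crawl loop described in the abstract (breadth-first traversal, regular-expression URL extraction, Zlib compression, and an error queue for timed-out URLs) can be sketched as follows. This is only an illustrative sketch in Python, not the thesis's .NET implementation: the names `extract_urls`, `crawl`, and `fetch` are hypothetical, a dictionary stands in for the ADO.NET database, the loop is single-threaded, and the caller-supplied `fetch` is assumed to raise `TimeoutError` when a server responds too slowly.

```python
import re
import zlib
from collections import deque
from urllib.parse import urljoin

# Simplified pattern for href attributes; a real spider would need a more
# tolerant expression (unquoted values, other link-bearing attributes).
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_urls(base_url, html):
    """Return the absolute http(s) URLs found in href attributes, in page order."""
    urls = []
    for raw in HREF_RE.findall(html):
        absolute = urljoin(base_url, raw.strip())
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at seed.

    fetch(url) is supplied by the caller (an assumption of this sketch): it
    returns the page text, or raises TimeoutError for a slow server.
    Returns (store, error_queue); store maps URL -> zlib-compressed page bytes.
    """
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    seen = {seed}              # URLs already queued, to avoid revisiting
    store = {}                 # stand-in for the local database
    error_queue = deque()      # timed-out URLs, left for later error handling

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except TimeoutError:
            error_queue.append(url)
            continue
        store[url] = zlib.compress(html.encode("utf-8"))
        for link in extract_urls(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store, error_queue
```

Passing `fetch` as a parameter keeps the traversal logic separate from networking, so the loop can be exercised against an in-memory set of pages; a multithreaded variant, as used in the thesis, would additionally need to synchronize access to the frontier and the set of seen URLs.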
