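Design and Implementation of a Web Spider for a Vertical Search Engine

English title: Design and Realize of Spider in Vertical Search Engine
Author: Xue Jianchun (薛建春)
Degree level: Master's
Discipline: Detection Technology and Automatic Equipment
Keywords (translated from the Chinese): search engine; web spider; information collection; search strategy
English keywords: searching engine ; Web Spider ; information collection ; searching strategy
Degree year: 2007
Advisor: Duan Hongmei (段红梅)
Discipline code: 081102
Degree-granting institution: China University of Geosciences (Beijing)
Thesis submission date: 2007-05-01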

Abstract

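As the Internet has grown, the Web has become the world's largest information repository and a convenient platform for sharing information and resources. The sheer number of web pages and their dynamic nature, however, force information retrieval systems to be upgraded continually, while users place new demands on the timeliness, relevance, and accuracy of what they retrieve. Domain-specific search systems have emerged in response. Finding useful information on the Web faster and more accurately is a central problem for Web users, and search engine technology addresses it. A search engine has four main parts: a crawler (spider), an indexer, a searcher, and a user interface. The crawler is meant to be intelligent software that automatically traverses web pages, discovers and extracts information, and builds a local index database that serves user queries. A vertical search engine is a specialization and extension of the general search engine: it consolidates one particular category of information in the page repository, extracts the required data field by field, processes it, and returns it to the user in some form. Its main difference from a traditional web search engine is that it extracts the information in web pages in structured form, so the information is already classified at extraction time and better fits query needs.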
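     From a research and design perspective, this thesis analyzes and discusses the technologies behind WWW search engines in detail and reviews their current state and development trends in China and abroad. It examines how a search engine works and the main function of each component. Focusing on two key problems, how to evaluate the topical relevance of a page and how to design an efficient crawling strategy, it proposes a topic-specific crawler for the book domain, which is the core of the vertical search engine. The main body follows the design flow of a search engine. Starting from the general concepts of HTML page parsing, incorporating hyperlink analysis between pages (the HITS algorithm), and following the requirements of a search engine system, it designs a web spider that uses a depth-first search strategy and is suited to collecting domain-specific page information from small and medium-sized websites. The crawling algorithm of the spider is given, and the program is implemented with C++ Builder. To keep the information in the database free of duplicates, a dedicated deduplication program is also designed. On this basis, database indexing is combined with the indexing and retrieval tool Lucene to build the index and rank search results, so that users receive more accurate information that better meets their retrieval needs. The approach can serve as a reference for building other domain-specific search engine systems.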
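The depth-first strategy named in this paragraph can be illustrated with a minimal sketch. It is written in Python for brevity and is not the thesis's C++ Builder implementation, which the abstract does not show. The function name `crawl_depth_first`, the `max_depth` limit, and the toy link table `SITE` are all assumptions introduced for illustration; a real spider would fetch and parse HTML where `links_of` looks up a dictionary.

```python
from typing import Callable, Iterable


def crawl_depth_first(seed: str,
                      fetch_links: Callable[[str], Iterable[str]],
                      max_depth: int = 3) -> list[str]:
    """Visit pages depth-first from `seed`; return URLs in visit order."""
    visited: set[str] = set()
    order: list[str] = []
    # An explicit stack of (url, depth) avoids recursion limits on deep sites.
    stack: list[tuple[str, int]] = [(seed, 0)]
    while stack:
        url, depth = stack.pop()
        if url in visited:
            continue  # cyclic or repeated link: never visit a page twice
        visited.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # depth bound keeps the crawl from running away
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(list(fetch_links(url))):
            if link not in visited:
                stack.append((link, depth + 1))
    return order


# Hypothetical link table standing in for real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],  # the link back to "/" forms a cycle
    "/a1": [],
    "/b": ["/a1"],
}


def links_of(url: str) -> list[str]:
    return SITE.get(url, [])


print(crawl_depth_first("/", links_of))  # → ['/', '/a', '/a1', '/b']
```

The visited set only prevents revisiting the same URL within one crawl; it is separate from the database-level deduplication program the thesis describes.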
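     The final functional tests show that the Spider program's algorithm is accurate and stable and does not exhaust local resources. It supports searching a specified site and searching within a given URL range, and it can automatically search for and download the specified information.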
With the rapid development of the Internet, the Web has become the largest database in the world and provides an ideal place for sharing and communicating information. However, the large volume of web resources and their dynamic characteristics require search systems to be updated continually, and users expect greater timeliness, pertinence, and accuracy when searching for data. Various specialty-based search engines have therefore appeared. How to reach useful information on the Web more quickly and more accurately is one of the problems Web users face, and search engine technology, which consists of a spider, an indexer, a searcher, and a user interface, is the key to solving it. The spider is intended to be intelligent search software that automatically crawls the Web to find and select useful information and sets up a local index database for serving users' searches. The vertical search engine is a specialized type of search engine: it consolidates the information in a certain field from web pages, extracts the necessary data field by field, processes the data, and returns it to the user. The major difference between a vertical search engine and a traditional one is that the vertical engine extracts information from web pages in a structured way, classifying the information as it is extracted so as to better satisfy search requirements.
     The thesis analyzes and discusses WWW search engine technology in detail, together with its current state and future trends in China and abroad. It also explains the working principle of a search engine and the main function of each component. It first identifies two key problems: how to evaluate the topical pertinence of a web page, and how to design an efficient crawling strategy. It then describes a topic-specific crawler for the book domain, which is the core of the vertical search engine. The main part of the thesis covers the whole procedure of designing the engine. Starting from the general concepts of HTML parsing, combined with the analysis of hyperlinks between web pages (the HITS algorithm) and following the requirements of a search engine system, the thesis designs a web spider with a depth-first search strategy suited to collecting information from small and medium-sized websites. The crawling algorithm of the web spider is presented, and the program is implemented with C++ Builder. In addition, to avoid duplicated data, a dedicated duplicate-checking program has been designed to guarantee the accuracy of the data. On this basis, database indexing is combined with the indexing and search tool Lucene to build the index and rank the search results, so as to offer accurate information and better satisfy users' requirements. This approach can serve as guidance for setting up other specialized search engine systems.
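The abstract names HITS as the link-analysis method but does not say how the spider uses its scores. As background only, the textbook hub and authority iteration can be sketched as below. The graph `LINKS`, the function name `hits`, and the iteration count are assumptions made for illustration and do not come from the thesis.

```python
import math


def hits(graph: dict[str, list[str]], iterations: int = 50):
    """Return (hub, authority) score dicts for a directed link graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of pages the node links to.
        hub = {n: sum(auth[d] for d in graph.get(n, [])) for n in nodes}
        # Normalize by the Euclidean norm so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: three listing pages link to one book page.
LINKS = {
    "p1": ["book"],
    "p2": ["book"],
    "p3": ["book", "other"],
}

hub, auth = hits(LINKS)
print(max(auth, key=auth.get))  # → book
print(max(hub, key=hub.get))    # → p3
```

In a topic-specific crawler, scores of this kind are one possible way to rank which links to follow first: a page that many hubs point to is treated as a likely authority on the topic.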
     The results of the software function tests show that the algorithm of the Spider program is accurate and stable and carries no risk of exhausting local resources. It supports searching a specified site or searching within a given URL range, and it can automatically search for and download the specified information.

