Research and Implementation of an Intelligent Web Information Search Engine

English title: Research and Implementation of Intellectualized Web Information Search Engine
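Author: Li Jianping (李建平)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 网络机器人 ; 搜索引擎 ; 信息检索 ; 元搜索 ; 更新周期
English keywords: Web Robot ; Search Engine ; Information Retrieval ; Meta Search ; Update Cycle
Degree year: 2003
Advisor: Ma Ruimin (马瑞民)
Discipline code: 081203
Degree-granting institution: Daqing Petroleum Institute (大庆石油学院)
Thesis submission date: 2003-02-20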

Abstract

Information on the Internet is vast and scattered, and search engines have become an essential tool for browsing the Web and finding information. A search engine is a Web site that actively gathers information on the Internet, indexes it automatically, and offers a query service. Such sites use Web search software (also called Web robots) or site registration to collect pages from a large number of Web sites, then process those pages to build a local database. When a user enters a keyword, the site returns the addresses of all pages that contain the keyword, together with links to them.
    Many search engine systems already exist on the Internet, but they all have shortcomings in functionality and performance, particularly in recall and precision. Studying search engine technology and developing new retrieval tools that help people conveniently find accurate information on the Web is therefore a pressing problem.
    This thesis surveys search engine theory and technology, analyzes the characteristics of Web pages, and compares existing search engine systems. On that basis it implements two types of search engine system: a directory-based system and a robot-based comprehensive system. The two are linked and complement each other, and together they form an intelligent Web information search engine system.
    The system currently runs on an experimental basis with good results. It has met its intended goals of study and practice, and it lays a foundation for further research on search engine technology and further development of search engine systems.
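The collect, index, and query flow described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the page URLs, the page text, and the function names `build_index` and `search` are hypothetical, and the pages are assumed to have been fetched already. The sketch also splits text on word characters, which suits English only and would not segment Chinese text.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, keyword):
    """Return the sorted URLs of the pages that contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical pages standing in for documents a robot has already collected.
pages = {
    "http://example.com/a": "Search engine robots collect Web pages",
    "http://example.com/b": "A directory lists Web sites by topic",
}

index = build_index(pages)
print(search(index, "web"))
```

The lookup table built here is an inverted index: the work of scanning page text is done once at indexing time, so a keyword query is a single dictionary lookup.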

