Design and Implementation of a Chinese Intelligent Search Engine (中文智能搜索引擎的设计与实现)

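English title: The Design and Application of Chinese Intelligent Search Engine
Author: Gao Qingxia (高清霞)
Degree level: Master's
Discipline: Computer Applications
Keywords (translated from Chinese): WWW technology ; Search engine ; Artificial intelligence ; Spider system ; ODBC ; Multithreading
Keywords (English): WWW technology ; Search engine ; Artificial intelligence ; Spider ; ODBC ; Multithread
Degree year: 2000
Supervisor: Zhang Shujie (张书杰)
Discipline code: 081203
Degree-granting institution: Beijing University of Technology (北京工业大学)
Thesis submission date: 2000-04-01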

Abstract

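With the rapid spread and growth of the Internet, search engines have become an
    indispensable tool for Internet users. By analyzing the characteristics and current
    state of research of search engines in China and abroad, this thesis points out the
    necessity and importance of further research on Chinese intelligent search engines.
     The thesis systematically describes how the “首信” (“China Info”) search engine was
    built and reveals how a Web search engine works behind the scenes.
     The “China Info” search engine is a multipurpose, adjustable Chinese intelligent
    search engine for the Internet. It uses a browser/server (B/S) architecture in which
    the browser and the server cooperate to make the service more intelligent, and it
    applies natural language processing to page content to improve retrieval performance.
     The “China Info” search engine consists mainly of a distributed parallel Spider, a
    full-text retrieval database, an intelligent information processing module, CGI, and
    an intelligent browser (Smart Browser). It supports full-text retrieval, corpus-based
    concept retrieval, and knowledge-base-based concept retrieval.
     Among these modules, the author focuses on the design and implementation of Spider,
    the information-gathering tool of the “China Info” search engine.
     A Spider (also called a robot or WebAgent) is the data source of an Internet search
    engine. It determines whether the system's content is rich and whether information
    is updated promptly. The “China Info” Spider uses a Client/Server architecture and is
    a distributed parallel search system. It consists of the server-side Task Manager
    (TM) and the client-side Gather Agent (GA).
     TM is a TCP/IP-based program implemented in Visual C++. Its main functions are:
    1) communicating with each GA over TCP/IP (sockets) and the system's communication
    primitives, and maintaining the linear list of GAs connected to it; 2) scheduling
    search tasks, sending them to GAs with a low task load (including no load);
    3) controlling the search strategy and interacting with the user.
     GA is implemented with multithreading. Its main functions are: 1) communicating
    with TM over TCP/IP (sockets) and the system's communication primitives, and
    reporting its own status; 2) receiving search tasks from TM, that is, the ROOT_URL
    table; 3) fetching Internet web page information with a breadth-first algorithm;
    4) collecting web pages and saving them to the database in an appropriate form.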

English abstract

With the development and popularization of the Internet, search engines have become an
    essential tool for Internet users. This paper analyzes the features and current
    research status of search engines at home and abroad, and points out the necessity
    and importance of research on Chinese intelligent search engines.
     This paper systematically introduces the design and development of the “China
    Info” search engine and explains how a search engine works behind the scenes.
     The “China Info” search engine is a multipurpose, adjustable Chinese intelligent
    search engine. Built on a Browser/Server architecture, it improves its intelligence
    through cooperation between client and server. It also improves search performance
    by applying natural language processing to the content of web pages.
     The “China Info” search engine consists of a distributed parallel spider, a
    full-text search database, an intelligent information processing module,
    CGI, a smart browser, and other modules. It supports full-text search, corpus-based
    concept search, and knowledge-base-based concept search.
     The design and development of the “China Info” spider is the focus of
    this paper.
     The spider is the data source of an Internet search engine. It determines whether
    the content is abundant and whether information is updated in time. The “China Info”
    spider is a distributed parallel system with a Client/Server architecture. It is
    composed of the Task Manager (TM), the server program, and the Gather Agent (GA),
    the client program.
     TM is a TCP/IP-based program developed with Visual C++. It has three main
    functions: 1) communicating with the GAs via TCP/IP (sockets) and communication
    primitives, and maintaining the GA host list; 2) dispatching search tasks, sending
    each task to a GA with a low load or no load; 3) controlling the search strategy
    and interacting with users.
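
The dispatch rule in function 2 above (send each task to a GA with a low or zero load) can be illustrated with a small sketch. This is not the thesis code: the original TM was written in Visual C++ and talks to its GAs over sockets, while the Python below keeps everything in memory for brevity. The names `AgentRecord`, `TaskManager`, `register`, `dispatch`, and `report_done` are hypothetical, and measuring load as the number of unfinished tasks is an assumption, since the abstract does not say how load is measured.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """One entry in the TM's GA list (hypothetical fields)."""
    name: str
    pending: list = field(default_factory=list)  # ROOT_URL tasks not yet finished

    @property
    def load(self) -> int:
        # Assumption: load is simply the number of unfinished tasks.
        return len(self.pending)


class TaskManager:
    """Toy stand-in for TM: tracks agents and picks the least-loaded one."""

    def __init__(self):
        # A real TM would fill this list from incoming socket connections.
        self.agents = []

    def register(self, name):
        record = AgentRecord(name)
        self.agents.append(record)
        return record

    def dispatch(self, root_url):
        if not self.agents:
            raise RuntimeError("no Gather Agent is connected")
        # min() returns the first agent with the smallest load, so idle
        # agents are preferred and ties go to the earliest registered agent.
        target = min(self.agents, key=lambda agent: agent.load)
        target.pending.append(root_url)
        return target.name

    def report_done(self, name, root_url):
        """Record that the named GA has finished a task."""
        for agent in self.agents:
            if agent.name == name:
                agent.pending.remove(root_url)
                return
        raise KeyError(name)
```

With two registered agents, three consecutive `dispatch` calls alternate between them, and an agent that reports its tasks done becomes the preferred target again.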
     GA is implemented with multithreading. Its main functions are:
    1) communicating with TM via TCP/IP (sockets) and communication primitives,
    and reporting its status; 2) receiving search tasks from TM, namely the ROOT_URL
    list; 3) using a breadth-first strategy to fetch web page information; 4) gathering
    web pages and saving them in the database in a proper form.
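
The breadth-first strategy in function 3 above can be sketched as follows. This is a minimal single-threaded illustration in Python, not the multithreaded GA from the thesis. The function name `breadth_first_crawl`, the `fetch_links` callback (a placeholder for downloading a page and extracting its links), the `max_pages` limit, and the toy link graph are all assumptions made for the example.

```python
from collections import deque


def breadth_first_crawl(root_urls, fetch_links, max_pages=100):
    """Visit pages level by level, starting from the ROOT_URL list.

    fetch_links(url) must return the URLs linked from that page; it stands
    in for the real download-and-parse step.
    """
    queue = deque()
    seen = set()
    for url in root_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order is what makes the walk breadth-first
        visited.append(url)    # a real GA would save the page to the database here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used instead of real HTTP requests.
toy_web = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["root"],
}
order = breadth_first_crawl(["root"], lambda url: toy_web.get(url, []))
print(order)  # ['root', 'a', 'b', 'c', 'd']
```

The `seen` set keeps the crawl from revisiting pages when links form a cycle, as with `d` pointing back to `root` in the toy graph.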
