Research on the Optimization Techniques of Web Information Searching and Application Design

Original title: Web信息检索及应用设计优化技术研究
Author: Zhang Hongsen (张宏森)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: RDF ; Search Engine ; Keywords Extension ; Web Application Development ; Web Page Browsing
Degree year: 2002
Advisor: Zhu Zhengyu (朱征宇)
Discipline code: 081202
Degree-granting institution: Chongqing University
Thesis submission date: 2002-05-10
Defense committee chair: Zhang Weiqun (张为群)

Abstract

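With the continuing development of information technology, information resources on the Web are growing at an unprecedented rate. Faced with this vast ocean of knowledge, users often feel helpless when looking for the information they need. Because they are convenient and fast, search engines have gradually become the main tool users rely on for information retrieval on the Web.
    First, to address the shortcomings of traditional search engines in precision, recall, and ease of use, the author carefully analyzed the retrieval methods and basic structure of Web information retrieval systems and completed the following research work:
    To improve search engine performance, the author divides Web resources into three categories: Web page resources, multimedia resources, and Web site resources. Following the RDF resource metadata specification provided by the W3C, the thesis gives XML metadata description files for the three categories of resources, together with a method for generating them automatically. Storing a resource's metadata in place of the resource itself greatly reduces the amount of data stored in the search engine, makes retrieval more convenient, and supports retrieval across multiple kinds of resources.
    Because of limitations in its structure and in the data it stores, an ordinary search engine cannot adequately solve the problems that arise in data collection, data storage, information querying, and the ranking of query results. To improve on ordinary search engines structurally, the author designed the basic structure of a search engine based on RDF metadata.
    Ordinary search engines generally collect information in a centralized way, which is inferior to distributed collection in speed and performance. The author describes the distributed information collection method used in the RDF-metadata-based search engine. Combining distributed collection with resource metadata techniques can greatly reduce information traffic on the network.
    After observing and analyzing how a large number of users search for information with search engines, the author proposes a retrieval mode based on keyword expansion, gives a method for expanding keywords from the resource metadata database, and designs the interface of a search engine that adopts this mode. This retrieval mode better matches users' search habits and guides them to state their information needs accurately and completely.
    In addition, current Web application design and development organizes information mainly with the Web page as the basic unit. This approach makes development inefficient and later modification and maintenance very laborious. To address these problems, the author proposes a modular Web page design and browsing technique. The design technique allows information to be organized and maintained efficiently and improves the efficiency of Web application design and development. During browsing, the more important parts of a page are presented to the user first, which improves browsing performance.
    The thesis also analyzes the problems in browsing complex Web pages, proposes the new idea of organizing a complex page into multiple patterns according to its content, and introduces a pattern-based page browsing technique that noticeably improves browsing speed and effectively reduces network transfer time.
    The research in this thesis has academic significance and practical reference value for further improving Web performance and optimizing retrieval techniques.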
With the development of information technology, information resources on the Web are increasing at an unprecedented speed. Faced with this huge ocean of information, users are often overwhelmed when searching for information on the Web. Because they are convenient and fast, search engines have become a main tool for information searching.
    First, regarding the shortcomings of traditional search engines in precision, recall, and convenience, the author has carefully analyzed the search methods and basic structure of Web information searching systems, and then completed the following tasks:
    To enhance search engine performance, the author has categorized Web resources into three types: Web page resources, multimedia resources, and Web site resources. Based on the RDF metadata standard supplied by the W3C, the author presents XML documents describing the metadata of the three resource types and introduces a method for generating them automatically. Storing a resource's metadata instead of the resource itself decreases the quantity of data in the database, makes information searching more convenient, and supports searching across multiple forms of resources.
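    As a rough illustration of what a metadata record for a Web page resource might look like, the sketch below builds a small RDF/XML document with Python's standard library. The thesis's actual schema is not given in this abstract, so the choice of Dublin Core properties (`dc:title`, `dc:description`, `dc:subject`), the function name `page_metadata_rdf`, and the example URL are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed vocabulary: the abstract does not say which properties the thesis uses.
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def page_metadata_rdf(url, title, keywords, description):
    """Build an RDF/XML metadata record for one Web page (hypothetical fields)."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:about identifies the resource that the record describes.
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": url})
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}description").text = description
    # One dc:subject element per keyword.
    for kw in keywords:
        ET.SubElement(desc, f"{{{DC_NS}}}subject").text = kw
    return ET.tostring(root, encoding="unicode")


record = page_metadata_rdf(
    "http://www.example.org/index.html",
    "Example page",
    ["RDF", "search engine"],
    "A short summary stored in place of the full page.",
)
print(record)
```

    The record holds only a title, a summary, and a few keywords, which is the sense in which storing metadata rather than the resource itself can reduce the amount of stored data.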
    Because of limits in structure and data storage, a common search engine cannot adequately solve the problems in data collection, data storage, information searching, and the sorting of search results. To improve the structure of common search engines, the author has designed the structure of a search engine based on RDF metadata.
    Common search engines gather information using a centralized method. Centralized information gathering is not as good as distributed information gathering in speed and performance. In the paper, the author introduces the distributed information gathering method used in the search engine based on RDF metadata. Combining distributed information gathering with the resource metadata technique can lighten the burden on the network.
    The author observed and analyzed the patterns in which a large number of users search for information with search engines, then presented a new search pattern based on keyword extension, gave a method for extending keywords based on the metadata database, and designed the interface of a search engine that uses this pattern. This search pattern better suits users' habits and can lead users to state their information requirements precisely and completely.
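    The abstract does not describe how keywords are extended from the metadata database. One plausible reading, sketched below purely as an assumption, is to suggest terms that frequently appear alongside the user's keyword in the stored metadata records. The co-occurrence approach, the function names `build_cooccurrence` and `expand_keyword`, and the toy records are all hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(records):
    """Count how often each ordered pair of keywords shares a metadata record."""
    cooc = defaultdict(Counter)
    for keywords in records:
        # set() ignores duplicate keywords within a single record.
        for a, b in permutations(set(keywords), 2):
            cooc[a][b] += 1
    return cooc


def expand_keyword(cooc, keyword, limit=3):
    """Return up to `limit` keywords that most often co-occur with `keyword`."""
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(cooc.get(keyword, {}).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:limit]]


# Toy metadata database: one keyword list per resource (made-up data).
records = [
    ["rdf", "metadata", "xml"],
    ["rdf", "metadata", "search engine"],
    ["rdf", "xml"],
    ["search engine", "ranking"],
]
cooc = build_cooccurrence(records)
print(expand_keyword(cooc, "rdf"))  # ['metadata', 'xml', 'search engine']
```

    A search interface could present such suggestions next to the query box so that the user can refine the request before submitting it.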


    Moreover, in current Web application development, information is organized mainly by Web page. This method is inefficient during development and requires a large amount of work to maintain the application. To address these problems, the author has proposed a block-based design for Web pages. This technique organizes and maintains information more efficiently and thereby improves the efficiency of Web application development. During browsing, it improves performance by making the important parts of a Web page appear as early as possible.
    The paper analyzes the problems in browsing complicated Web pages, presents the new idea of organizing a Web page in multiple patterns according to its content, and introduces a technique for browsing a Web page based on these patterns. This technique can greatly raise browsing speed and reduce the transfer time of Web pages.
    The research in the paper is meaningful and valuable, to some degree, for enhancing Web performance and optimizing information searching.

