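Research on a Meta Search Engine Based on XML/Java

Author: 何玉菁
Degree level: Master's
Discipline: Computer Application Technology
Keywords: XML; Java; Meta Search; Web Mining; MySearch Model
Degree year: 2004
Advisor: 傅秀芬
Discipline code: 081203
Degree-granting institution: Guangdong University of Technology (广东工业大学)
Thesis submission date: 2004-05-01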

Abstract

A meta search engine is often described as a search engine built on top of other search engines. The user submits a query once; the meta search engine transforms it, forwards it to several pre-selected independent search engines, and presents the combined results to the user in a single uniform format. Java, a high-level programming language developed by Sun Microsystems, offers a cross-platform solution and supports distributed processing environments, which has made it the language of choice for working with XML (eXtensible Markup Language). XML defines data structures in an open, self-describing way: it describes the content of the data and makes its structure explicit at the same time. Because presentation is separated from content, data defined in XML can be rendered in different ways, so that it is displayed in the form that suits it best.
This thesis reviews the history of search engines and meta search engines, discusses the basic working principle of meta search engines and classifies them, compares their advantages over independent search engines, examines several of their key technologies, and analyzes the problems they face and their likely future development. The author proposes a meta search engine model, MySearch, which consists of three parts: a user interface agent, a retrieval agent, and a query database. Building on this model, the thesis explores the conversion of HTML data to XML data and studies how Java and XML can be combined with JDBC, that is, with a database. A meta search engine based on XML and Java is then built using Java Servlets and XML. As a form of data representation, XML offers significant advantages for data retrieval and mining on the Web.
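
The workflow described in the abstract (submit one query, forward it to several engines, merge the results into one uniform format) can be illustrated with a minimal Java sketch. This is not the MySearch implementation: the names `MetaSearchSketch`, `Hit`, `SearchEngine`, and `StubEngine` are hypothetical, the engines return canned data instead of calling real search services, and dropping duplicate URLs is an assumed merge policy that the abstract does not specify. Only standard JDK XML APIs are used.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MetaSearchSketch {

    /** One result in the uniform format shared by all engines (hypothetical). */
    static final class Hit {
        final String title, url, engine;
        Hit(String title, String url, String engine) {
            this.title = title; this.url = url; this.engine = engine;
        }
    }

    /** Hypothetical adapter around one independent search engine. */
    interface SearchEngine {
        List<Hit> search(String query);
    }

    /** Stand-in engine that ignores the query and returns canned results. */
    static final class StubEngine implements SearchEngine {
        final String name;
        final List<Hit> canned;
        StubEngine(String name, List<Hit> canned) { this.name = name; this.canned = canned; }
        public List<Hit> search(String query) { return canned; }
    }

    /** Forwards the query to every engine and merges the hits, keeping the first hit per URL. */
    static List<Hit> metaSearch(String query, List<? extends SearchEngine> engines) {
        Map<String, Hit> byUrl = new LinkedHashMap<>(); // preserves arrival order
        for (SearchEngine engine : engines) {
            for (Hit hit : engine.search(query)) {
                byUrl.putIfAbsent(hit.url, hit);
            }
        }
        return new ArrayList<>(byUrl.values());
    }

    /** Serializes the merged hits as one XML document, separate from any display format. */
    static String toXml(String query, List<Hit> hits) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("results");
        root.setAttribute("query", query);
        doc.appendChild(root);
        for (Hit hit : hits) {
            Element item = doc.createElement("hit");
            item.setAttribute("engine", hit.engine);
            Element title = doc.createElement("title");
            title.setTextContent(hit.title);
            Element url = doc.createElement("url");
            url.setTextContent(hit.url);
            item.appendChild(title);
            item.appendChild(url);
            root.appendChild(item);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        List<SearchEngine> engines = Arrays.asList(
            new StubEngine("e1", Arrays.asList(
                new Hit("XML basics", "http://example.com/xml", "e1"),
                new Hit("Java and XML", "http://example.com/java-xml", "e1"))),
            new StubEngine("e2", Arrays.asList(
                new Hit("Java and XML", "http://example.com/java-xml", "e2"),
                new Hit("Meta search", "http://example.com/meta", "e2"))));
        System.out.println(toXml("xml java", metaSearch("xml java", engines)));
    }
}
```

Because the merged results are held as XML rather than as HTML, the same document could later be styled for display or stored in a database, which is the separation of content and presentation that the abstract attributes to XML.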

