Design and Implementation of an Information Document Search Engine System Based on the JavaEE Platform and Lucene

Original title (Chinese): 基于JavaEE平台与Lucene的信息文档搜索引擎系统的设计与实现
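Author: Gui Xujun (桂许军)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: search engine; Lucene; Ajax; web crawler; JavaEE
Degree year: 2011
Advisor: He Feng (何枫)
Discipline code: 081203
Degree-granting institution: Southwest Jiaotong University (西南交通大学)
Thesis submission date: 2011-05-01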

Abstract

With the rapid development of the Internet, network applications now reach into every aspect of large enterprises and document institutions, and Internet use generates an enormous volume of data and information at every moment. Enterprises and institutions also produce large numbers of information documents in the course of their own business processes. A large share of these documents are heterogeneous, which makes them very hard to retrieve and manage. An efficient retrieval system is needed to maximize the sharing and utilization of these information resources.
Drawing on the characteristics of industry-specific search engines and on current practical requirements, this thesis develops an information document search engine system on the JavaEE platform. The system is written in Java, uses a multi-tier architecture informed by design patterns, and integrates currently popular technologies such as Lucene and Ajax.
The thesis first introduces the research background and significance of the topic and analyzes the current state and future direction of information document retrieval. It then describes and analyzes the relevant technologies and basic principles used by the search engine system. Next, it gives a preliminary analysis of the system's overall requirements, including its functional and data requirements, from the perspectives of information collection, index building, and information retrieval. Because the system is user-facing, object-oriented UML (Unified Modeling Language) is used for the analysis, yielding the system's use case diagrams and overall architecture diagram. Building on the requirements analysis, the thesis then partitions and designs the system's core modules and functions, and uses flowcharts to describe the processing flow of each core module in detail. UML is also used to design static structure diagrams for each module, and the system database is designed from these diagrams together with the object entities. Finally, each module is designed and implemented in detail, and sequence diagrams and diagrams of the running modules are given.
The system has a simple, intuitive user interface and is easy and convenient to use, and it meets users' retrieval needs well.
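The abstract names index building and information retrieval as two of the system's core stages. As a rough illustration of how those two stages relate, the following is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's API and not the thesis's implementation; the class name `InvertedIndexSketch`, the methods `add` and `search`, the tokenization rule, and the sample documents are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index. Illustrative only: not Lucene's API, not the thesis's code. */
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Splits text into lowercase terms on any run of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Index building: record docId under every term that occurs in the text. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Retrieval: ids of documents containing every query term (AND), in ascending order. */
    List<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the candidate set
            } else {
                result.retainAll(docs);         // later terms narrow it
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Hypothetical sample documents, standing in for collected files.
        index.add(1, "Lucene builds an inverted index over documents");
        index.add(2, "A web crawler collects documents");
        index.add(3, "Ajax updates the search page without a reload");
        System.out.println(index.search("documents"));
        System.out.println(index.search("crawler documents"));
    }
}
```

A production library such as Lucene adds analyzers, relevance scoring, and on-disk index segments on top of this basic term-to-document mapping; the sketch only shows why lookups stay fast once documents have been indexed.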

