The Study of the Framework of Distributed Intelligent Search Engine Based on Map/Reduce

Original title: 基于Map/Reduce的分布式智能搜索引擎框架研究
Author: 付志超
Degree level: Master's
Discipline: International Trade
Keywords: Search Engine ; Distributed Computing ; Map/Reduce ; HDFS
Degree year: 2008
Advisor: 聂规划
Discipline code: 020206
Degree-granting institution: Wuhan University of Technology (武汉理工大学)
Thesis submission date: 2008-11-01

Abstract

With the rise of the search economy, people pay increasing attention to the performance, technology, and daily traffic of the world's major search engines. Enterprises decide whether to place advertisements according to a search engine's popularity and daily traffic; ordinary Internet users choose their preferred engine according to its performance and technology; and technical staff take representative search engines as objects of study. The rise of the search economy has once again demonstrated the enormous business opportunities the Internet holds. Without search, the Internet would be reduced to hollow, disordered data and a gold mine that could only be worked with great effort. Information on the Internet now grows exponentially every day, and traditional centralized search engines cannot cope with processing and storing data at this scale. In addition, traditional search engines generally rely on keyword matching and cannot understand the user's search intent, which makes it difficult for users to find the information they actually need. Distributed, intelligent search engines are therefore the direction of future development.
     From the perspective of research and design, this thesis analyzes and discusses in detail the theory and technology related to distributed intelligent search engines. The study of a Map/Reduce-based distributed intelligent search engine framework is divided into three closely related levels: the theory and methods of distributed parallel computing, the principles of search engines, and distributed intelligent search engines themselves. The main content of the thesis is as follows:
     First, the thesis reviews the current state of search engine development in China and abroad, the existing problems, and the development trends. It then analyzes the working principle of a search engine and the main function of each component, and studies distributed computing theory, grid computing, cloud computing, and the Map/Reduce distributed computing model. The open-source search engine toolkit Lucene and the open-source distributed computing framework Hadoop are also analyzed in detail.
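     To make the Map/Reduce model concrete, the following is a minimal single-process sketch of how an inverted index could be built with a map step, a shuffle step, and a reduce step. It is an illustration only, not the thesis's Hadoop implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`) and the toy documents are assumptions introduced here, and whitespace tokenization stands in for real Chinese word segmentation.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Reduce: merge the document ids of one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Toy data (hypothetical), used only to exercise the sketch.
docs = {1: "distributed search engine", 2: "intelligent search"}
index = build_inverted_index(docs)
print(index["search"])
```

     In a real cluster the map and reduce calls would run on different nodes and the shuffle would be handled by the framework; the sketch only shows how the work is split between the two user-supplied functions.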
     Building on the Map/Reduce distributed computing model and with the help of a semantic dictionary, the thesis studies a distributed intelligent search engine system, and designs and implements IEBSou, a distributed intelligent search engine based on Map/Reduce. It focuses on the implementation of the IEBSou system framework, giving the relationships among the system's modules and analyzing the principle and idea behind each module. The Map/Reduce base framework of IEBSou is designed; a unified document-processing framework is designed in combination with Lucene; and the recognition of personal names and new words in Chinese word segmentation is studied. The thesis proposes a web page de-duplication algorithm based on Map/Reduce and an algorithm that generates recommended search terms by semantic association through the construction of concept sets. With the semantic dictionary, the concepts in a user's search keywords are semantically expanded into a concept set, so that the system can better understand the user's search intent and improve recall and precision.
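     The abstract does not describe how the Map/Reduce-based de-duplication algorithm works, so the sketch below should be read as one simple possibility rather than the thesis's method. It assumes exact-duplicate detection after whitespace and case normalization, using an MD5 content fingerprint as the map key; near-duplicate detection would need a different fingerprint. All names (`normalize`, `map_page`, `reduce_group`, `deduplicate`) and the example URLs are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(text.lower().split())


def map_page(url, text):
    """Map: emit (content fingerprint, url) for one crawled page."""
    fingerprint = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
    yield fingerprint, url


def reduce_group(fingerprint, urls):
    """Reduce: keep the first URL (in sorted order), mark the rest as duplicates."""
    ordered = sorted(urls)
    return ordered[0], ordered[1:]


def deduplicate(pages):
    """Simulate map, shuffle, and reduce over a dict of url -> page text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        for fingerprint, page_url in map_page(url, text):
            groups[fingerprint].append(page_url)
    kept, dropped = [], []
    for fingerprint, urls in groups.items():
        keep, duplicates = reduce_group(fingerprint, urls)
        kept.append(keep)
        dropped.extend(duplicates)
    return sorted(kept), sorted(dropped)


# Toy pages (hypothetical): the first two differ only in case and spacing.
pages = {
    "a.example/1": "Hello  World",
    "b.example/1": "hello world",
    "c.example/2": "Other page",
}
kept, dropped = deduplicate(pages)
print(kept, dropped)
```

     Keying the map output by fingerprint means that all copies of a page reach the same reducer, so each group can be resolved independently and in parallel.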

