Research and Implementation of Distributed Chinese Full-Text Retrieval Technology
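Abstract
With the growth of the Internet, search has become one of the main ways to obtain information online. Through Internet search engines such as GOOGLE and Baidu, people can conveniently find the information they need in the vast expanse of the Internet. GOOGLE, for example, has collected hundreds of millions of web pages, with storage capacity at the terabyte scale, and people retrieve the information they need from it by keyword. Search engines of this kind are usually called general-purpose search engines: the data they collect are Internet web pages, their intended users are all Internet users worldwide, and their service is complete once keyword search results have been returned to the user.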
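     On the other hand, the wave of informatization within enterprises and organizations has generated a large volume of information content. Most of this data is stored as files, e-mail, images, and other unstructured forms scattered throughout the enterprise's computer systems, and traditional structured databases cannot meet the storage, retrieval, and processing requirements of such unstructured information. A specific kind of search engine has emerged for this class of application: the enterprise search engine. It typically does not stop at keyword search, but also provides post-processing and mining such as classification and clustering.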
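     Full-text retrieval technology is the core component of an enterprise search engine. This thesis gives a systematic account of it, examines in depth the techniques and basic principles of full-text retrieval, analyzes in detail the structure of a full-text retrieval system and the organization, database structure, and creation process of the index, and proposes a method for optimizing index creation. It focuses on studying and summarizing retrieval techniques, ranking algorithms, and Chinese word segmentation techniques, and, to address the shortcomings of dictionary-based segmentation, uses an improved matching algorithm based on a triple-array Trie index tree, fully realizing the principle of "intelligent segmentation". It then discusses distribution strategies for a distributed index and the query strategies built on the distributed index data. Finally, the thesis describes in detail the features and implementation of the system's full-text search engine and implements the specific functions according to the design; in actual testing the system performed well.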
With the development of the Internet, search engines have become a major means of accessing information online. Internet search engines such as GOOGLE and Baidu make it convenient to find needed information on the vast Internet. GOOGLE, for example, collects hundreds of millions of web pages, with storage capacity at the terabyte level. People search for the information they need by entering keywords. This kind of search engine is often referred to as a general search engine. It collects Internet pages and serves Internet users around the world, and its service consists of providing search results to the user.
    
     On the other hand, the wave of information technology adoption within enterprises and organizations has produced a great deal of information content. Most of this data is stored as files, e-mail, photographs, and other unstructured forms in every corner of enterprise computer systems. Traditional structured databases cannot meet the storage, retrieval, and processing requirements of this unstructured information. In response to this type of application, a specific kind of search engine has emerged: the enterprise search engine. It is often not limited to keyword search and also provides classification, clustering, and data mining.
     Full-text search technology is the core of an enterprise search engine. This thesis describes it systematically, explores full-text search techniques and their basic principles in depth, analyzes the structure of the retrieval system and the organization, database structure, and construction process of the index, and proposes a method for optimizing index creation. It also studies and summarizes search techniques, ranking algorithms, and Chinese word segmentation techniques. To address the shortcomings of dictionary-based segmentation, it uses an improved matching algorithm based on a three-array Trie index tree, fully realizing the principle of "intelligent segmentation". It then discusses the index distribution strategy and the query strategy. Finally, the thesis describes the features and implementation of the system's full-text search engine; the designed functions were implemented, and actual testing showed good results.
