Research and Implementation of an Open High-Performance Platform of Full-Text Retrieval

Original title: 一种开放式高性能全文检索平台的研究与实现
Author: 洪田玉 (Hong Tianyu)
Thesis level: Master's
Discipline: Computer System Architecture
Keywords: Full-text retrieval; Chinese word segmentation; Inverted index; Index maintenance; Search engine
Degree year: 2009
Advisor: 曾志文 (Zeng Zhiwen)
Discipline code: 081201
Degree-granting institution: 中南大学 (Central South University)

Abstract

The rapid growth of information has driven the rapid development of search engines. General-purpose search engines such as Google and Baidu have been very successful. However, their technology is kept strictly confidential, and developers cannot seamlessly embed such large general-purpose engines into their own applications; in addition, open-source search engines with good Chinese support are lacking. This thesis therefore studies and implements a new Chinese full-text retrieval platform that offers high performance and a flexible architecture. It can be applied to practical domains with dynamic data, and it can also serve as a basis for experimental information retrieval systems. The main contributions are as follows:
     1. To address the low efficiency and poor flexibility of the traditional forward maximum matching (MM) segmentation algorithm, an improved algorithm is proposed. It uses a dictionary structure based on a hash table and a trie, which improves segmentation speed by about 200%. It also removes the fixed maximum word length that the traditional algorithm requires, which makes it more flexible.
     2. Because traditional index structures are poorly suited to dynamic data environments, a new index construction scheme is proposed. It comprises: (1) a hierarchical inverted index organization with chained storage, which accommodates dynamic index growth; (2) an index merging strategy based on a dynamic balanced tree; (3) a configurable, bounded exponential allocation strategy, which improves the utilization and allocation efficiency of index memory; and (4) a d-gap based delta compression algorithm, which reduces index file size by 75% and thereby reduces the number of I/O operations and improves system performance.
     3. Building on the segmentation algorithm and index construction scheme above, and using object-oriented design in C++ together with design patterns such as the factory pattern, a full-text retrieval platform with a flexible and extensible architecture is designed and implemented. The platform consists of an index subsystem, a retrieval subsystem, a storage subsystem, a plug-in management subsystem, and a memory management component.
     4. The platform is used to design and implement a practical commercial search engine system that lets users search network monitoring data, which records users' Internet access behavior. It builds large-capacity indexes over monitoring data of many types (plain text, HTML, email, Office documents, PDF documents, and so on) and provides high-performance queries based on content classification. The system has been in production use for more than half a year with notable results, which supports the efficiency of the retrieval platform.
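
The segmentation idea in item 1 can be illustrated with a minimal C++ sketch. The abstract does not give the thesis's actual dictionary layout, so everything here is an assumption made for illustration: the names TrieNode, Dictionary, longest_match, and segment are hypothetical, each trie node keeps its children in a hash map, and text is held as UTF-32 so that one element is one character. The point it shows is that walking the trie finds the longest dictionary word at each position without a preset maximum word length.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Trie node whose children are stored in a hash map keyed by code point
// (an assumed layout; the thesis's real structure is not specified here).
struct TrieNode {
    bool is_word = false;
    std::unordered_map<char32_t, std::unique_ptr<TrieNode>> children;
};

class Dictionary {
public:
    // Insert one word, creating trie nodes along its path as needed.
    void add(const std::u32string& word) {
        TrieNode* node = &root_;
        for (char32_t ch : word) {
            auto& child = node->children[ch];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->is_word = true;
    }

    // Length of the longest dictionary word starting at text[pos], or 0.
    // The walk stops only when the trie has no matching child, so no
    // fixed maximum word length is needed.
    std::size_t longest_match(const std::u32string& text, std::size_t pos) const {
        assert(pos <= text.size());
        const TrieNode* node = &root_;
        std::size_t best = 0;
        for (std::size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->is_word) best = i - pos + 1;
        }
        return best;
    }

private:
    TrieNode root_;
};

// Forward maximum matching: at each position take the longest dictionary
// word; a character that starts no word is emitted as a one-character token.
std::vector<std::u32string> segment(const Dictionary& dict, const std::u32string& text) {
    std::vector<std::u32string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = dict.longest_match(text, pos);
        if (len == 0) len = 1;
        tokens.push_back(text.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```

With the entries 全文, 全文检索, and 平台 in the dictionary, segment splits 全文检索平台 into 全文检索 and 平台: the walk passes through the shorter word 全文 and keeps going because a longer match exists.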
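
The d-gap compression in item 2 (4) can be sketched in the same spirit. A d-gap is the difference between consecutive document IDs in a sorted posting list; because gaps are usually much smaller than the IDs themselves, they fit in fewer bytes. The abstract does not say which byte-level code the thesis pairs with d-gaps, so the variable-byte code below is an assumption, and the function names put_varint, encode_postings, and decode_postings are hypothetical. The sketch does not attempt to reproduce the 75% reduction reported in the thesis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one value as a variable-byte code: 7 payload bits per byte,
// low-order group first; a set high bit means more bytes follow.
void put_varint(std::vector<std::uint8_t>& out, std::uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Encode a strictly increasing list of document IDs as d-gaps.
// The first gap is the first ID itself (gap from an implicit 0).
std::vector<std::uint8_t> encode_postings(const std::vector<std::uint32_t>& doc_ids) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    bool first = true;
    for (std::uint32_t id : doc_ids) {
        assert(first || id > prev);  // IDs must be strictly increasing
        put_varint(out, id - prev);
        prev = id;
        first = false;
    }
    return out;
}

// Decode bytes produced by encode_postings back into document IDs.
// The input is assumed to be well formed.
std::vector<std::uint32_t> decode_postings(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> doc_ids;
    std::uint32_t prev = 0;
    std::uint32_t gap = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        gap |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {
            shift += 7;  // more bytes of this gap follow
        } else {
            prev += gap;  // running sum of gaps restores the ID
            doc_ids.push_back(prev);
            gap = 0;
            shift = 0;
        }
    }
    return doc_ids;
}
```

For the posting list 3, 7, 8, 300, 100000 the gaps are 3, 4, 1, 292, 99700, which encode to 8 bytes in total, compared with 20 bytes when each ID is stored as a fixed 32-bit integer.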

