An Improved Chinese Word Segmentation Algorithm and Its Application in Lucene
Abstract
Chinese word segmentation is one of the core problems in Chinese information processing. An algorithm that combines string matching with statistical methods can carry out segmentation well. The algorithm first cuts the Chinese text at punctuation marks, splitting the input into short sentences that carry complete meaning, which raises the accuracy of the string-matching stage. Each short sentence is then scanned and segmented twice, by forward maximum matching and by reverse minimum matching; during each scan the intermediate result is optimized with semantic and linguistic rules, separating Chinese characters, Latin letters, and digits, which strengthens the algorithm's handling of different text types. Finally, ambiguity is resolved using the minimal-segmentation principle together with a statistical method.
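As a concrete illustration of the first two steps, here is a minimal Java sketch of the punctuation pre-split and the forward-maximum-match scan. The class name, the punctuation set, and the tiny inline dictionary are assumptions made for the example, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the pre-split and first scan; names and the inline
// dictionary are illustrative assumptions.
public class FmmSketch {
    // Punctuation characters treated as cut points between short sentences.
    private static final String PUNCT = "，。！？；：、,.!?;:";

    // Cut the text at punctuation so each piece is a short, self-contained sentence.
    static List<String> splitAtPunctuation(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (PUNCT.indexOf(c) >= 0) {
                if (cur.length() > 0) { sentences.add(cur.toString()); cur.setLength(0); }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) sentences.add(cur.toString());
        return sentences;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word; fall back to a single character when nothing matches.
    static List<String> forwardMaxMatch(String sentence, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            String match = null;
            for (int j = Math.min(sentence.length(), i + maxLen); j > i + 1; j--) {
                String cand = sentence.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = sentence.substring(i, i + 1);
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(List.of("中文", "分词", "算法", "改进"));
        for (String s : splitAtPunctuation("改进中文分词，算法。")) {
            System.out.println(forwardMaxMatch(s, dict, 4)); // [改进, 中文, 分词] then [算法]
        }
    }
}
```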
Chinese word segmentation algorithms are usually divided into three families: string-matching based, statistics based, and understanding based, each with its own strengths and weaknesses. The improved algorithm keeps the simplicity and efficiency of string matching and supplements it with basic linguistic rules to raise the accuracy of the initial segmentation stage. Concretely, the two scans use forward maximum matching and reverse minimum matching: the former tends to produce fewer fragments, while the latter resolves polysemy-type ambiguity better. The rule-based optimization runs during scanning, separating Chinese characters, letters, and digits, and then giving special treatment to Chinese numerals and classifiers and to Roman numerals among the letters, which handles segmentation of mixed text types well.
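The sketch below illustrates, under the same caveats, one plausible reading of the reverse minimum match (taking the shortest dictionary word ending at each position, with two characters as the shortest candidate) together with the character-type rule; both the reading and the helper names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Sketch of the second scan and the character-type rule. Method names and
// the choice of 2 as the shortest dictionary candidate are assumptions.
public class RmmSketch {
    // Reverse minimum matching: walk from the end of the sentence and take
    // the shortest dictionary word ending at the current position.
    static List<String> reverseMinMatch(String sentence, Set<String> dict, int maxLen) {
        LinkedList<String> tokens = new LinkedList<>();
        int i = sentence.length();
        while (i > 0) {
            String match = null;
            for (int len = 2; len <= Math.min(maxLen, i); len++) { // shortest first
                String cand = sentence.substring(i - len, i);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = sentence.substring(i - 1, i); // fallback: one char
            tokens.addFirst(match);
            i -= match.length();
        }
        return tokens;
    }

    // Character-type rule: split the text into runs of digits, Latin letters,
    // Han characters, or other symbols before dictionary matching.
    static List<String> splitByType(String text) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i, type = charType(text.charAt(i));
            while (i < text.length() && charType(text.charAt(i)) == type) i++;
            spans.add(text.substring(start, i));
        }
        return spans;
    }

    static int charType(char c) {
        if (Character.isDigit(c)) return 0;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 1;
        if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return 2;
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(reverseMinMatch("结合成分子", Set.of("结合", "合成", "分子"), 4)); // [结, 合成, 分子]
        System.out.println(splitByType("Lucene2中文分词")); // [Lucene, 2, 中文分词]
    }
}
```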
For ambiguity resolution, the improved algorithm compares the results of the two scans. If they are identical, the segmentation is output directly. If they differ, an ambiguous field is assumed to exist and disambiguation is applied: when the two scans yield different numbers of fragments, the result with fewer fragments is chosen according to the minimal-segmentation principle; when the fragment counts are equal, a statistical method decides, using word frequencies from the dictionary to select the correct output.
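This decision rule is simple enough to state directly in code; the following minimal sketch mirrors it, with the toy frequency table and helper names assumed for illustration.

```java
import java.util.List;
import java.util.Map;

// Sketch of the disambiguation step: identical scans pass through, unequal
// fragment counts follow the minimal-segmentation principle, and ties are
// broken by dictionary word frequencies. Names and data are assumptions.
public class DisambiguationSketch {
    static List<String> choose(List<String> fmm, List<String> rmm, Map<String, Integer> freq) {
        if (fmm.equals(rmm)) return fmm;                          // no ambiguity detected
        if (fmm.size() != rmm.size())                             // minimal segmentation
            return fmm.size() < rmm.size() ? fmm : rmm;
        return score(fmm, freq) >= score(rmm, freq) ? fmm : rmm;  // statistical tie-break
    }

    static long score(List<String> tokens, Map<String, Integer> freq) {
        long s = 0;
        for (String t : tokens) s += freq.getOrDefault(t, 0);
        return s;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of(
                "结合", 120, "成分", 80, "子", 10,
                "结", 5, "合成", 90, "分子", 200);
        // Equal fragment counts, so the higher-frequency segmentation wins.
        System.out.println(choose(
                List.of("结合", "成分", "子"),
                List.of("结", "合成", "分子"), freq)); // [结, 合成, 分子]
    }
}
```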
A further improvement concerns the dictionary storage structure: the first two characters of a word are hashed, and the remaining tail characters are kept in linked lists sorted by word frequency, which also improves segmentation efficiency to some extent. The whole algorithm can be plugged into Lucene as a component of a Chinese information retrieval system; experimental results show a considerable gain in accuracy over Lucene's bundled segmenter.
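The dictionary layout can be pictured roughly as follows; the class and method names are assumptions of this sketch, as is the treatment of one-character words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the described dictionary layout: a hash table keyed by the first
// two characters of each word, with the remaining characters ("tails") kept
// in a list ordered by descending frequency so common words are hit first.
public class TwoCharHashDict {
    private static final class Tail {
        final String rest; final int freq;
        Tail(String rest, int freq) { this.rest = rest; this.freq = freq; }
    }

    private final Map<String, List<Tail>> buckets = new HashMap<>();

    void add(String word, int freq) {
        if (word.length() < 2) return;                // sketch: ignore 1-char words
        List<Tail> list = buckets.computeIfAbsent(word.substring(0, 2), k -> new ArrayList<>());
        int i = 0;                                    // insert keeping frequency order
        while (i < list.size() && list.get(i).freq >= freq) i++;
        list.add(i, new Tail(word.substring(2), freq));
    }

    boolean contains(String word) {
        if (word.length() < 2) return false;
        List<Tail> list = buckets.get(word.substring(0, 2));
        if (list == null) return false;
        for (Tail t : list) if (t.rest.equals(word.substring(2))) return true;
        return false;
    }

    public static void main(String[] args) {
        TwoCharHashDict dict = new TwoCharHashDict();
        dict.add("中文", 500);
        dict.add("中文分词", 120);
        dict.add("中文信息处理", 40);
        System.out.println(dict.contains("中文分词")); // true
        System.out.println(dict.contains("分词"));     // false: different 2-char head
    }
}
```

Because a lookup walks its tail list in descending frequency order, frequent words are confirmed after few comparisons, which is presumably the efficiency gain the abstract refers to.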
