A New Method for Automatic Recognition of Unknown Words Based on Forum Corpora
Abstract
Unknown-word (out-of-vocabulary) recognition has long been a bottleneck in Chinese word segmentation research. To address the low efficiency of unknown-word recognition in Chinese segmentation, this paper proposes a new method that identifies unknown Chinese words from forum corpora. First, a web spider downloads forum pages; the resulting corpus is then refreshed periodically so that it always stays current, yielding a corpus with high timeliness. Next, the corpus is segmented: the Mutual Information function and the Duplicated Combination Frequency function are linearly combined into a new statistic, MD (named after the initials of the two functions), and the MD function is applied to the corpus to produce a candidate word list. Finally, unknown words are found by comparing the candidate list against the original lexicon, and the recognized unknown words are added to the core lexicon so that they can be recognized directly in the next segmentation pass.
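The MD statistic described above can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: the weights `alpha`/`beta` and the score `threshold` are assumed values (the paper selects them by training), and the raw adjacent-pair count stands in for the Duplicated Combination Frequency term.

```python
import math
from collections import Counter

def md_scores(text, alpha=0.6, beta=0.4):
    """Score adjacent character pairs with a linear blend of
    Mutual Information (MI) and pair frequency (a stand-in for the
    Duplicated Combination Frequency term); alpha/beta are assumed weights."""
    chars = Counter(text)
    pairs = Counter(zip(text, text[1:]))
    n_chars = len(text)
    n_pairs = max(len(text) - 1, 1)
    scores = {}
    for (a, b), freq in pairs.items():
        p_xy = freq / n_pairs            # joint probability of the pair
        p_x = chars[a] / n_chars         # marginal probability of each character
        p_y = chars[b] / n_chars
        mi = math.log2(p_xy / (p_x * p_y))
        scores[a + b] = alpha * mi + beta * freq
    return scores

def candidates(text, threshold=1.5, alpha=0.6, beta=0.4):
    """Keep pairs whose MD score exceeds the (assumed) threshold
    as candidate words."""
    return {w for w, s in md_scores(text, alpha, beta).items() if s >= threshold}
```

A pair that recurs far more often than its characters' independent frequencies would predict gets a high MD score and survives the threshold as a candidate word.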
     Chinese word segmentation differs from ordinary English tokenization: the composition and usage habits of the Chinese language make segmentation considerably harder than in English. Three traditional families of segmentation algorithms have emerged in this field: mechanical string-matching algorithms, understanding-based algorithms, and statistics-based algorithms. All three have problems, to varying degrees, with unknown words: mechanical matching fundamentally cannot recognize unknown words; understanding-based algorithms are complex and hard to implement, so they have seen little practical development and deployment; statistical algorithms can handle some unknown words and were once quite popular, but existing statistical methods still produce many misjudged and undecidable cases.
     Overall, statistics-based algorithms are the most practical choice in real applications, so this paper proposes an improved statistical algorithm for unknown-word recognition. The strategy is as follows. First, this work is the first to bring a web forum, the Tianya forum, into unknown-word recognition research, using a web spider to download forum pages. Second, the pages are preprocessed to build a corpus, which is refreshed periodically to keep the material timely. Third, the Mutual Information function and the Duplicated Combination Frequency function are linearly combined into the new statistic MD, and the MD function is used to segment the corpus and generate a candidate word list. Fourth, through repeated training of the function, a good threshold is selected, and unknown words are found by comparing the candidate list against the original lexicon. Finally, a test plan is designed and a test environment is built on the basis of this approach. The two indicators, new-word recall and segmentation precision, show that the proposed method for automatic unknown-word recognition is feasible.
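The final comparison step, and the feedback of newly found words into the core lexicon, amounts to simple set operations. A minimal sketch, with hypothetical example words rather than the paper's actual data:

```python
def find_unknown_words(candidate_words, lexicon):
    """Unknown words are candidates that are absent from the core lexicon."""
    return set(candidate_words) - set(lexicon)

def update_lexicon(lexicon, unknown_words):
    """Merge newly recognized words back into the lexicon so the next
    segmentation pass recognizes them directly."""
    return set(lexicon) | set(unknown_words)
```

After `update_lexicon`, running `find_unknown_words` on the same candidates returns nothing new, which is exactly the "recognize in one pass next time" behavior the abstract describes.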
Identification of unknown Chinese words has long been a bottleneck in the field. This paper downloads web documents from BBS forums with a web spider to construct a corpus that is updated periodically, then generates a candidate word list by extracting words from the corpus with the new MD function. Finally, the candidate list is compared against the existing lexicon to recognize unknown words. Experiments show that the proposed method is effective.
     Unlike English, Chinese text has no explicit word boundaries, and the composition and usage habits of the language make Chinese segmentation a harder problem than English tokenization. Current Chinese word segmentation algorithms fall into three classes: string-matching algorithms, understanding-based algorithms, and statistical algorithms. All three have problems, to varying degrees, with unknown words: string matching fundamentally cannot recognize unknown words; understanding-based algorithms suffer from high time and space complexity and are therefore not widely used; statistical algorithms are the most feasible and popular at present, but they still make identification errors.
     Overall, statistical algorithms are the most feasible option in practical applications, so this paper studies unknown Chinese word recognition on a statistical basis. First, Chinese word segmentation, and unknown-word recognition in particular, is described. Second, traditional segmentation algorithms and systems are analyzed and compared; there are three kinds of traditional algorithms: string-matching algorithms, understanding-based algorithms, and statistical algorithms. Mechanical matching cannot extract unknown words for a fundamental reason; understanding-based algorithms are complex and difficult to implement, so their practical development and application are not widespread; statistical algorithms can solve some unknown words to a certain extent and have become popular, but they still leave many misjudged or undecidable cases. Against these shortcomings of the traditional approaches, this paper downloads web documents from BBS forums with a web spider to construct a periodically updated corpus, which ensures the timeliness of the material. A candidate word list is then generated by extracting words from the corpus with the new function MD (a new statistic built by combining the Mutual Information function and the Duplicated Combination Frequency function). The candidate list is compared against the existing lexicon to recognize unknown words. Finally, a test program is designed and a test environment is set up on this basis. The two indicators, new-word recall rate and segmentation accuracy, show that the proposed method for automatically recognizing unknown words is feasible.
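The two evaluation indicators named above can be computed as follows. This is a generic sketch of the metrics, assuming the usual definitions (the paper does not spell out its exact formulas): recall over a reference set of new words, and precision as the share of correctly produced tokens.

```python
from collections import Counter

def new_word_recall(found, gold_new_words):
    """Share of the reference new words that the system actually recovered."""
    found, gold = set(found), set(gold_new_words)
    return len(found & gold) / len(gold) if gold else 0.0

def segmentation_precision(system_words, gold_words):
    """Share of system-produced tokens that also appear in the reference
    segmentation, counted as a multiset overlap."""
    sys_c, gold_c = Counter(system_words), Counter(gold_words)
    correct = sum(min(c, gold_c[w]) for w, c in sys_c.items())
    return correct / sum(sys_c.values()) if system_words else 0.0
```

For example, recovering two of four reference new words gives a recall of 0.5, regardless of how many extra candidates the system also proposed.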
