Research on the Application of WAF in Text Processing

Original title: WAF在文本处理中的应用研究
Author: Zhang Li (张黎)
Thesis level: Master's
Discipline: Signal and Information Processing
Keywords: Word Activation Force; text processing; Chinese word segmentation; new word identification; dynamic programming
Degree year: 2013
Supervisor: Xu Weiran (徐蔚然)
Discipline code: 081002
Degree-granting institution: Beijing University of Posts and Telecommunications
Submission date: 2012-11-29

Abstract

Chinese word segmentation and new word identification are among the most basic and important tasks in Chinese text processing and natural language processing, and their quality directly affects downstream work in these fields.
     Existing methods depend on dictionaries and labeled corpora, and are inefficient at finding low-frequency words. This thesis improves the WAF (Word Activation Force) model by combining it with a bigram language model and, on that basis, proposes an unsupervised machine learning approach that needs neither a dictionary nor a labeled corpus: it builds words from characters and performs word segmentation and new word identification at the same time.
     For word segmentation and new word identification, the thesis tests the maximum matching method, the inbound/outbound link comparison method, and the sorting method with the improved WAF model, and finally proposes an iterative dynamic programming method. The method extracts candidate strings from the relationships between characters, which addresses the low efficiency of finding low-frequency words; it uses dynamic programming for word disambiguation, which removes the dependence on labeled corpora; and it uses the segmentation results to filter the word list, which addresses the filtering of garbage strings.
     Experiments on 100,000 collected microblog messages show that the proposed WAF-based methods can effectively solve the problems above, and that the WAF model works well in text processing.
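
The abstract does not give the improved WAF formula or the details of the iterative dynamic programming method, so the following is only a minimal sketch of the general idea under stated assumptions. It uses the originally published WAF definition (Guo, Guo and Wang, Scientific Reports, 2011), waf_ij = (f_ij / f_i) * (f_ij / f_j) / d_ij^2, restricted to adjacent characters (d = 1) as a stand-in for the bigram setting. The function names, the 0.9 threshold, the use of WAF as a word score, and the penalty for single characters are all hypothetical choices made for illustration, not the thesis's method.

```python
from collections import Counter


def adjacent_waf(lines):
    """WAF between adjacent characters, using the published definition
    waf_ij = (f_ij / f_i) * (f_ij / f_j) / d**2 with d fixed to 1.
    Assumption: the thesis's improved, bigram-based variant may differ."""
    char_freq = Counter()
    pair_freq = Counter()
    for line in lines:
        char_freq.update(line)                  # f_i: character counts
        pair_freq.update(zip(line, line[1:]))   # f_ij: adjacent pair counts
    return {
        (a, b): (f_ab / char_freq[a]) * (f_ab / char_freq[b])
        for (a, b), f_ab in pair_freq.items()
    }


def segment(text, word_score, max_len=4, unknown_score=-1.0):
    """Dynamic programming over split points: return the segmentation of
    `text` whose summed word scores are highest. Single characters that are
    not in `word_score` are allowed at a fixed penalty (hypothetical)."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: top score for text[:i]
    back = [0] * (n + 1)               # back[i]: start of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_score:
                score = word_score[piece]
            elif len(piece) == 1:
                score = unknown_score
            else:
                continue               # unknown multi-character string
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    end = n
    while end > 0:                     # follow back-pointers from the end
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Toy corpus; no dictionary or labeled data is used.
lines = ["北京大学", "北京邮电大学", "大学生"]
waf = adjacent_waf(lines)
# Hypothetical candidate extraction: keep strongly bound character pairs.
candidates = {a + b: w for (a, b), w in waf.items() if w >= 0.9}
print(segment("北京邮电大学", candidates))  # ['北京', '邮电', '大学']
```

On this toy corpus, the pairs 北京, 邮电 and 大学 each reach a WAF of 1.0 and become candidate words, while weakly bound pairs such as 京大 (WAF 1/6) are left out, so the dynamic program prefers the three two-character words over any split into single characters.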
