Research and Implementation of New Word Discovery Technology for the WI Input Method

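English title: The Research and Implementation of Discovery of New Words for WI Input Method
Author: Zhou Chunbo (周春波)
Thesis level: Master's
Discipline: Computer Technology
Chinese keywords: 新词发现; 输入法; 最大流最小割; N元递增分步
English keywords: new word mining; input method; maximum flow minimum cut; N-gram increasing algorithm
Degree year: 2011
Advisor: Guan Yi (关毅)
Discipline code: 081203
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2011-06-01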

Abstract

A pinyin input method converts a pinyin string into a string of Chinese characters. Conversion accuracy depends largely on whether the dictionary covers commonly used words, especially newly emerging ones. Adding new words to the dictionary by hand is time-consuming and laborious, whereas new word discovery mines new words automatically from large-scale text and makes it easy to pick up currently popular words. This thesis explores new word discovery techniques and adds the mined words to the input method dictionary in order to improve the accuracy of pinyin-to-character conversion.
     The thesis first examines mining methods for two kinds of new words: sentiment words and commodity words. For sentiment words, it proposes an iterative method for mining Chinese sentiment words based on the maximum-flow minimum-cut principle. Experimental results show that the method is effective at mining subjective words and outperforms traditional subjective-word mining based on statistical models. For commodity words, the data source is users' search logs from a shopping website. Taking the characteristics of search log data into account, and building on the natural segmentation of user queries, the method applies the N-gram increasing algorithm together with string frequency statistics to compute the conditional probability of each candidate string, and then selects candidate commodity words.
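The abstract does not spell out how the minimum cut is set up or how the iteration proceeds, so the following is only a minimal Python sketch of a single cut step under assumed inputs. Each candidate word is given a hypothetical individual score toward "subjective" and toward "objective", and pairs of words get a hypothetical association score; the names `min_cut_partition`, `ind_subj`, `ind_obj`, and `assoc` are illustrative and not taken from the thesis. Maximum flow is computed with the Edmonds-Karp variant of Ford-Fulkerson, and words left on the source side of the resulting cut are labeled subjective.

```python
from collections import deque, defaultdict


def min_cut_partition(ind_subj, ind_obj, assoc):
    """Split words into (subjective, objective) sets via an s-t minimum cut.

    ind_subj[w], ind_obj[w]: non-negative individual scores (assumed inputs,
    defined for the same set of words).
    assoc[(u, v)]: non-negative association between two words (assumed input).
    """
    source, sink = "<S>", "<T>"
    cap = defaultdict(lambda: defaultdict(float))
    for w in ind_subj:
        cap[source][w] += ind_subj[w]  # cutting this edge labels w objective
        cap[w][sink] += ind_obj[w]     # cutting this edge labels w subjective
    for (u, v), a in assoc.items():
        cap[u][v] += a                 # separating u and v costs a
        cap[v][u] += a

    def bfs_path():
        # Shortest augmenting path in the residual graph (Edmonds-Karp).
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    while (parent := bfs_path()) is not None:
        # Find the bottleneck capacity on the path, then push that much flow.
        flow, v = float("inf"), sink
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Nodes still reachable from the source lie on the source side of the cut.
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reachable:
                reachable.add(v)
                queue.append(v)
    subjective = {w for w in ind_subj if w in reachable}
    return subjective, set(ind_subj) - subjective


# Toy scores: "awesome" is undecided on its own but strongly tied to "great".
subj, obj = min_cut_partition(
    {"great": 0.9, "table": 0.1, "awesome": 0.5},
    {"great": 0.1, "table": 0.9, "awesome": 0.5},
    {("great", "awesome"): 0.6},
)
print(sorted(subj), sorted(obj))  # ['awesome', 'great'] ['table']
```

In an iterative scheme, words labeled in one round could serve as seeds for the next round, but that loop is not shown because its details are not given here.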
     Finally, the thesis describes the development process for an input method on Apple's iOS platform and shows the important role that new word discovery plays in the WI input method. The WI input method is a sentence-level Chinese input method for Apple platforms, developed by the Web Intelligence Lab of the Language Technology Center in the School of Computer Science, Harbin Institute of Technology. Its first version was released on November 11, 2010. At the time of writing it had more than 120,000 users, and its accuracy and fluency had been well received by them.
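The commodity-word procedure in the abstract (natural segmentation of queries, N-gram growth, string frequency, conditional probability) can be illustrated with a small Python sketch. It is one possible reading rather than the thesis's actual algorithm: it assumes that queries are already split by whitespace, that a candidate grows by one token per step, and that a longer string is kept when count(longer string) / count(its prefix) clears a threshold. The function names and the values of `max_n`, `min_count`, and `min_prob` are hypothetical.

```python
from collections import Counter


def ngram_counts(queries, n):
    """Count n-grams of whitespace-separated tokens across all queries."""
    counts = Counter()
    for q in queries:
        tokens = q.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def mine_candidates(queries, max_n=4, min_count=2, min_prob=0.6):
    """Grow candidate strings one token at a time (n = 2, ..., max_n).

    An n-gram is kept when its prefix survived the previous step, its
    frequency is at least min_count, and
    P(last token | prefix) = count(n-gram) / count(prefix) >= min_prob.
    All thresholds are illustrative assumptions.
    """
    prev = {g: c for g, c in ngram_counts(queries, 1).items() if c >= min_count}
    candidates = {}
    for n in range(2, max_n + 1):
        current = {}
        for gram, count in ngram_counts(queries, n).items():
            prefix = gram[:-1]
            if prefix not in prev or count < min_count:
                continue
            prob = count / prev[prefix]
            if prob >= min_prob:
                current[gram] = count
                candidates[" ".join(gram)] = prob
        if not current:
            break  # nothing could be extended, so longer n-grams cannot qualify
        prev = current
    return candidates


queries = [
    "nokia 5230 case",
    "nokia 5230 battery",
    "nokia 5230",
    "red case",
    "nokia charger",
]
print(mine_candidates(queries))  # {'nokia 5230': 0.75}
```

Here "nokia 5230" appears in three of the four queries containing "nokia", so its conditional probability is 0.75 and it is kept, while the three-token extensions each occur only once and are dropped by the frequency threshold.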
