Abstract
To address the data sparsity and high perplexity of current Uyghur language models, comparative experiments were carried out on the 2-gram through 9-gram language models generated by two toolkits, SRILM and MITLM, in order to identify the N-gram language model with the lowest perplexity for a Uyghur corpus of a given size. The comparison shows that, for N-gram models built on Uyghur sentences, an order N between 3 and 5 is appropriate, and that N=3 is the better choice once both perplexity and computational complexity are taken into account. This conclusion should benefit the development of Uyghur natural language processing.
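The comparison described in the abstract can be illustrated with a minimal sketch that trains N-gram models of several orders and scores each by perplexity on held-out sentences. Everything in it is an assumption made for illustration: it does not call SRILM or MITLM, it uses add-one smoothing as a simple stand-in for whatever smoothing the experiments used, the function names `train` and `perplexity` are hypothetical, and the toy tokens are placeholders rather than Uyghur data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pad(sentence, n):
    """Add n-1 start symbols and one end symbol to a sentence."""
    return ["<s>"] * (n - 1) + sentence + ["</s>"]


def train(sentences, n):
    """Count n-grams and their (n-1)-gram contexts (hypothetical helper)."""
    grams, contexts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        # Only tokens that can be predicted belong to the vocabulary.
        vocab.update(sent + ["</s>"])
        for g in ngrams(pad(sent, n), n):
            grams[g] += 1
            contexts[g[:-1]] += 1
    return grams, contexts, len(vocab)


def perplexity(model, sentences, n):
    """Perplexity under add-one smoothing (an assumed stand-in smoother).

    Assumes every held-out token also occurs in the training vocabulary.
    """
    grams, contexts, vocab_size = model
    log_prob, count = 0.0, 0
    for sent in sentences:
        for g in ngrams(pad(sent, n), n):
            p = (grams[g] + 1) / (contexts[g[:-1]] + vocab_size)
            log_prob += math.log(p)
            count += 1
    return math.exp(-log_prob / count)


# Placeholder corpus; real experiments would use tokenized Uyghur sentences.
train_set = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "d"]]
held_out = [["a", "b", "c"], ["b", "c", "d"]]

for order in range(2, 10):
    model = train(train_set, order)
    print(order, round(perplexity(model, held_out, order), 3))
```

On a realistic corpus, higher orders tend to see sparser counts, which is the trade-off between perplexity and computational cost that the abstract weighs when recommending N=3.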