Abstract
To handle the morphological variation of Uyghur, this paper proposes a hybrid morphological reduction method that combines rules with a dictionary. A stemmer is implemented using left-to-right analysis and the Lovins algorithm. Based on a summary of the morphological concatenation rules, stems are extracted by rule and the extraction results are then verified against a dictionary. Five tests on different news texts give an average accuracy of 77.4%.
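The combination of rule-based stemming and dictionary verification can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix list, the stem dictionary, and the function names are hypothetical, Latin-script Uyghur is assumed for readability, and plain longest-match suffix removal (in the spirit of the Lovins algorithm) stands in for the paper's left-to-right analysis and full rule set.

```python
# Toy suffix inventory (assumption): a few plural and case endings only.
SUFFIXES = ("lar", "ler", "da", "de", "ta", "te", "ni", "ning")

# Toy stem dictionary (assumption), used to verify extracted stems.
STEM_DICT = {"kitab", "mektep", "bala"}


def strip_longest_suffix(word, suffixes, min_stem_len=2):
    """Remove the longest matching suffix, or return None if none applies."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return None


def stem(word, suffixes=SUFFIXES, dictionary=STEM_DICT):
    """Strip suffixes by rule until the candidate is found in the dictionary.

    Returns (stem, verified). If no dictionary entry is reached, the
    original word is returned with verified=False.
    """
    candidate = word
    while candidate not in dictionary:
        shorter = strip_longest_suffix(candidate, suffixes)
        if shorter is None:
            return word, False
        candidate = shorter
    return candidate, True


if __name__ == "__main__":
    for w in ("kitablarda", "mekteplerning", "bala", "xyz"):
        print(w, stem(w))
```

In this sketch the dictionary lookup is what stops the stripping, so a word is never reduced past a known stem; a word that never reaches a dictionary entry is returned unchanged and flagged as unverified. Phonological alternations at the stem boundary are not handled here.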