A Multi-Strategy Approach to Uyghur Stemming

Original title (Chinese): 融合多策略的维吾尔语词干提取方法
Authors: Sediyegvl Enwer; Xiang Lu; Zong Chengqing; Akbar Pattar; Askar Hamdulla
Keywords: Uyghur; morphology; stem segmentation; N-gram model; part of speech; context information
Journal: Journal of Chinese Information Processing
Journal code: MESS
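Affiliations: Institute of Information Science and Engineering, Xinjiang University; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences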
Publication date: 2015-09-15
Publisher: Journal of Chinese Information Processing (中文信息学报)
Year: 2015
Volume: 29
Funding: National Natural Science Foundation of China (61163032)
Language: Chinese
Record ID: MESS201505027
Page count: 7
Issue: 05
CN: 11-2325/N
Pages: 208-214

Abstract

Uyghur is an agglutinative language with complex morphology, and segmenting Uyghur words into stems and affixes is important for Uyghur language information processing. The performance of Uyghur stemming, however, still leaves considerable room for improvement. Taking the N-gram model as the basic framework and drawing on the constraints of Uyghur word formation, this paper proposes a stemming model that incorporates part-of-speech features and contextual stem information. Experimental results show that both kinds of information significantly improve stemming accuracy: compared with the baseline system, the models that incorporate part-of-speech features and contextual stem information reach accuracies of 95.19% and 96.60%, respectively.
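
The abstract does not give the model's formulas, so the following is only a rough sketch of how an N-gram framework might combine contextual stem information with a part-of-speech feature. Everything in it is an assumption made for illustration and is not taken from the paper: the factorization into P(stem | previous stem) and P(suffix | POS), the add-one smoothing, the hand-made training sentences in Latin transliteration, and the names `train`, `score`, and `stem_word`. The word-formation constraint is reduced here to "the suffix must be one seen in training, or empty".

```python
import math
from collections import Counter

BOS = "<s>"  # sentence-start marker used as the context of the first word

# Hypothetical toy corpus: each sentence is a list of (stem, suffix, POS).
TRAIN = [
    [("men", "", "PRON"), ("kitab", "ni", "N"), ("oqu", "dum", "V")],
    [("u", "", "PRON"), ("kitab", "lar", "N"), ("al", "di", "V")],
    [("men", "", "PRON"), ("mektep", "te", "N"), ("oqu", "dum", "V")],
]

def train(sentences):
    """Count stem bigrams and (POS, suffix) pairs from segmented sentences."""
    bigram, context = Counter(), Counter()
    suf_pos, pos_count = Counter(), Counter()
    stems, suffixes = set(), {""}  # the empty suffix is always allowed
    for sent in sentences:
        prev = BOS
        for stem, suffix, pos in sent:
            bigram[(prev, stem)] += 1
            context[prev] += 1
            suf_pos[(pos, suffix)] += 1
            pos_count[pos] += 1
            stems.add(stem)
            suffixes.add(suffix)
            prev = stem
    return {"bigram": bigram, "context": context, "suf_pos": suf_pos,
            "pos_count": pos_count, "stems": stems, "suffixes": suffixes}

def score(model, prev_stem, stem, suffix, pos):
    """log P(stem | prev_stem) + log P(suffix | pos), add-one smoothed."""
    v = len(model["stems"]) + 1   # +1 reserves mass for unseen stems
    s = len(model["suffixes"])
    p_stem = (model["bigram"][(prev_stem, stem)] + 1) / (model["context"][prev_stem] + v)
    p_suf = (model["suf_pos"][(pos, suffix)] + 1) / (model["pos_count"][pos] + s)
    return math.log(p_stem) + math.log(p_suf)

def stem_word(model, word, prev_stem, pos):
    """Return the best (stem, suffix) split of a non-empty word."""
    # Constraint: only splits whose suffix is known (or empty) are candidates.
    candidates = [(word[:i], word[i:]) for i in range(1, len(word) + 1)
                  if word[i:] in model["suffixes"]]
    return max(candidates, key=lambda c: score(model, prev_stem, c[0], c[1], pos))

model = train(TRAIN)
print(stem_word(model, "kitabni", prev_stem="men", pos="N"))
```

On this toy corpus the split kitab + ni outscores the unsegmented reading kitabni, because both the stem bigram (men, kitab) and the pair (N, ni) were observed in training. A real system would need a far larger annotated corpus and a stronger smoothing scheme.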

