Research on Tibetan Word Segmentation and Part-of-Speech Tagging (藏语分词与词性标注研究)

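Author: Kang Caijun (康才畯)
Degree level: Doctoral
Discipline: Chinese Minority Languages and Literature (中国少数民族语言文学)
Chinese keywords: 藏语黏写形式 ; 藏语分词 ; 条件随机场模型 ; 藏语人名识别 ; 藏语词性标注 ; 最大熵模型 ; 分词标注一体化
English keywords: Tibetan abbreviated forms ; Tibetan word segmentation ; conditional random field ; Tibetan name recognition ; Tibetan part-of-speech ; max entropy model ; integration of word segmentation and part-of-speech
Degree year: 2014
Advisors: Jiang Di (江荻) ; Pan Wuyun (潘悟云)
Discipline code: 050107
Degree-granting institution: Shanghai Normal University (上海师范大学)
Thesis submission date: 2014-05-01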

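Abstract

After more than twenty years of development, Tibetan information processing technology has achieved a good deal, both in research on Tibetan script processing and the drafting of related standards and in application development. The field is also gradually moving up to the level of language information processing. Although Tibetan information processing research follows closely behind English and Chinese in technique, the corpus resources that underpin such research are comparatively scarce. The publicly available Tibetan corpora are all unannotated raw corpora, whose practical value is very limited. Because research on the Tibetan language itself has not gone deep enough, many properties that would be valuable for Tibetan information processing have not been uncovered and described, which restricts the development and the range of application of the technology. To address these problems, this thesis applies several statistical models and methods to Tibetan word segmentation and part-of-speech tagging, and obtains the following main results: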
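     First, we propose a word-position-based method for Tibetan word segmentation, which is among the earlier work in China or abroad to incorporate the features of Tibetan abbreviated forms into segmentation research.
     We handle Tibetan word segmentation with a word-position-based statistical method that recasts segmentation as a sequence labeling problem, and we implement a Tibetan word segmentation system on this basis. The system uses a conditional random field model. To reflect the grammatical features of Tibetan abbreviated forms, the four-tag word-position set common in Chinese word segmentation is extended to a six-tag set better suited to Tibetan, and the model is trained on a corpus of more than one million syllable characters that was repeatedly proofread by hand. In tests on large-scale real text, the system reaches an F value of 91% in the open test, which is broadly satisfactory. Further analysis shows that segmentation accuracy is limited mainly by how well abbreviated forms are recognized. Given how complex and varied these forms are, we add a rule-based post-processing step that builds on earlier studies. The final F value exceeds 95%, which meets the practical needs of Tibetan corpus construction.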
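The F values reported here can be made concrete with a small worked example. The sketch below assumes the usual balanced F measure (harmonic mean of precision and recall over word spans); the abstract does not spell out its exact scoring procedure, and the function names and toy strings are illustrative only.

```python
from typing import List, Set, Tuple

def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) offsets of its words."""
    result, start = set(), 0
    for word in words:
        end = start + len(word)
        result.add((start, end))
        start = end
    return result

def prf(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F over exactly matching word spans."""
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f

# Toy example: the same five characters segmented two ways.
gold = ["ab", "c", "de"]
predicted = ["ab", "cd", "e"]
precision, recall, f_value = prf(gold, predicted)
```

Only the first word is segmented identically in both versions, so precision, recall and F all come out at one third for this toy input. For Tibetan text the offsets would more naturally count syllables rather than characters.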
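     Second, building on the segmentation work, we explore a method for recognizing Tibetan personal names based on their characteristics.
     By studying the characteristics of Tibetan personal names, we summarize several recognition strategies and settle on a statistical approach. Using a conditional random field model with features such as name boundaries, prefixes and suffixes, and context, we present a method for Tibetan personal name recognition. The experimental system reaches an F value of 91.26% in the open test. We did not further exploit the fact that names often share their form with ordinary words, a phenomenon that readily causes ambiguity, so recognition performance falls short of the ideal. It can, however, be improved by adjusting the feature tag set and optimizing the feature template set.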
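     Third, we combine several statistical models for Tibetan part-of-speech tagging and, for the first time in China or abroad, tag Tibetan with a maximum entropy model combined with a conditional random field model.
     Based on a study of Tibetan parts of speech, and while still meeting the needs of basic lexical analysis, we reduce the Tibetan part-of-speech tag set to a size that a statistical model can realistically handle. We then build a Tibetan part-of-speech tagging system on a maximum entropy model and train it on a small corpus. With this small training corpus, the maximum entropy tagger reaches an accuracy of 87.76%, close to what lexical analysis requires.
     On top of the maximum entropy model, we propose a correction model based on conditional random fields. This model is trained on the output of the maximum entropy model, so it can pick out correct tags that appear among the maximum entropy model's second-best and third-best results and thereby raise tagging accuracy. With training and test corpora of the same size, the combined maximum entropy and conditional random field model reaches an accuracy of 89.12%, close to the level of comparable Chinese part-of-speech taggers.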
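     Fourth, we implement an integrated conditional random field model for Tibetan word segmentation and part-of-speech tagging, which brings the two tasks into a single system and offers a new approach to Tibetan lexical analysis.
     We exploit the deeper dependencies between segmentation and part-of-speech tagging: the integrated model uses part-of-speech information to resolve ambiguities that arise during segmentation. With a fairly small training corpus, the segmentation F value of the integrated model reaches 89.0% in the open test. This indicates that the model combines word-position information well with the part of speech of the word each syllable belongs to, improves segmentation accuracy more effectively, and segments well enough to largely meet the needs of automatic segmentation for corpus construction. Its part-of-speech tagging accuracy reaches 85.35%. This is still slightly behind the standalone tagging model, but it should improve to some degree with a larger training corpus.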
Tibetan information processing technology has been developing for more than twenty years. Great achievements have been made both in Tibetan information processing research and in application development, and the technology has gradually entered the level of language information processing. Although Tibetan information processing follows English and Chinese technically, the Tibetan corpora on which such research depends are relatively scarce. Almost all publicly available corpora are untagged and of limited value. Linguistic research on Tibetan itself is not yet deep enough, so many properties that are valuable for Tibetan information processing have not been mined and described, which limits the development and scope of application of the technology. To address these problems, we adopt several statistical models and methods to study Tibetan word segmentation and part-of-speech tagging. Our main achievements are as follows:
     First, we put forward a Tibetan word segmentation method based on word position, which is among the earlier work, at home or abroad, to make use of Tibetan abbreviated forms in word segmentation research.
     We adopted a statistical method based on word position, which turns Tibetan word segmentation into a sequence labeling task, and built a Tibetan word segmentation system. The system is based on a conditional random field. According to the grammatical features of Tibetan abbreviated forms, it extends the 4-tag set used in Chinese word segmentation to a 6-tag set that is more suitable for Tibetan. We trained the conditional random field model on a corpus of more than 1 million syllable characters that was proofread manually. In a large-scale corpus experiment, the F value of the system reached 91% in the open test, which is satisfactory. In further research, we found that the precision was limited by the recognition results for Tibetan abbreviated forms. Considering the complexity of these forms, we drew on predecessors' research results and introduced a rule-based post-processing module. In the final experiment, the F value in the open test exceeded 95%, which means the system can meet the practical demands of Tibetan corpus construction.
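The word-position idea can be sketched in a few lines of Python. The abstract does not list the six tags, so the names used here (B, M, E, S, plus ES and SS for a syllable that also carries a fused abbreviated form) are assumptions made for illustration, as are the data layout and the Wylie-style romanization of the sample syllables.

```python
from typing import List, Tuple

# A word is (syllables, fused). `fused` marks that the last syllable also
# carries an abbreviated (fused) marker. Tag names are illustrative assumptions.
Word = Tuple[List[str], bool]

def tag_word(syllables: List[str], fused: bool) -> List[str]:
    """Assign one position tag per syllable of a single word."""
    if len(syllables) == 1:
        return ["SS" if fused else "S"]
    last = "ES" if fused else "E"
    return ["B"] + ["M"] * (len(syllables) - 2) + [last]

def to_sequence(words: List[Word]) -> List[Tuple[str, str]]:
    """Flatten a segmented sentence into (syllable, tag) pairs for a sequence labeler."""
    pairs: List[Tuple[str, str]] = []
    for syllables, fused in words:
        pairs.extend(zip(syllables, tag_word(syllables, fused)))
    return pairs

def to_words(pairs: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild word groupings from tags (the inverse of to_sequence)."""
    words: List[Word] = []
    current: List[str] = []
    for syllable, tag in pairs:
        current.append(syllable)
        if tag in ("E", "S", "ES", "SS"):
            words.append((current, tag in ("ES", "SS")))
            current = []
    if current:  # tolerate a tag sequence that ends mid-word
        words.append((current, False))
    return words

# "ngas" stands for a one-syllable word fused with a case marker.
sentence = [(["ngas"], True), (["dpe", "cha"], False), (["bklags"], False)]
tagged = to_sequence(sentence)
```

In this layout a sequence labeler such as a conditional random field would predict the tag column, and `to_words` would turn its predictions back into a segmentation. Splitting a fused syllable into its base word and marker would be a separate step that this sketch leaves out.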
     Second, we studied the features of Tibetan names and, building on the research on Tibetan word segmentation, discuss a method for recognizing them.
     Through research on Tibetan names, we summarized several strategies for Tibetan name recognition and finally chose a statistical approach. The approach is again based on a conditional random field and uses name boundaries, prefixes and suffixes, and context as features. The experiment showed that the F value of the approach reached 91.26% in the open test. Regrettably, we did not solve the problem of distinguishing Tibetan names from general words that have the same form, which dampens recognition performance. However, by adjusting the tag set and optimizing the feature templates, we should be able to improve the performance of Tibetan name recognition.
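As a rough illustration of the kinds of features named above (boundary cues, prefix and suffix cues, context), the following sketch builds a feature dictionary for one syllable. The cue lists, the window size, and the feature names are all hypothetical; the thesis's actual feature templates are not given in the abstract.

```python
from typing import Dict, List

# Hypothetical cue lists for illustration; a real system would derive
# them from annotated data rather than hard-code them.
NAME_SYLLABLES = {"bkra", "shis", "sgrol", "ma", "tshe", "ring"}
PREFIX_CUES = {"rgan"}   # assumed: a syllable that often precedes a name
SUFFIX_CUES = {"lags"}   # assumed: a syllable that often follows a name

def syllable_features(syllables: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build the feature dictionary for the syllable at position i."""
    feats: Dict[str, object] = {
        "cur": syllables[i],
        "in_name_list": syllables[i] in NAME_SYLLABLES,
    }
    # Context features: neighbouring syllables inside a fixed window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"ctx[{offset:+d}]"] = syllables[j] if 0 <= j < len(syllables) else "<PAD>"
    # Boundary cues: does a typical prefix or suffix syllable sit next to this one?
    feats["prev_is_prefix_cue"] = i > 0 and syllables[i - 1] in PREFIX_CUES
    feats["next_is_suffix_cue"] = i + 1 < len(syllables) and syllables[i + 1] in SUFFIX_CUES
    return feats

sample = ["dge", "rgan", "bkra", "shis", "lags"]
features = [syllable_features(sample, i) for i in range(len(sample))]
```

Each dictionary would become one row of input to a sequence labeler that marks where a name begins and ends. Names that share their form with ordinary words, the problem noted above, would need features beyond these surface cues.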
     Third, we used a combination of several statistical models to study Tibetan part-of-speech tagging. For the first time, we used the maximum entropy model combined with a conditional random field model to build a Tibetan part-of-speech tagging method.
     Through research on Tibetan parts of speech, we first simplified the Tibetan part-of-speech tag set to a size usable by a statistical model, then used the maximum entropy model to construct a Tibetan part-of-speech tagging system and trained it on a small-scale corpus. The experiment showed that the accuracy of the maximum entropy tagging system reached 87.76%, which almost meets the demands of lexical analysis.
     Building on the maximum entropy model, we put forward an error correction model based on a conditional random field. The error correction model was trained on the outputs of the maximum entropy model so that it could pick out the right tag from the three highest-probability outputs and improve the accuracy of Tibetan part-of-speech tagging. The experiments showed that, with the same training and test corpora, the mixed model combining maximum entropy with a conditional random field reached 89.12% accuracy, close to the level of comparable Chinese part-of-speech tagging systems.
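The two-stage arrangement can be pictured as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the first-stage tagger is represented only by its ranked candidate lists, the feature names `cand1` to `cand3` are invented for illustration, and training the second-stage model is not shown.

```python
from typing import Dict, List, Tuple

# Candidate tags for one word with first-stage probabilities, best first.
NBest = List[Tuple[str, float]]

def correction_features(words: List[str], nbest: List[NBest], k: int = 3) -> List[Dict[str, str]]:
    """Turn first-stage k-best output into per-word features for a second-stage model."""
    rows = []
    for word, candidates in zip(words, nbest):
        row = {"word": word}
        for rank in range(k):
            row[f"cand{rank + 1}"] = candidates[rank][0] if rank < len(candidates) else "<NONE>"
        rows.append(row)
    return rows

def oracle_accuracy(nbest: List[NBest], gold: List[str], k: int = 3) -> float:
    """Share of words whose gold tag is among the top-k candidates.

    With k=1 this is the first-stage accuracy; with k=3 it is an upper
    bound on what a corrector choosing among three candidates can reach.
    """
    hits = sum(any(tag == g for tag, _ in cands[:k]) for cands, g in zip(nbest, gold))
    return hits / len(gold)

# Toy data: placeholder words and tags, not drawn from the thesis corpus.
words = ["w1", "w2", "w3"]
nbest = [
    [("n", 0.6), ("v", 0.3), ("a", 0.1)],
    [("v", 0.5), ("n", 0.4)],
    [("a", 0.7), ("n", 0.2), ("v", 0.1)],
]
gold = ["n", "n", "d"]
rows = correction_features(words, nbest)
top1 = oracle_accuracy(nbest, gold, k=1)
top3 = oracle_accuracy(nbest, gold, k=3)
```

On this toy input the first stage gets one word in three right, while the gold tag is among the candidates for two words in three. That gap is the room a correction model has to work with.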
     Fourth, we built an integrated model of Tibetan word segmentation and part-of-speech tagging based on a conditional random field. By integrating the two tasks into a unified system, we put forward a new approach to Tibetan lexical analysis.
     We took full advantage of the deep dependencies between word segmentation and part-of-speech tagging, and used part-of-speech information to deal with ambiguity in word segmentation. With a small-scale training corpus, the segmentation F value of the integrated model reached 89.0% in the open test, which shows that the model combines word position information with part-of-speech information well and can improve segmentation precision more effectively. The performance of the integrated model was able to meet the demands of automatic word segmentation for corpus construction. The part-of-speech tagging accuracy of the integrated model reached 85.35%. Although this is still behind the independent part-of-speech tagging model, we should be able to improve it by expanding the training corpus.
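One common way to realize such an integrated model is to attach the part of speech to each word-position tag, so that a single sequence labeler predicts both at once. The sketch below assumes that encoding, with a plain four-tag position set and the label format `B-n`; the abstract does not state the label scheme actually used.

```python
from typing import List, Tuple

TaggedWord = Tuple[List[str], str]  # (syllables, part-of-speech tag)

def encode(words: List[TaggedWord]) -> List[Tuple[str, str]]:
    """Give each syllable a joint label made of its position tag and the word's POS."""
    pairs = []
    for syllables, pos in words:
        if len(syllables) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(syllables) - 2) + ["E"]
        pairs.extend((s, f"{p}-{pos}") for s, p in zip(syllables, positions))
    return pairs

def decode(pairs: List[Tuple[str, str]]) -> List[TaggedWord]:
    """Recover segmented, POS-tagged words from joint labels."""
    words: List[TaggedWord] = []
    current: List[str] = []
    pos = ""
    for syllable, label in pairs:
        position, pos = label.split("-", 1)
        current.append(syllable)
        if position in ("E", "S"):
            words.append((current, pos))  # the POS is read off the closing syllable
            current = []
    if current:  # tolerate a label sequence that ends mid-word
        words.append((current, pos))
    return words

# Romanized placeholder sentence with illustrative tags "n" and "v".
sentence = [(["dpe", "cha"], "n"), (["bklags"], "v")]
labels = encode(sentence)
```

Because every label carries a part of speech, a labeler's choice of word boundary and its choice of tag constrain each other. The cost is a larger label set, which may be part of why the tagging accuracy trails the standalone tagger on a small training corpus.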

