A Study of the Morphology and Conversion System of Cyrillic and Traditional Mongolian
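Abstract
The Mongolian people have used several scripts in the past, but today they mainly use the Traditional Mongolian script, Cyrillic Mongolian, and the Tod script.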
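     This thesis studies information-processing technology for Traditional Mongolian and Cyrillic Mongolian. This covers two things: conversion between the two scripts, and their morphology. The introduction describes the significance, purpose, and objectives of this work in detail.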
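     Combining computer technology with Mongolian studies has become an inevitable trend in Mongolian computational linguistics. Although companies and individuals in Mongolia have carried out research in this field and developed some applications, the level of that work does not yet match the level of research in developed countries.
     For this reason, I have devoted this work to information-processing technology for Cyrillic Mongolian and Traditional Mongolian.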
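     In this work we analyze Cyrillic Mongolian and Traditional Mongolian morphologically and use Mongolian word-formation rules to study conversion between the two scripts. The process has two steps. First, a Cyrillic or Traditional Mongolian word is analyzed morphologically to identify its stem and suffixes. Then the stem and suffixes are converted into their counterparts in the other script, and word-formation rules are used to generate the corresponding word. The main work completed in this thesis is as follows: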
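     1. We studied the characteristics of Cyrillic Mongolian and Traditional Mongolian in order to apply the Two-Level Morphology Model to Mongolian. From the standpoint of computational linguistics, the two have many similarities and some differences. Cyrillic Mongolian orthography currently has 66 categories of rules, whereas Traditional Mongolian has only 3: the vowel harmony rule, the consonant rule, and the connecting-vowel rule. Mongolian is an agglutinative language in which new word forms are produced by attaching suffixes to a stem. Cyrillic Mongolian and Traditional Mongolian also differ in how stems and suffixes are joined, because their orthographic rules differ. On this basis, I studied generation and parsing models for nouns and verbs, worked out the rules for attaching inflectional suffixes to stems, and identified all possible combinations of a stem with multiple inflectional suffixes.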
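     To make the vowel harmony rule mentioned above concrete, here is a minimal Python sketch of how a suffix variant might be chosen for a Cyrillic stem. It is not the thesis's rule set: the function names are hypothetical, the rule "the last non-neutral vowel of the stem decides" is a deliberate simplification, and iotated letters, vowel deletion, and stem-final vowels are ignored.

```python
# Simplified vowel-harmony lookup for Cyrillic Mongolian (illustrative only).
# Maps a harmony-bearing stem vowel to the vowel used in the suffix.
HARMONY = {
    "а": "а", "у": "а",   # back vowels select "а"
    "о": "о",             # back rounded selects "о"
    "э": "э", "ү": "э",   # front vowels select "э"
    "ө": "ө",             # front rounded selects "ө"
}

def harmony_vowel(stem: str) -> str:
    """Return the suffix vowel chosen by the last non-neutral vowel of the stem.

    "и" is treated as neutral and skipped. If the stem has no
    harmony-bearing vowel, "э" is used as a fallback (an assumption).
    """
    for ch in reversed(stem.lower()):
        if ch in HARMONY:
            return HARMONY[ch]
    return "э"

def instrumental(stem: str) -> str:
    """Attach the instrumental suffix (long harmonising vowel + "р") to a stem."""
    v = harmony_vowel(stem)
    return stem + v + v + "р"

if __name__ == "__main__":
    for stem in ["ном", "гар", "гэр", "ус", "үг"]:
        print(stem, "->", instrumental(stem))
```

     A real rule file would express this as two-level correspondences rather than a lookup table; the table is only meant to show why one suffix has several surface variants.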
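     2. Once the work above was complete, building the corresponding resource base became urgent, since it is the foundation for further work on conversion between Cyrillic Mongolian and Traditional Mongolian. The resource base comprises a stem resource base, a morphology resource base, and a supplementary resource base. Attaching inflectional suffixes to Mongolian stems can generate a very large number of words, so I chose the stem as the basic unit of the resource base. The main advantages are that the data does not grow too large, that applications run faster, and that word-generation rules can be fixed so that every possible form of a given word can be covered. The stem resource base contains 3 sub-bases: a base of corresponding Cyrillic and Traditional Mongolian stems with word explanations (72000 entries); a base of corresponding Cyrillic and Traditional Mongolian stems with part-of-speech tags (61000 entries); and a base made up of stem codes and word-generation and word-parsing codes (48000 entries). The morphology resource base contains 2 sub-bases: a base of corresponding Cyrillic and Traditional Mongolian inflectional suffixes (86 entries), and a base of conditions for attaching multiple inflectional suffixes (876 entries). The supplementary resource base contains 2 sub-bases: a proper noun base (9135 entries) and an abbreviation base (1100 entries).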
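     3. Based on the two-level morphology model and finite automata, we built models of the orthographic rules of Cyrillic Mongolian and Traditional Mongolian. Using these models we analyzed word structure and ran experiments on conversion between the two scripts. PC-Kimmo is an open-source system for morphological analysis with two components, a lexicon and a set of rules. We used PC-Kimmo as the tool for building the conversion model. Words were divided into two major classes, nouns and verbs, and a generation model was built for each. I modeled the orthographic rules of Cyrillic Mongolian and of Traditional Mongolian separately and, using these models together with the resource base, built a system for conversion between the two scripts, named KIM_MON (version 1). The system parses, evaluates, and generates words for the user and reports the final result.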
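     4. Finally, we used the KIM_MON system to run experiments on Mongolian morphological analysis. Morphological analysis of Cyrillic Mongolian and Traditional Mongolian reached an accuracy of 97.6%. Given a correct morphological analysis, KIM_MON joined stems and suffixes into words correctly in 100% of cases. Building on the morphological work, we then ran conversion experiments: accuracy was 91.3% from Cyrillic Mongolian to Traditional Mongolian and 89.1% from Traditional Mongolian to Cyrillic Mongolian. In the conversion experiment on Cyrillic words that are spelled the same but differ in meaning, accuracy reached 86.9%. The experiments also showed that conversion accuracy for such words improves as the amount of training data increases.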
Although the Mongolian people have used several scripts over their history, three are in main use today: the Traditional Mongolian script, Cyrillic Mongolian, and the Tod script.
     In this thesis we study morphology and script conversion for two written forms of Mongolian, the Traditional Mongolian script and Cyrillic Mongolian. The introduction describes the significance, aim, and objectives of the research in detail. Countries that regard the language processing industry as critical to the next generation of knowledge-based computers have supported it strongly through public policy, established national research centers, and carried out many capital-intensive national projects. Combining Mongolian studies with modern technology and developing Mongolian computational linguistics are therefore pressing needs.
     Recognizing Mongolian words and sentences by computer makes it possible to study the principles and features of Mongolian with modern approaches and technologies, and this work lays a foundation for our further research. Although some Mongolian companies and individuals have done research and created applications in this field, the results remain unsatisfactory compared with the level reached in other countries. Furthermore, no unified system has yet been created for the field.
     For these reasons, I chose computer processing of Mongolian as the main subject of this thesis.
     In this work we perform morphological analysis of both Cyrillic Mongolian and the Traditional Mongolian script and define, by computer, how affixes attach in accordance with the orthographic rules. The aim is to convert Cyrillic Mongolian text to the Traditional Mongolian script and vice versa. The process runs in the following steps. First, a Cyrillic or Traditional Mongolian word is analyzed morphologically to find its stem and affixes, which are then converted to the other script. Then the converted stem is joined with the converted affixes to generate the word in the target script. This combined process belongs to the morphology branch of computational linguistics. A word that is written differently according to its meaning in the Traditional Mongolian script may be written the same way in Cyrillic, so we also set out to determine word meaning. Within this research we carried out the following activities.
     1. We described the features of both Cyrillic Mongolian and the Traditional Mongolian script, the Mongolian parts of speech, and word structure. The Traditional Mongolian script is a phonetic script with many words that sound the same. It follows morphological and traditional principles. Cyrillic Mongolian follows phonetic principles, with the drawback that it does not observe the other principles.
     From the standpoint of computational linguistics, the Traditional Mongolian script and Cyrillic Mongolian share some features but also differ in many. Their orthographies are similar in some respects, because the scholars who created the Cyrillic orthography stated that it was based on the rules of the Traditional Mongolian script. The Cyrillic Mongolian orthography in use today consists of 66 articles, whereas the Traditional Mongolian script, inherited over a thousand years, has only 3 rules: vowel harmony, syllable-closing consonants, and connecting vowels. Mongolian is an agglutinative language, and words are formed and inflected by attaching suffixes to a word stem. However, Cyrillic Mongolian and the Traditional Mongolian script follow different rules for attaching suffixes to stems. This difference is not a feature of the Mongolian language but of the orthographic rules of each script.
     When Mongolian nouns, adjectives, and pronouns occur in a sentence, they are inflected with plural, case, and possessive suffixes. Verbs are inflected with voice, state, temporal ending, possessive ending, subordinating conjunctive, and determining suffixes. On this basis we developed models of noun and verb inflection.
     We then worked out the possible suffix sequences and formulated suffix combination rules.
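     As a small illustration of what a suffix combination rule can look like, the Python sketch below treats the three noun suffix categories named above as ordered, optional slots and enumerates the category sequences that respect the order. The slot order (plural, then case, then possessive) and all names are assumptions made for this example; the thesis's actual rules and exceptions are not reproduced here.

```python
from itertools import combinations

# Assumed slot order for noun suffixes; each slot is optional.
NOUN_SLOTS = ("PLURAL", "CASE", "POSSESSIVE")

def valid_sequences(slots=NOUN_SLOTS):
    """Return every order-preserving selection of slots, including the empty one."""
    result = []
    for size in range(len(slots) + 1):
        # combinations() keeps the input order, so each tuple respects slot order.
        result.extend(combinations(slots, size))
    return result

def is_valid(sequence, slots=NOUN_SLOTS):
    """Accept a sequence only if its categories are known and strictly follow slot order."""
    if any(cat not in slots for cat in sequence):
        return False
    positions = [slots.index(cat) for cat in sequence]
    return all(a < b for a, b in zip(positions, positions[1:]))

if __name__ == "__main__":
    for seq in valid_sequences():
        print(seq)
```

     With three optional slots this gives 8 category sequences. A finite-state formulation encodes the same constraint as transitions that only move forward through the slots.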
     2. After the research above, we needed to build a database. I therefore created morphological and inflectional-suffix databases that meet the requirements of both the Mongolian language and my own research. These databases will underpin many of our future tasks in computational linguistics; we will first use them to complete the research on Mongolian morphology and the script conversion system. Storing every word stem together with all of its grammatically inflected forms as separate entries would be the simplest but crudest method. We therefore defined the database unit as the word stem. The main advantages are: the number of words stored in the database does not become excessively large; program speed increases; and grammatical word forms are derived from the grammar, so all possible inflections can be covered.
     The basic database consists of the following 3 bases: a primary database holding the primary key, the Cyrillic Mongolian and Traditional Mongolian headwords, and explanations (72210); a database of word classes (53294); and an inflectional database with codes that indicate grammatical inflection (48000).
     We also created a vocabulary of abbreviations containing 1100 words and a vocabulary of proper nouns containing 9135 words.
     According to our research, Mongolian has 86 suffixes, such as the instrumental, directive, dative-locative, plural, and negative. We numbered these suffixes and built a suffix vocabulary containing their Cyrillic Mongolian and Traditional Mongolian forms. Stacked suffixes follow a strict order: each morpheme in a word's structure has its own fixed position and sequence, and morpheme boundaries are clear. There are, however, some exceptions to this fixed position and order. For both scripts we created a database of the suffix sequences that we determined to be valid.
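     The stem-as-unit design described above can be sketched roughly as follows in Python. The table layout, field names, and sample rows are hypothetical, and the Traditional Mongolian values are romanized placeholders rather than data from the thesis. The count also assumes, for simplicity, that every stem accepts every listed suffix sequence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StemEntry:
    stem_id: int
    cyrillic: str
    traditional: str   # romanized placeholder, not real script data
    word_class: str

@dataclass(frozen=True)
class SuffixEntry:
    suffix_id: int
    name: str

# Toy rows standing in for the stem and suffix bases.
STEMS = [
    StemEntry(1, "ном", "nom", "noun"),
    StemEntry(2, "гар", "gar", "noun"),
]
SUFFIXES = [
    SuffixEntry(1, "plural"),
    SuffixEntry(2, "dative-locative"),
    SuffixEntry(3, "instrumental"),
]
# Toy stand-in for the suffix-sequence base: tuples of suffix ids.
ALLOWED_SEQUENCES = {(1,), (2,), (3,), (1, 2), (1, 3)}

# Indexes for looking a stem up from either script.
BY_CYRILLIC = {s.cyrillic: s for s in STEMS}
BY_TRADITIONAL = {s.traditional: s for s in STEMS}

def stored_rows() -> int:
    """Rows actually kept in the three toy tables."""
    return len(STEMS) + len(SUFFIXES) + len(ALLOWED_SEQUENCES)

def describable_forms() -> int:
    """Word forms the tables can describe: each stem bare or with one allowed sequence."""
    return len(STEMS) * (len(ALLOWED_SEQUENCES) + 1)

if __name__ == "__main__":
    print(stored_rows(), describable_forms())
```

     The stored rows grow additively with new stems while the describable forms grow multiplicatively, which is the space advantage the stem-based design aims for.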
     3. With the databases in place, I set the goal of performing Mongolian morphological analysis with two-level morphology. We modeled the rules of the Traditional Mongolian script and Cyrillic Mongolian for morphological analysis using finite-state automata and two-level morphology. With this model we ran experiments on parsing words into their structure and on generating words. We studied the approach in depth, put it to practical use, and carried out the following activities.
     In the course of the work it became clear that two-level finite-state morphology can be applied to Mongolian morphology, which allows word generation and parsing to be used in further research. In computational morphology, both parsing and generating words with inflectional affixes need to be based on finite-state automata, so it is important to design the automata that inflect the database units. We classified the database into inflected and uninflected words and divided the inflected words into nouns and verbs; grammatical inflections follow either the noun or the verb pattern. To process a description in PC-KIMMO, all rules must be written correctly and then checked.
     In addition, a chapter of the thesis considers approaches to writing rules. We modeled the Mongolian rules and performed morphological analysis. To do this, we modeled Cyrillic Mongolian and the Traditional Mongolian script separately and created a suitable rule file for each. Using the rule files and lexicon files, we developed morphological analysis software for Cyrillic Mongolian and the Traditional Mongolian script and tested it successfully.
     The word automaton has to parse the text entered by the user, process it, generate the correct word by attaching the appropriate affixes to the stem, and display the result or the word's structure. Processing Unicode text requires some additional steps.
     As a result, Cyrillic Mongolian and Traditional Mongolian texts can be processed, and the first version of the KIM_MON program was developed. The quality of text processing does not depend on the character encoding (Latin, Cyrillic, etc.) but directly on how complete and well classified the vocabulary file is and how correctly the rules are defined.
     4. I conducted experiments on Mongolian morphological analysis using the KIM_MON program and the databases. In brief, the results are as follows. Morphological parsing of text was correct in 97.6% of cases. In word joining, every word that had been parsed correctly was also joined correctly. Conversion with the developed algorithm gave the following results.
     a) From Cyrillic Mongolian to the Traditional Mongolian script, the word sense was recognized correctly in 91.3% of cases.
     b) From the Traditional Mongolian script to Cyrillic Mongolian, the word sense was recognized correctly in 89.1% of cases.
     In the separate experiment on word sense recognition, accuracy was 86.9%. The experiments indicate that a larger training database can increase recognition accuracy.
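     The analyze, convert, and generate steps evaluated above can be sketched in a few lines of Python. This is not KIM_MON and does not use PC-KIMMO: the lexicon, the suffix tables, and the function names are hypothetical toy data, the Traditional Mongolian side is shown in romanized form with a hyphen marking the morpheme boundary, only one suffix per word is handled, and only back-vowel stems are included.

```python
# Toy lexicon: Cyrillic stem -> romanized Traditional Mongolian stem (illustrative values).
STEMS = {"ном": "nom", "гар": "gar"}

# Cyrillic suffix surface form -> suffix category.
CYR_SUFFIX = {"ын": "GEN", "аар": "INS", "оор": "INS"}

# Suffix category -> romanized Traditional form (back-vowel variants only).
TRAD_SUFFIX = {"GEN": "un", "INS": "iyar"}

def analyze(word):
    """Split a Cyrillic word into (stem, [categories]); return None if not covered."""
    for cut in range(len(word), 0, -1):      # try the longest stem first
        stem, rest = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        if rest == "":
            return stem, []
        if rest in CYR_SUFFIX:
            return stem, [CYR_SUFFIX[rest]]
    return None

def convert(word):
    """Analyze a Cyrillic word, map stem and suffix, and join them in the target form."""
    parsed = analyze(word)
    if parsed is None:
        return None
    stem, categories = parsed
    parts = [STEMS[stem]] + [TRAD_SUFFIX[c] for c in categories]
    return "-".join(parts)

if __name__ == "__main__":
    for w in ["ном", "номын", "гараар", "номоор"]:
        print(w, "->", convert(w))
```

     Routing the conversion through a suffix category rather than mapping surface strings directly is what lets several Cyrillic variants of one suffix share a single target form.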
