Abstract
Automatic spelling checking is a challenging research topic in natural language processing, with broad applications in corpus construction, text editing, and speech and character recognition. Tibetan is a phonetic alphabetic script in which each character (syllable) is assembled from 1 to 7 basic components arranged horizontally and vertically. Non-real characters (ill-formed character strings) occur frequently in Tibetan text, and detecting them is the foundation and focus of Tibetan spelling checking. By analyzing the character formation rules of Tibetan grammar, this paper uses a Tibetan character vector model to represent each character as numbers (vectors) that a computer can easily manipulate, builds a rule-constrained Tibetan character vector model, and then designs a spelling checking model and algorithm on top of it. The algorithm is simple and easy to implement; in testing, it achieved an average spelling checking accuracy of 99.995% at an average speed of 1,060 characters per second.
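To make the idea of a rule-constrained character vector concrete, the following is a minimal sketch and not the paper's actual model. It assumes a hypothetical seven-slot layout (prefix, superscript, root, subscript, vowel, suffix, second suffix), writes components in Wylie transliteration, and checks only the standard per-slot letter inventories of traditional Tibetan grammar. The names `Syllable` and `is_plausible` are illustrative, and the finer rules governing which components may combine are deliberately omitted.

```python
from typing import NamedTuple

# Standard letter inventories of traditional Tibetan grammar (Wylie transliteration).
ROOTS = frozenset(
    "k kh g ng c ch j ny t th d n p ph b m ts tsh dz w zh z ' y r l sh s h a".split()
)
PREFIXES = frozenset(["g", "d", "b", "m", "'"])
SUPERSCRIPTS = frozenset(["r", "l", "s"])
SUBSCRIPTS = frozenset(["y", "r", "l", "w"])
VOWELS = frozenset(["i", "u", "e", "o"])  # the inherent vowel "a" is left empty
SUFFIXES = frozenset(["g", "ng", "d", "n", "b", "m", "'", "r", "l", "s"])
SECOND_SUFFIXES = frozenset(["d", "s"])


class Syllable(NamedTuple):
    """Hypothetical 7-slot vector; an empty string marks an unused slot."""
    prefix: str = ""
    superscript: str = ""
    root: str = ""
    subscript: str = ""
    vowel: str = ""
    suffix: str = ""
    second_suffix: str = ""


def is_plausible(s: Syllable) -> bool:
    """Return False if any slot violates its inventory (a necessary check only)."""
    if s.root not in ROOTS:  # the root letter is mandatory
        return False
    optional_slots = [
        (s.prefix, PREFIXES),
        (s.superscript, SUPERSCRIPTS),
        (s.subscript, SUBSCRIPTS),
        (s.vowel, VOWELS),
        (s.suffix, SUFFIXES),
        (s.second_suffix, SECOND_SUFFIXES),
    ]
    if any(value and value not in allowed for value, allowed in optional_slots):
        return False
    # A second suffix can only follow a suffix.
    if s.second_suffix and not s.suffix:
        return False
    return True


# "bsgrubs" fills all seven slots: b + s + g + r + u + b + s.
print(is_plausible(Syllable("b", "s", "g", "r", "u", "b", "s")))
# "k" is not one of the five prefix letters, so this vector is rejected.
print(is_plausible(Syllable(prefix="k", root="g")))
```

Passing this check is necessary but not sufficient for a well-formed character: a fuller checker would also encode the pairwise combination rules between slots, for example which prefixes may precede which root letters.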