A Mechanism for Handling Multi-Category Words That Combines Rules and Statistics
Abstract
Part-of-speech tagging is a fundamental task in natural language processing. Tagging accuracy matters for Chinese corpus annotation, machine translation, and information retrieval over large-scale text.
    This thesis studies part-of-speech tagging methods and analyzes the strengths and weaknesses of rule-based and statistical approaches. On that basis it proposes a disambiguation strategy that combines rules and statistics. On the rule side, the construction of the rule base is improved: rules are keyed on the parts of speech of a multi-category word (a word that can take more than one part of speech) rather than on the word itself, and statistics are tried as an aid for building the rule base. On the statistical side, a learning mechanism is introduced on top of the bigram model, and the way part-of-speech (transition) probabilities and lexical probabilities are obtained is revised according to the learning results. A multi-category word processing system was implemented following this strategy. It reaches a tagging accuracy of 97.85% in the closed test and 96.71% in the open test. The experimental results indicate that combining rules and statistics effectively improves both disambiguation accuracy and overall tagging accuracy.
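The combined strategy described in the abstract can be illustrated with a small sketch. It is not the thesis system: the abstract does not give the rule format, the learning mechanism, or the training data, so everything below is an assumption made for illustration. The toy corpus, the tag names (`r`, `v`, `n`, `u`), the rule table `RULES`, and the functions `train`, `candidate_tags`, and `tag` are all hypothetical, and add-one smoothing stands in for the unspecified probability corrections. Rules are keyed on the previous tag and on the set of possible tags of the ambiguous word, not on the word itself. Words that no rule resolves are left to a bigram Viterbi search.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # sentence-start pseudo-tag

# Toy hand-tagged corpus, invented for illustration (not the thesis data).
TRAIN = [
    [("我", "r"), ("爱", "v"), ("学习", "n")],
    [("他", "r"), ("学习", "v"), ("语言", "n")],
    [("他", "r"), ("的", "u"), ("学习", "n")],
    [("我", "r"), ("学习", "v"), ("语言", "n")],
]

# Hypothetical rule: after an unambiguous "u", a word whose possible tags
# are {n, v} is tagged "n". The key uses the tag set, not the word.
RULES = {("u", frozenset({"n", "v"})): "n"}

def train(corpus):
    """Collect bigram tag counts, word/tag counts, and the tag lexicon."""
    trans, emit, ctx, tag_count = Counter(), Counter(), Counter(), Counter()
    lexicon = defaultdict(set)
    for sent in corpus:
        prev = START
        for word, t in sent:
            trans[(prev, t)] += 1
            ctx[prev] += 1
            emit[(t, word)] += 1
            tag_count[t] += 1
            lexicon[word].add(t)
            prev = t
    return {"trans": trans, "emit": emit, "ctx": ctx, "tag_count": tag_count,
            "lexicon": dict(lexicon), "tags": sorted(tag_count)}

def candidate_tags(words, model, rules):
    """Rule stage: narrow an ambiguous word to one tag when a rule fires."""
    cands = [sorted(model["lexicon"].get(w, model["tags"])) for w in words]
    for i in range(1, len(words)):
        if len(cands[i - 1]) == 1 and len(cands[i]) > 1:
            chosen = rules.get((cands[i - 1][0], frozenset(cands[i])))
            if chosen is not None:
                cands[i] = [chosen]
    return cands

def tag(words, model, rules=None):
    """Statistical stage: bigram Viterbi over the remaining candidates."""
    if not words:
        return []
    cands = candidate_tags(words, model, rules or {})
    n_tags, vocab = len(model["tags"]), len(model["lexicon"])

    def log_t(prev, t):  # add-one smoothed P(t | prev)
        return math.log((model["trans"][(prev, t)] + 1) / (model["ctx"][prev] + n_tags))

    def log_e(t, w):  # add-one smoothed P(w | t), one extra slot for unknown words
        return math.log((model["emit"][(t, w)] + 1) / (model["tag_count"][t] + vocab + 1))

    best = {START: (0.0, [])}  # tag -> (log score, best path ending in tag)
    for word, options in zip(words, cands):
        step = {}
        for t in options:
            score, path = max(
                ((s + log_t(p, t) + log_e(t, word), pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[t] = (score, path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

model = train(TRAIN)
print(tag(["我", "学习", "语言"], model, RULES))  # ['r', 'v', 'n']
print(tag(["他", "的", "学习"], model, RULES))    # ['r', 'u', 'n']
```

In the first sentence no rule applies, so the bigram statistics decide that 学习 is a verb. In the second, the rule fires on the tag set {n, v} after `u` and fixes the noun reading before the statistical search runs. Keying rules on tag sets means that one rule covers every word sharing that ambiguity, which is one plausible reading of the rule-base improvement mentioned in the abstract.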
