Research on Automatic Term Extraction in Library, Information and Archival Science
Abstract
Existing studies of automatic term extraction in China have not taken the abstracts of journal papers as their corpus. Yet an abstract, as the summary of a paper, contains many terms from its subject field and should serve as an important corpus for term extraction research. This thesis therefore studies automatic term extraction from the abstracts of journal papers in library science, information science and archival science, using Mutual Information (MI) and Conditional Random Fields (CRFs) in turn.
     The thesis first introduces the background and significance of the research, summarizes the current state of research on automatic term extraction, establishes the basis of the study, and outlines the structure of the thesis. Chapter 2 introduces the concepts related to terms and the characteristics of terms, including domain features and structural features.
     Chapter 3 presents a statistical analysis of the surface features of terms, of synonymous terms, and of the words at term boundaries. The surface features comprise term frequency, the part-of-speech sequences of terms, and the frequencies of those parts of speech. Synonymous terms are identified using edit distance. Boundary words are obtained by counting the words that occur immediately before or after a term. These statistics provide data for a quantitative, linguistically grounded study of the internal structure of terms, and they supply linguistic knowledge for the later extraction experiments.
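The edit-distance step can be illustrated with a minimal sketch. It assumes plain Levenshtein distance over characters and a hypothetical cut-off `max_distance`; the abstract does not say which variant or threshold the thesis used, and the function names are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def synonym_candidates(terms, max_distance=1):
    """Pairs of distinct terms whose edit distance is at most max_distance.
    max_distance is an assumed parameter, not a value from the thesis."""
    unique = sorted(set(terms))
    return [(s, t)
            for i, s in enumerate(unique)
            for t in unique[i + 1:]
            if edit_distance(s, t) <= max_distance]
```

For instance, `synonym_candidates(["e-book", "ebook", "archive"])` pairs "e-book" with "ebook", which differ by one deleted character. Pairs found this way are only candidates: a small edit distance does not by itself establish synonymy, so manual checking would still be needed.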
     The thesis then studies term extraction based on Mutual Information. It introduces MI theory and the preprocessing of the corpus. The experiments take bigram and trigram candidates as their objects, use the MI formula to compute the degree of association within each candidate, set a range of thresholds, and tally the results. Because the results of the first experiment were unsatisfactory, the corpus was processed further. In the second experiment accuracy rose substantially, with the highest values for bigram and trigram candidates reaching 58.555% and 58.814% respectively. Despite this improvement the results remain unsatisfactory, which the thesis attributes to the inherent limitations of purely statistical methods.
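The MI scoring and thresholding described above can be sketched as follows. The sketch assumes the standard pointwise form, log2 of p(x, y) / (p(x) p(y)), estimated from raw counts over pre-segmented sentences; the exact formula, preprocessing and threshold values used in the thesis are not given in the abstract, so `bigram_pmi`, `extract_candidates` and `threshold` are illustrative names.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Map each adjacent word pair (x, y) to log2(p(x, y) / (p(x) * p(y))).
    sentences is a list of word lists (already segmented)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(sentences, threshold):
    """Keep the pairs whose score reaches the (assumed) threshold."""
    return {pair for pair, score in bigram_pmi(sentences).items()
            if score >= threshold}
```

On the toy input `[["information", "retrieval", "system"], ["information", "retrieval", "model"], ["the", "system", "model"]]` with a threshold of 2.0, the pair ("information", "retrieval") is kept, but so is ("the", "system"), which occurs only once. Pairs containing rare words tend to receive high scores under this measure, which is one commonly cited weakness of using it alone.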
     Finally, the thesis studies term extraction based on Conditional Random Fields. It introduces the CRF model, the preprocessing of the corpus, and the selection of features and feature templates. Experiments were run on character-based and word-based corpora using an atomic feature template, a template with added part-of-speech features, and a template with added linguistic features. The average F-scores of the four rounds of experiments were 91.927%, 90.311%, 90.681% and 90.6818% respectively. These results indicate that CRF-based term extraction outperforms the MI-based method.
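To make the character-based setup concrete, the sketch below shows one common way to prepare sequence-labeling data for a CRF and to compute an F-score. Everything in it is an assumption made for illustration: the B/I/O tag set, the window of offsets -1, 0 and +1 standing in for an atomic feature template, and the balanced F1 formula. The abstract does not specify the thesis's tag set, templates or F-measure weighting, and no CRF training is shown.

```python
def bio_labels(chars, term_spans):
    """Label each character B (term start), I (inside a term) or O (outside).
    term_spans holds (start, end) index pairs with end exclusive."""
    labels = ["O"] * len(chars)
    for start, end in term_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def window_features(chars, i, offsets=(-1, 0, 1)):
    """Atomic features: the character found at each offset around position i.
    Positions outside the sequence are filled with a padding symbol."""
    feats = {}
    for off in offsets:
        j = i + off
        feats[f"char[{off}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With `chars = list("研究信息检索模型")` and the term span `(2, 6)` covering 信息检索, `bio_labels` yields `["O", "O", "B", "I", "I", "I", "O", "O"]`. A word-based corpus would be handled the same way with words in place of characters, and part-of-speech or other linguistic features could be added as further keys in the feature dictionary.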
