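Study on Key Phrase Extraction Technology for Scientific and Technical Essays

Original title: 科技论文关键词抽取技术的研究
Author: Yan Chunfeng (严春风)
Degree level: Master's
Discipline: Computer Technology
Chinese keywords: 关键词抽取 ; PAT-Tree ; 互信息 ; 同义词
English keywords: key phrase extraction ; PAT-Tree ; mutual information ; synonym
Degree year: 2009
Supervisor: Yao Jianmin (姚建民)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Submission date: 2009-10-01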

Abstract

Using Wanfang Data and conference proceedings as test corpora, this thesis presents a PAT-Tree-based key phrase extraction method and the application of HowNet to key phrase extraction. It first verifies experimentally several characteristic features of key phrases and reviews the commonly used filtering methods. It then introduces mutual information and the PAT-Tree, a data structure that supports fast and convenient string frequency counting over the full text. On this basis it proposes a PAT-Tree-based extraction method that relies on statistics gathered from the raw text and extracts the strings that satisfy the filtering criteria. The process has four stages: preprocessing the text; building a PAT-Tree over the preprocessed text to obtain term frequency information; extracting candidate key phrases from the PAT-Tree; and filtering the candidates and selecting the final key phrases. The emphasis is on automatically filtering the strings that meet the statistical criteria so as to refine the candidate set. The refinement step adopts new filtering techniques and borrows the strengths of other methods, forming a combined filtering procedure that effectively improves precision and reduces the amount of computation. A further feature of this work is that, because conference proceedings are domain corpora, a divide-and-conquer strategy is used to handle the intensive computation and build the PAT-Tree efficiently. This makes it easier to extract domain key phrases and also allows key phrase extraction to be implemented with distributed computing, leaving room to scale up processing capacity. Experimental results show that the method extracts key phrases efficiently, performs particularly well on domain key phrases, and achieves the intended goal.
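The statistical core of the stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: a sorted list of suffixes stands in for the PAT-Tree (it answers the same "how often does this string occur" query but uses far more memory), the divide-and-conquer step is reduced to merging per-document indexes, and the function names, thresholds (`min_freq`, `min_mi`) and the exact mutual information formula are all hypothetical choices made for the example.

```python
import heapq
import math
from bisect import bisect_left


def build_suffix_index(text):
    """Sorted list of all suffixes of `text` (a simple stand-in for a PAT-Tree)."""
    return sorted(text[i:] for i in range(len(text)))


def merge_indexes(*indexes):
    """Divide and conquer: merge per-document sorted indexes into one sorted index."""
    return list(heapq.merge(*indexes))


def frequency(index, s):
    """Occurrences of `s`: the suffixes that start with `s` form one contiguous run."""
    start = bisect_left(index, s)
    end = start
    while end < len(index) and index[end].startswith(s):
        end += 1
    return end - start


def mutual_information(index, left, right):
    """log2(P(left+right) / (P(left) * P(right))), with P(s) = frequency(s) / N."""
    total = len(index)  # one suffix per character position
    joint = frequency(index, left + right)
    if joint == 0:
        return float("-inf")
    p_joint = joint / total
    p_left = frequency(index, left) / total
    p_right = frequency(index, right) / total
    return math.log2(p_joint / (p_left * p_right))


def candidates(index, max_len=4, min_freq=2, min_mi=1.0):
    """Strings of 2..max_len characters that are frequent and cohesive at every split."""
    seen = set()
    for suffix in index:
        for n in range(2, min(max_len, len(suffix)) + 1):
            seen.add(suffix[:n])
    result = []
    for s in sorted(seen):
        if frequency(index, s) < min_freq:
            continue
        weakest = min(
            mutual_information(index, s[:k], s[k:]) for k in range(1, len(s))
        )
        if weakest >= min_mi:
            result.append(s)
    return result
```

Each document is indexed independently with `build_suffix_index`, so that work could be spread across machines before `merge_indexes` combines the results. A real PAT-Tree stores positions rather than full suffixes; the sorted list is used here only to keep the sketch short.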
Finally, HowNet is introduced to compute the similarity between synonyms. This addresses two problems: synonyms appearing together in the key phrase set, and terms being excluded from the key phrase set because of synonymy.
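One way to picture the synonym step is as a merge over candidate frequencies. The sketch below is only an illustration: the abstract does not say how similarity scores are applied, so the greedy grouping, the `threshold` value and the toy similarity table are all assumptions, and `toy_similarity` is a hypothetical placeholder for a HowNet-based similarity measure rather than a real HowNet API.

```python
# Hypothetical similarity table standing in for HowNet-based scores.
TOY_SIMILARITY = {frozenset(("计算机", "电脑")): 1.0}


def toy_similarity(a, b):
    """Placeholder similarity in [0, 1]; a real system would query HowNet here."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def merge_synonyms(freqs, similarity, threshold=0.9):
    """Fold each term into the first more frequent term that is similar enough."""
    groups = {}  # representative term -> combined frequency
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    for term, count in ordered:
        for representative in groups:
            if similarity(representative, term) >= threshold:
                groups[representative] += count
                break
        else:
            groups[term] = count
    return groups


merged = merge_synonyms({"计算机": 5, "电脑": 4, "网络": 6}, toy_similarity)
```

Under this reading, merging removes duplicate synonyms from the final set and lets a concept whose occurrences are split between two surface forms compete with its combined frequency.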

