Research on Keyword Extraction Strategies Based on Term Networks
Abstract
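Keywords are the terms that express the central content of a document, that a computer system uses to index the characteristic content of a paper, and that information systems collect so that readers can search by them. Keyword extraction is a branch of text mining and underlies document retrieval, document comparison, summary generation, and document classification and clustering.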
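     Keyword extraction algorithms fall into two classes: strategies that rely on a training set and strategies that need none. Training-based methods treat keyword extraction as a classification problem: each term in the document is assigned to the keyword class or the non-keyword class, and a number of terms from the keyword class are then selected as keywords. This class of algorithm was first proposed by Peter D. Turney, and the technique has become increasingly mature.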
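     Algorithms that need no training set fall into four categories: statistical methods, such as frequency counting; term-graph methods, such as KeyGraph; term-network methods, such as the betweenness centrality (BC) index; and methods based on small-world networks (SWN). All four build on term-frequency statistics. Statistical methods are simple and fast and extract high-frequency terms, but they overlook terms that matter to the document yet occur infrequently, so the extracted keywords are one-sided. Term-graph methods require too many parameters, such as the number of vertices and the number of edges, which often forces cutoff decisions at the boundaries and harms the stability and precision of the algorithm. SWN-based methods use average path length as the extraction criterion, but SWN theory presupposes a connected graph, so for a disconnected document structure graph they can neither measure the importance of vertices nor extract the document's keywords correctly.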
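     This thesis studies keyword extraction algorithms based on term networks. Building on an analysis of existing term-network algorithms and addressing their shortcomings, it proposes a new term-network keyword extraction strategy for English documents that uses a node-deletion index to measure the importance of each vertex (term). The extracted keywords include not only high-frequency words and phrases but also words and phrases that occur infrequently yet contribute strongly to the document's central content.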
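     The node-deletion idea can be illustrated with a minimal sketch. The abstract does not define the node-deletion index, so the code assumes one plausible form: a term's score is the connectivity (sum of reciprocal shortest-path distances) that the remaining terms lose when that term is deleted. The network construction (linking adjacent words), the function names, and the toy sentences are likewise hypothetical and are not taken from the thesis.

```python
from collections import Counter, deque

def build_word_network(sentences):
    """Assumed construction: link words that are adjacent in a sentence."""
    graph = {}
    for sentence in sentences:
        for word in sentence:
            graph.setdefault(word, set())
        for left, right in zip(sentence, sentence[1:]):
            if left != right:
                graph[left].add(right)
                graph[right].add(left)
    return graph

def pair_closeness(graph, target, target_present):
    """Sum of 1/distance over ordered pairs of nodes other than `target`.

    When `target_present` is False, paths may not pass through `target`.
    Unreachable pairs contribute nothing.
    """
    total = 0.0
    for source in graph:
        if source == target:
            continue
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor in dist or (neighbor == target and not target_present):
                    continue
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
        total += sum(1.0 / d for node, d in dist.items() if d > 0 and node != target)
    return total

def deletion_scores(graph):
    """Assumed index: connectivity lost by the other words when a word is deleted."""
    return {
        word: pair_closeness(graph, word, True) - pair_closeness(graph, word, False)
        for word in graph
    }

# Toy document: "bridge" occurs once but is the only link between two clusters.
sentences = [
    ["graph", "node", "graph", "edge"],
    ["text", "word", "text", "term"],
    ["graph", "bridge", "text"],
]
frequency = Counter(word for sentence in sentences for word in sentence)
network = build_word_network(sentences)
scores = deletion_scores(network)
ranking = sorted(scores, key=scores.get, reverse=True)
for word in ranking:
    print(f"{word:8s} freq={frequency[word]} score={scores[word]:.3f}")
```

     In this toy network "bridge" occurs only once, the same as "node" or "term", yet only "graph" and "text" score higher, because deleting it disconnects the two clusters. A pure frequency count cannot separate it from the other single-occurrence words.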
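     The experimental data come from the test sets used for the KEA and Extractor algorithms, together with papers from academic journals and e-books provided by Springer of Germany, one of the world's best-known scientific publishing groups. Taking the keywords supplied by the papers' authors as the reference, average precision and average recall are used to measure extraction quality. Comparing the results of the proposed algorithm with those of the TF and BC algorithms demonstrates its correctness and effectiveness.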
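     The evaluation measure can be sketched as follows, assuming that "average precision" and "average recall" mean the per-document precision and recall averaged over all test documents, and that an extracted keyword counts as correct only on an exact, case-insensitive match with an author keyword. The function names and sample keywords are illustrative, not data from the thesis.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted keywords against author keywords."""
    extracted_set = {keyword.lower() for keyword in extracted}
    reference_set = {keyword.lower() for keyword in reference}
    hits = len(extracted_set & reference_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall

def mean_precision_recall(documents):
    """Average the per-document scores over (extracted, reference) pairs."""
    pairs = [precision_recall(extracted, reference) for extracted, reference in documents]
    mean_precision = sum(p for p, _ in pairs) / len(pairs)
    mean_recall = sum(r for _, r in pairs) / len(pairs)
    return mean_precision, mean_recall

# Made-up example: two documents, each with extracted and author keywords.
documents = [
    (["term network", "keyword", "graph", "index"], ["keyword", "term network"]),
    (["text mining", "cluster"], ["text mining", "classification", "retrieval", "summary"]),
]
print(mean_precision_recall(documents))
```

     Exact matching is the strictest choice. Matching on stemmed forms would credit near-misses such as "keyword" against "keywords", but the abstract does not say which rule was used.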
Since 1990, with the advent of the Internet, the volume of text documents available online, such as e-mails, web pages, and digital books, has grown tremendously. To make more effective use of these documents, there is an increasing need for tools that process text. To meet this need, a number of products for analyzing text documents have been developed. The techniques involved in document analysis have formed an exciting new research area, often called text mining.
     Keyword extraction plays a very important role in text mining, because keywords serve a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyword extraction is to select keywords from the text of a given document. Automatic keyword extraction makes it feasible to generate keywords for the huge number of documents that have no manually assigned keywords.
     Previous approaches to keyword extraction fall into two groups. (1) Supervised classification: Turney was the first to approach automatic keyword extraction from text as a supervised learning task. He treats a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keywords. The performance has been satisfactory for a wide variety of applications. (2) Unsupervised classification: these keyword extraction algorithms apply to a single document without using a corpus. Examples are methods based on term frequency, SWN, the term graph, and the term network.
     Based on an analysis of existing keyword extraction methods that use a term network, an effective algorithm is proposed that extracts not only high-frequency terms but also important low-frequency terms. It is based on the term network and a node-deletion index. The experimental results support this conclusion.