Automatic keyword extraction method based on gravitational model

Original title (Chinese): 基于万有引力模型的关键词自动抽取方法
Authors: LI Huan (李欢); LYU Xue-qiang (吕学强); LI Bao-an (李宝安); XU Li-ping (徐丽萍)
Keywords: gravitational model; word frequency-document distribution entropy; keyword extraction; word correlation degree; dependency syntax distance
Journal code: SJSJ
Journal: Computer Engineering and Design
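Institutions: Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University; Beijing Research Center of Urban System Engineering
Publication date: 2019-04-16
Publisher: Computer Engineering and Design (计算机工程与设计)
Year: 2019
Issue: v.40; No.388
Funding: National Natural Science Foundation of China (61671070); Major Program of the National Social Science Foundation of China (15ZDB017); Major Project of the State Language Commission (ZDA125-26); Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016003)
Language: Chinese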
Article ID: SJSJ201904031
Page count: 8
Issue number: 04
CN: 11-1775/TP
Pages: 198-205

Abstract

To address the poor keyword extraction performance of the traditional gravitational model, which stems from inadequate measures of word mass and inter-word distance, the model is improved in two respects: the representation of word mass and the calculation of distance. A general word list is built using a method based on word frequency-document distribution entropy and used to filter candidate words; position, part-of-speech, and word-length features are then combined to improve TF-IDF, which gives each word's external importance. A co-occurrence network graph is constructed, and each word's internal importance is measured by its word correlation degree; internal and external importance are fused into the word mass, which gives the graph nodes differentiated initial weights. Dependency syntax distance is introduced on top of semantic distance, and the gravitational force between words is computed as the edge weight. After multiple iterations, the words are ranked and the TopK keywords are output. Experimental results show that the method outperforms traditional methods on 3GPP technical specifications and on the public SemEval2010 and DUC2001 datasets, which supports its effectiveness and generality.
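
The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the following are assumptions rather than the authors' method: the edge weight takes the Newtonian form F = m_i * m_j / d^2, the iteration is a TextRank-style update, and the word masses and distances are supplied as precomputed inputs. The function name `gravity_rank` and its parameters are hypothetical.

```python
from itertools import combinations


def gravity_rank(mass, distance, damping=0.85, iterations=50, top_k=3):
    """Rank words by a TextRank-style iteration over gravity-weighted edges.

    mass:     dict word -> positive mass (assumed precomputed word importance)
    distance: dict frozenset({w1, w2}) -> positive distance between linked words
    """
    words = list(mass)

    # Edge weight: Newton-style attraction F = m_i * m_j / d^2 (assumed form).
    force = {}
    for a, b in combinations(words, 2):
        d = distance.get(frozenset((a, b)))
        if d:  # only word pairs that are linked in the graph
            force[(a, b)] = force[(b, a)] = mass[a] * mass[b] / d ** 2

    # Total attraction leaving each word, used to normalise its outgoing edges.
    out_sum = {w: sum(f for (src, _), f in force.items() if src == w) for w in words}

    # Differentiated initial node weights: normalised masses instead of uniform.
    total = sum(mass.values())
    base = {w: mass[w] / total for w in words}
    score = dict(base)

    for _ in range(iterations):
        new_score = {}
        for w in words:
            incoming = sum(
                force[(v, w)] / out_sum[v] * score[v]
                for v in words
                if (v, w) in force
            )
            new_score[w] = (1 - damping) * base[w] + damping * incoming
        score = new_score

    return sorted(words, key=score.get, reverse=True)[:top_k]


# Toy input: masses and distances are made-up illustrative values.
toy_mass = {"gravity": 3.0, "model": 2.0, "keyword": 2.5, "the": 0.2}
toy_distance = {
    frozenset(("gravity", "model")): 1.0,
    frozenset(("gravity", "keyword")): 2.0,
    frozenset(("model", "keyword")): 2.0,
    frozenset(("the", "gravity")): 3.0,
}
top_words = gravity_rank(toy_mass, toy_distance)
```

In this sketch a low-mass word such as "the" attracts its neighbours only weakly, so it receives little score even when it is linked to a heavy word. In the described method the mass would come from the fused internal and external importance, and the distance from the combined semantic and dependency syntax distances.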

