Abstract
Traditional gravitational models extract keywords poorly because their measures of word mass and inter-word distance are inadequate. To address this, the traditional universal gravitation model is improved in two respects: the representation of word mass and the calculation of distance. A general word list is built using a method based on term frequency and document distribution entropy. After the candidate words are filtered with this list, the TF-IDF method is improved by combining position, part-of-speech, and word-length features, and the result is used as the external importance of a word. A co-occurrence network graph is then constructed, and the internal importance of a word is measured by its degree of association with other words. Internal and external importance are fused to give the word mass, which assigns differentiated initial weights to the graph nodes. Dependency syntactic distance is introduced on top of semantic distance, and the gravitational force between words is computed as the edge weight. After multiple iterations, the words are ranked and the top K keywords are output. Experimental results show that the method outperforms traditional methods on 3GPP technical specifications and on the public SemEval2010 and DUC2001 datasets, which verifies its effectiveness and generality.
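The ranking step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact update rule, so a weighted PageRank-style iteration is assumed here, and the function name `gravity_rank`, the parameters `damping` and `iterations`, and the inputs `mass` and `distance` (word masses and inter-word distances computed beforehand) are all hypothetical.

```python
from collections import defaultdict


def gravity_rank(mass, distance, damping=0.85, iterations=50, top_k=3):
    """Rank words on a graph whose edges carry gravitational attraction.

    mass:     {word: mass}, hypothetical precomputed word masses (> 0)
    distance: {(word_a, word_b): d}, hypothetical distances (> 0) for
              co-occurring word pairs
    """
    # Edge weight: Newton-style attraction F = m_a * m_b / d^2
    # (the constant G is omitted because it does not affect the ranking).
    force = defaultdict(dict)
    for (a, b), d in distance.items():
        f = mass[a] * mass[b] / (d * d)
        force[a][b] = f
        force[b][a] = f

    total_mass = sum(mass.values())
    # Differentiated initial node weights: each word starts at its normalized mass.
    score = {w: m / total_mass for w, m in mass.items()}
    # Total attraction leaving each node, used to normalize outgoing edges.
    out_sum = {w: sum(force[w].values()) for w in mass}

    # Assumed update rule: weighted PageRank with a mass-proportional restart.
    for _ in range(iterations):
        new_score = {}
        for w in mass:
            incoming = sum(
                score[n] * f / out_sum[n] for n, f in force[w].items()
            )
            new_score[w] = (1 - damping) * mass[w] / total_mass + damping * incoming
        score = new_score

    return sorted(score, key=score.get, reverse=True)[:top_k]
```

A call such as `gravity_rank({"gravity": 3.0, "model": 2.0}, {("gravity", "model"): 1.0}, top_k=1)` returns the highest-scoring words in descending order. In the method summarized above, the masses would come from fusing external and internal importance, and the distances from combining semantic and dependency syntactic distance.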