Study on an Improved Keyword Extraction Algorithm

Original title (Chinese): 改进的关键词提取算法研究
Authors: WANG Tao (王涛); LI Ming (李明)
Author affiliation: School of Computer and Information Sciences, Chongqing Normal University
Keywords: word vector (词向量); TextRank; graph model (图模型); LDA
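Chinese journal name: CQSF
English journal name: Journal of Chongqing Normal University (Natural Science)
Institution: School of Computer and Information Sciences, Chongqing Normal University
Publication date: 2019-05-09 19:30
Publisher: Journal of Chongqing Normal University (Natural Science Edition)
Year: 2019
Issue: v.36; No.167
Funding: Chongqing Municipal Education Commission Teaching Reform Project (No.092055); Chongqing Municipal Education Commission Science and Technology Project (No.kj098820)
Language: Chinese
Page: CQSF201903012
Page count: 7
CN: 03
ISSN: 50-1165/N
Classification number: 103-109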

Abstract

[Purposes] To study how word topic information and word similarity information affect keyword extraction, an improved TextRank keyword extraction method is proposed. [Methods] First, the document is modeled with the Latent Dirichlet Allocation (LDA) topic model to compute word topic information. Second, FastText is used to generate word vectors, from which a word similarity matrix is computed. Finally, a combined weight that fuses word topic information and word similarity information is used to optimize the initial weights of the word nodes in TextRank, after which the word graph model is iterated and keywords are extracted. [Findings] Experiments show that the improved method produces better extraction results than the traditional method. [Conclusions] The results indicate that taking into account the global nature of word topic information and the local nature of word similarity information effectively improves the keyword extraction performance of the TextRank algorithm.
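
The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the fusion formula, so a linear blend controlled by a hypothetical parameter `alpha` is assumed. It also does not say how the initial weights enter the iteration, so the fused weight is assumed to serve both as the starting score and as the restart distribution of the random walk. The topic scores and word vectors are toy values standing in for the output of an LDA model and a FastText model, and all function names are illustrative.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def fused_node_weights(topic_scores, similarity, alpha=0.5):
    """Blend a global topic score with a local similarity score per word.

    Assumption: a linear blend; the abstract only says the two are fused.
    """
    topic = topic_scores / topic_scores.sum()
    # Local score: summed non-negative similarity to all *other* words.
    off_diagonal = similarity - np.diag(np.diag(similarity))
    local = np.clip(off_diagonal, 0.0, None).sum(axis=1)
    local = local / local.sum()
    return alpha * topic + (1.0 - alpha) * local


def weighted_textrank(adjacency, prior, damping=0.85, tol=1e-8, max_iter=200):
    """TextRank iteration that starts from, and restarts to, `prior`."""
    adjacency = np.asarray(adjacency, dtype=float)
    out_weight = adjacency.sum(axis=1, keepdims=True)
    # Row-normalise the edge weights; rows without edges stay all zero.
    transition = np.divide(adjacency, out_weight,
                           out=np.zeros_like(adjacency),
                           where=out_weight > 0)
    scores = prior.copy()
    for _ in range(max_iter):
        updated = (1.0 - damping) * prior + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Toy inputs (all values are made up for illustration).
words = ["keyword", "extraction", "topic", "graph", "the"]
topic_scores = np.array([0.30, 0.25, 0.25, 0.15, 0.05])  # stand-in for LDA
vectors = np.array([                                      # stand-in for FastText
    [1.0, 0.2, 0.0],
    [0.9, 0.3, 0.1],
    [0.1, 1.0, 0.2],
    [0.2, 0.8, 0.5],
    [0.0, 0.1, 1.0],
])
adjacency = np.array([            # symmetric co-occurrence counts in a window
    [0, 3, 1, 1, 1],
    [3, 0, 1, 0, 1],
    [1, 1, 0, 2, 0],
    [1, 0, 2, 0, 1],
    [1, 1, 0, 1, 0],
], dtype=float)

similarity = cosine_similarity_matrix(vectors)
prior = fused_node_weights(topic_scores, similarity, alpha=0.5)
scores = weighted_textrank(adjacency, prior)
top_keywords = [words[i] for i in np.argsort(-scores)[:3]]
```

The restart term matters in this sketch. A plain TextRank iteration that runs to convergence forgets its starting scores, so node weights that are only used for initialisation would not change the final ranking. Feeding the fused weight into the restart term is one way to let topic and similarity information persist across iterations.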

