Keyword extraction method based on word vector and TextRank

Chinese title: 基于词向量与TextRank的关键词提取方法
Authors: Zhou Jinzhang (周锦章); Cui Xiaohui (崔晓晖) (School of Cyber Science & Engineering, Wuhan University)
Keywords: keyword extraction; semantic difference; TextRank; word vector; implied subject distribution
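Journal name (Chinese): JSYJ
Journal name (English): Application Research of Computers
Institution: School of Cyber Science & Engineering, Wuhan University
Publication date: 2018-03-14 17:30
Publisher: Application Research of Computers (计算机应用研究)
Year: 2019
Issue: v.36; No.330
Funding: Fundamental Research Funds for the Central Universities (2042017gf0035)
Language: Chinese
Page: JSYJ201904021
Page count: 4
CN: 04
ISSN: 51-1196/TP
Classification number: 97-100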

Abstract

This paper studies how semantic differences between words affect the TextRank algorithm and proposes a keyword extraction method based on word vectors and TextRank. First, FastText is used to learn word-vector representations from the document collection. Then, drawing on the idea of latent topic distribution and on the semantic differences between words, the transition probability matrix of TextRank is constructed. Finally, the word graph is computed iteratively and the keywords are extracted. Experimental results show that the method extracts keywords markedly better than traditional methods, and that word vectors are a simple and effective way to improve the performance of TextRank.
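
The pipeline summarized in the abstract (word vectors, a transition probability matrix built from semantic differences, then iterative ranking on the word graph) can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so several things here are assumptions: the edge weight 1 - cosine similarity, the co-occurrence window of 2, the damping factor 0.85, and the hand-written 3-dimensional toy vectors that stand in for FastText embeddings. The function names build_transition and textrank are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_transition(tokens, vectors, window=2):
    """Build row-normalized transition probabilities over a co-occurrence graph.

    Assumption (not from the paper): the edge between two co-occurring words
    is weighted by their semantic difference, 1 - cosine similarity, plus a
    tiny constant so that identical vectors still leave the edge in the graph.
    """
    weights = defaultdict(dict)
    for i, word in enumerate(tokens):
        # Link each word to the next `window` tokens with an undirected edge.
        for other in tokens[i + 1 : i + 1 + window]:
            if other == word:
                continue
            diff = 1.0 - cosine(vectors[word], vectors[other]) + 1e-6
            weights[word][other] = diff
            weights[other][word] = diff
    vocab = sorted(set(tokens))
    trans = {}
    for word in vocab:
        total = sum(weights[word].values())
        trans[word] = {o: w / total for o, w in weights[word].items()} if total else {}
    return vocab, trans


def textrank(vocab, trans, damping=0.85, iters=100, tol=1e-8):
    """Power iteration: score(j) = (1 - d)/N + d * sum_i score(i) * P(i -> j)."""
    n = len(vocab)
    score = {w: 1.0 / n for w in vocab}
    for _ in range(iters):
        new = {w: (1.0 - damping) / n for w in vocab}
        for i in vocab:
            if not trans[i]:
                # A word with no neighbours spreads its score uniformly.
                for j in vocab:
                    new[j] += damping * score[i] / n
            for j, p in trans[i].items():
                new[j] += damping * score[i] * p
        delta = sum(abs(new[w] - score[w]) for w in vocab)
        score = new
        if delta < tol:
            break
    return score


# Toy input: a tokenized document and made-up vectors standing in for FastText.
tokens = ["keyword", "extraction", "graph", "keyword", "ranking", "graph"]
vectors = {
    "keyword": [0.9, 0.1, 0.0],
    "extraction": [0.8, 0.3, 0.1],
    "graph": [0.1, 0.9, 0.2],
    "ranking": [0.2, 0.7, 0.6],
}

vocab, trans = build_transition(tokens, vectors)
scores = textrank(vocab, trans)
for word, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{value:.4f}")
```

In a real setting the toy dictionary would be replaced by vectors trained on the document collection, and the top-scoring words would be returned as keywords. Only the construction of the transition probabilities differs from plain TextRank, which is consistent with the abstract's description of the change as a simple one.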

References

[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3(1): 993-1022.
[2] Li Juanzi, Fan Qina, Zhang Kuo. Keyword extraction based on TF/IDF for Chinese news document[J]. Wuhan University Journal of Natural Sciences, 2007, 12(5): 917-921.
[3] Mihalcea R, Tarau P. TextRank: bringing order into text[C]//Proc of Conference on Empirical Methods in Natural Language Processing. 2004.
[4] Liu Jun, Zou Dongsheng, Xing Xinlai, et al. Keyphrase extraction based on topic feature[J]. Application Research of Computers, 2012, 29(11): 4224-4227. (in Chinese)
[5] Luo Yan, Zhao Shuliang, Li Xiaochao, et al. Text keyword extraction method based on word frequency statistics[J]. Journal of Computer Applications, 2016, 36(3): 718-725. (in Chinese)
[6] Geng Huantong, Cai Qingsheng, Yu Kun, et al. A kind of automatic text keyphrase extraction method based on word co-occurrence[J]. Journal of Nanjing University: Natural Science, 2006, 42(2): 156-162. (in Chinese)
[7] Gu Yijun, Xia Tian. Study on keyword extraction with LDA and TextRank combination[J]. New Technology of Library and Information Service, 2014, 30(7): 41-47. (in Chinese)
[8] Xia Tian. Extracting keywords with modified TextRank model[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 28-34. (in Chinese)
[9] Li Peng, Wang Bin, Shi Zhiwei, et al. Tag-TextRank: a Webpage keyword extraction method based on tag[J]. Journal of Computer Research and Development, 2012, 49(11): 2344-2351. (in Chinese)
[10] Li Yuepeng, Jin Cui, Ji Junchuan. A keyword extraction algorithm based on word2vec[J]. e-Science Technology & Application, 2015, 6(4): 54-59. (in Chinese)
[11] Jiang Fang, Li Guohe, Yue Xiang. Semantic-based keyword extraction method for document[J]. Application Research of Computers, 2015, 32(1): 142-145. (in Chinese)
[12] Ortega F J, Troyano J A, Galán F J, et al. Str: a graph-based tagging technique[J]. International Journal on Artificial Intelligence Tools, 2011, 20(5): 955-967.
[13] Cruz F, Troyano J A, Enríquez F. Supervised TextRank[C]//Proc of the 5th International Conference on Advances in Natural Language Processing. Berlin: Springer, 2006: 632-639.
[14] Brin S, Page L. Reprint of: the anatomy of a large-scale hypertextual Web search engine[J]. Computer Networks, 2012, 56(18): 3825-3833.
[15] Bojanowski P, Grave E, Joulin A, et al. Enriching word vectors with subword information[EB/OL]. (2016-07-15). http://arxiv.org/abs/1607.04606.
[16] Joulin A, Grave E, Bojanowski P, et al. Bag of tricks for efficient text classification[EB/OL]. (2016-07-06)[2016-08-09]. https://arxiv.org/abs/1607.01759.
