Keyword Extraction Method Based on Spark GraphX Framework

Original title (Chinese): 一种Spark GraphX框架下的关键词抽取方法
Author: CHENG Chuan-peng (程传鹏)
Affiliation: School of Computer Science, Zhongyuan Institute of Technology (中原工学院计算机学院)
Keywords: Spark; GraphX; keyword extraction; graph ranking; word weight
Journal code: XXWX
Journal: Journal of Chinese Computer Systems (小型微型计算机系统)
Publication date: 2019-02-15
Publisher: Journal of Chinese Computer Systems (小型微型计算机系统)
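Year: 2019
Volume: 40
Funding: Science and Technology Research Project of the Henan Provincial Department of Science and Technology (172102210594); Key Scientific Research Project of Higher Education Institutions of Henan Province (17A520066)
Language: Chinese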
Record ID: XXWX201902017
Page count: 4
Issue: 02
CN: 21-1106/TP
Pages: 90-93

Abstract

The TextRank algorithm builds a graph from the positional relationships between the words of a text and applies a graph-ranking algorithm to compute each word's weight. The computation requires many iterations, so its running time becomes considerable on large data sets. To address this problem, a keyword extraction method based on Spark GraphX is proposed. Using the distributed graph-computation framework provided by Spark GraphX, the text graph data is stored across different nodes, and keyword extraction is carried out efficiently. Experiments show that the proposed method not only has a short computation time but also extracts keywords that are very close to manually annotated results, suggesting that the method is reasonable and feasible.
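
The TextRank procedure summarized in the abstract (build a word graph from positional co-occurrence, then iterate a PageRank-style update to obtain word weights) can be illustrated with a minimal single-machine sketch. This is not the paper's Spark GraphX implementation: the function names, the window size, the damping factor of 0.85, and the convergence tolerance are all assumptions chosen for illustration, and the graph here is undirected and unweighted.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Link each word to the words that follow it within `window` positions.

    `window` is an illustrative assumption, not a value taken from the paper.
    """
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, max_iter=100, tol=1e-6):
    """Iterate a PageRank-style update on an undirected, unweighted graph."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its current score equally among its edges.
            rank_sum = sum(scores[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_scores[node] = (1 - damping) + damping * rank_sum
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


def extract_keywords(words, top_k=3, window=2):
    """Return the `top_k` highest-scoring words as keyword candidates."""
    graph = build_cooccurrence_graph(words, window)
    scores = textrank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Every pass of the loop in `textrank` visits all nodes, which is the iterative cost the abstract refers to. In a distributed setting such as GraphX, the same per-node update would presumably be evaluated in parallel over partitions of the graph rather than in one Python loop.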

