Abstract
As data leakage incidents have grown more frequent in recent years, protecting data has become increasingly important, and precise protection depends first on identifying the data accurately. The traditional approach extracts keywords with TF-IDF, but TF-IDF does not take the contextual semantics of the text into account, so content situational awareness built on TF-IDF keywords alone performs poorly. This paper proposes combining Word2vec with the TF-IDF algorithm to extract a keyword set from a reference corpus, and then using that keyword set for content situational awareness. Experiments indicate that this scheme yields more accurate and comprehensive content situational awareness results.
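The abstract does not spell out how Word2vec and TF-IDF are combined, so the following is only a minimal sketch of one plausible reading: pick seed keywords by TF-IDF, then widen the set with words whose vectors lie close to a seed. The function names, the smoothed IDF formula, the similarity threshold, and the toy two-dimensional vectors (standing in for embeddings that a trained Word2vec model would supply) are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=2):
    """Rank the tokens of one document by TF-IDF against a tokenized corpus."""
    n_docs = len(corpus)
    scores = {}
    for term, count in Counter(doc_tokens).items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (an assumed variant; the paper's exact formula is not given).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, threshold=0.95):
    """Add every vocabulary word whose vector is close to at least one seed."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(s in vectors and cosine(vec, vectors[s]) >= threshold for s in seeds):
            expanded.add(word)
    return expanded


# Toy reference corpus, already tokenized (hypothetical data).
corpus = [
    ["salary", "report", "salary", "employee"],
    ["meeting", "report", "agenda"],
    ["meeting", "agenda", "lunch"],
]

# Hypothetical 2-d vectors standing in for trained Word2vec embeddings.
vectors = {
    "salary": (0.9, 0.1),
    "payroll": (0.85, 0.2),
    "employee": (0.7, 0.4),
    "agenda": (0.1, 0.9),
    "lunch": (0.0, 1.0),
}

seeds = tfidf_keywords(corpus[0], corpus, top_k=2)
keyword_set = expand_keywords(seeds, vectors, threshold=0.95)
print(seeds, sorted(keyword_set))
```

In this toy run the word "payroll" never occurs in the scored document, so TF-IDF alone cannot select it, yet it joins the keyword set because its vector is close to the seeds. That is the kind of contextual relatedness the abstract says plain TF-IDF misses.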