A Content Situational Awareness Method Based on Word2vec
  • Original title: 一种基于Word2vec的内容态势感知方法
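  • Authors: WEI Zhong (魏忠); ZHOU Jun (周俊); SHI Yuan-bing (石元兵); HUANG Ming-hao (黄明浩)
  • Affiliation: Westone Information Industry, Ltd. (卫士通信息产业股份有限公司)
  • Keywords: keyword extraction; Word2vec; word embedding
  • Journal: Communications Technology (通信技术; listed as TXJS)
  • Publisher: Communications Technology (通信技术)
  • Publication date: 2019-05-10
  • Year: 2019
  • Volume and issue: v.52; No.329
  • Funding: "Core Electronic Devices, High-end Generic Chips and Basic Software" (核高基) National Science and Technology Major Project (No. 2017ZX01030-201)
  • Language: Chinese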
  • Record ID: TXJS201905033
  • Page count: 6
  • Issue in volume: 05
  • CN: 51-1167/TN
  • Pages: 204-209
Abstract
Data leakage incidents have become more frequent in recent years, making data protection increasingly important. Protecting data precisely first requires identifying it accurately. The traditional approach extracts keywords with TF-IDF, but TF-IDF ignores the semantic relationships carried by a word's context, so content situational awareness built on TF-IDF keywords alone performs poorly. This paper proposes combining Word2vec with the TF-IDF algorithm to extract a keyword set from a reference corpus, and then using that keyword set for content situational awareness. Experiments show that this scheme yields more accurate and comprehensive content situational awareness results.
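The abstract does not spell out how Word2vec and TF-IDF are combined, so the following is only a minimal sketch of one plausible reading: TF-IDF selects seed keywords, and word vectors then pull in semantically close terms that TF-IDF alone would miss. The toy corpus, the two-dimensional vectors (standing in for trained Word2vec embeddings), the similarity threshold, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=1):
    """Return the top_k TF-IDF terms for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, threshold=0.9):
    """Add every vocabulary word whose vector is close to a seed keyword."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue
        for word, vec in vectors.items():
            if cosine(vectors[seed], vec) >= threshold:
                expanded.add(word)
    return expanded


def hit_count(tokens, keyword_set):
    """Count how many tokens of a document fall inside the keyword set."""
    return sum(1 for t in tokens if t in keyword_set)


# Hypothetical reference corpus, already tokenized.
docs = [
    ["contract", "payment", "contract", "report"],
    ["weather", "report", "rain", "weather"],
]
# Hypothetical 2-d vectors standing in for trained Word2vec embeddings.
vectors = {
    "contract": [1.0, 0.1],
    "agreement": [0.9, 0.2],
    "payment": [0.8, 0.6],
    "weather": [0.0, 1.0],
}

seeds = tfidf_keywords(docs, top_k=1)[0]      # TF-IDF seeds for the first document
keyword_set = expand_keywords(seeds, vectors)  # seeds plus semantically close words
```

In this toy setup a new document containing "agreement" but not "contract" scores zero hits against the TF-IDF seeds alone, but one hit against the expanded set. That mirrors the abstract's claim that adding context semantics gives more comprehensive coverage.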
References
[1] LIU Jun, ZOU Dong-sheng, XING Xin-lai, et al. Keyphrase Extraction Based on Topic Feature[J]. Application Research of Computers, 2012, 29(11): 4224-4227.
[2] LI Yue-peng, JIN Cui, JI Jun-chuan. A Keyword Extraction Algorithm Based on Word2vec[J]. E-science Technology & Application, 2015, 6(4): 54-59.
[3] ZHOU Lian. Exploration of the Working Principle and Application of Word2vec[J]. Sci-Tech Information Development & Economy, 2015, 25(2): 145-148.
[4] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[J]. Computer Science, 2013, 25(05): 213-219.
[5] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013(26): 3111-3119.
[6] XU Wen-hai, WEN You-kui. Chinese Keywords Extraction Based on TFIDF Method[J]. Information Studies: Theory & Application, 2008, 31(2): 298-302.
