A Topic Focus Recognition Method Based on Word Embeddings
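  • English title: Topic Focus Recognition Method Based on Word2vec
  • Authors: 张佩瑶; 刘东苏
  • Authors (English): ZHANG Pei-yao; LIU Dong-su; School of Economics and Management, Xidian University
  • Keywords: online topics; focus feature words; word embeddings; BTM; similarity calculation
  • Keywords (English): online social topics; focus feature; word embeddings; biterm topic model; similarity calculation
  • Journal code: QBKX
  • Journal name (English): Information Science
  • Institution: School of Economics and Management, Xidian University
  • Publication date: 2019-07-01
  • Publisher: 情报科学 (Information Science)
  • Year: 2019
  • Issue: v.37; No.335
  • Funding: National Natural Science Foundation of China, Young Scientists Fund project "Research on Community Detection Algorithms for Large-Scale Dynamic Social Networks" (71401130)
  • Language: Chinese
  • Page: QBKX201907010
  • Number of pages: 5
  • CN: 07
  • ISSN: 22-1264/G2
  • Classification code: 63-66+73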
Abstract
【Purpose/Significance】In the mobile Internet era, microblogs have quickly become one of the main platforms for spreading and sharing information because they are fast and convenient. As information spreads online, coverage of an event is updated continually and the focus of a topic's content shifts over time. Detecting these shifts promptly and accurately helps in understanding how online public opinion evolves.【Method/Process】First, a focus-word extraction formula based on the distribution of focus feature words is defined and used to build a set of focus feature words. Second, word embeddings are trained on a large-scale corpus with the Skip-gram model; the texts are then modeled with BTM (biterm topic model), and topic word vectors are constructed directly on the BTM topic dimensions by combining them with the focus feature word set. Finally, the similarity between topic feature words is computed and applied in a clustering algorithm to recognize topic focuses.【Result/Conclusion】Experiments on a Sina Weibo dataset show that the method makes full use of the semantic information introduced by word embeddings, improves text clustering, and effectively captures the topic focus at each stage.
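The abstract does not give the paper's formulas, so the following Python sketch only illustrates the general shape of the pipeline it describes: build one vector per topic from the embeddings of its focus feature words, compare topics by similarity, and group similar topics. Everything specific here is an assumption made for illustration: the probability-weighted average in `topic_vector`, cosine similarity as the measure, the greedy single-pass grouping in `group_topics`, the 0.8 threshold, and the toy embeddings and topic-word probabilities. In practice the embeddings would come from a Skip-gram model and the topic-word distributions from BTM.

```python
import math

def topic_vector(topic_words, embeddings, focus_words):
    """Probability-weighted average of the embeddings of a topic's focus words.

    Assumed construction; the paper's own formula is not in the abstract.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, prob in topic_words.items():
        # Only words that are focus feature words and have an embedding count.
        if word in focus_words and word in embeddings:
            for i, x in enumerate(embeddings[word]):
                total[i] += prob * x
            weight_sum += prob
    if weight_sum == 0.0:
        return total  # zero vector: no usable focus word in this topic
    return [x / weight_sum for x in total]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_topics(vectors, threshold=0.8):
    """Greedy single-pass grouping (a stand-in for the paper's clustering step).

    Each topic joins the first cluster whose seed vector is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], vec) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

# Toy data (hypothetical): 2-dimensional embeddings and three topics.
embeddings = {
    "earthquake": [1.0, 0.0],
    "rescue": [0.9, 0.1],
    "donation": [0.0, 1.0],
    "weather": [0.5, 0.5],
}
focus_words = {"earthquake", "rescue", "donation"}
topics = [
    {"earthquake": 0.6, "rescue": 0.3, "weather": 0.1},
    {"rescue": 0.7, "earthquake": 0.2, "weather": 0.1},
    {"donation": 0.8, "weather": 0.2},
]

vectors = [topic_vector(t, embeddings, focus_words) for t in topics]
clusters = group_topics(vectors, threshold=0.8)
print(clusters)
```

With this toy data the first two topics share the focus words "earthquake" and "rescue" and end up in one group, while the "donation" topic forms its own group. Restricting the average to focus feature words keeps generic words such as "weather" from pulling unrelated topics together.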
