Abstract
Microblog posts are short, new Internet slang emerges constantly, and topics drift as they develop. To address these characteristics, this paper proposes an adaptive microblog topic tracking method based on a double-vector model. First, the double-vector model vectorizes each text in two ways, with word embeddings and with the vector space model (VSM), which preserves text semantics while also handling microblog neologisms. Second, both topics and microblogs are represented with the double-vector model, and the cosine similarity between the topic's and the microblog's double-vector representations is taken as the topic-microblog similarity. Third, this similarity is compared with a similarity threshold obtained by adaptive learning to decide whether the microblog is relevant to the topic. Finally, the topic model is updated adaptively, which effectively counters the drift caused by the development of microblog topics. Experimental results show that the method tracks topics in real time and reduces both the miss rate and the false positive rate for topic-relevant microblogs.
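The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the semantic vector is taken as the mean of the word embeddings, the VSM vector uses raw term counts, the two cosine similarities are combined by a weighted sum with a hypothetical weight `alpha`, the threshold is a fixed parameter rather than one learned adaptively, and the topic model is updated by a simple moving average with a hypothetical `rate`. All function names are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two vectors stored as dicts (key -> weight)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def double_vector(tokens, embeddings):
    """Return (semantic, lexical) vectors for a tokenized text.

    semantic: mean embedding of the tokens found in `embeddings` (assumption).
    lexical:  term-count VSM vector, so words without an embedding
              (for example neologisms) still contribute.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        mean = [sum(column) / len(known) for column in zip(*known)]
    else:
        mean = []
    semantic = dict(enumerate(mean))  # dense vector as a dict, so cosine() is reusable
    lexical = dict(Counter(tokens))
    return semantic, lexical


def similarity(topic, post, alpha=0.5):
    """Weighted sum of semantic and lexical cosine similarity (alpha is assumed)."""
    return alpha * cosine(topic[0], post[0]) + (1 - alpha) * cosine(topic[1], post[1])


def blend(old, new, rate):
    """Moving-average update of one vector (assumed update rule)."""
    keys = set(old) | set(new)
    return {k: (1 - rate) * old.get(k, 0.0) + rate * new.get(k, 0.0) for k in keys}


def track(topic, posts, embeddings, threshold=0.5, rate=0.2):
    """Keep posts whose similarity reaches the threshold; update the topic after each hit."""
    relevant = []
    for tokens in posts:
        post = double_vector(tokens, embeddings)
        if similarity(topic, post) >= threshold:
            relevant.append(tokens)
            # Fold the accepted post into the topic so the model follows topic drift.
            topic = tuple(blend(t, p, rate) for t, p in zip(topic, post))
    return relevant, topic
```

To try it, build a topic with `double_vector(seed_tokens, embeddings)` from a few seed words, then pass a stream of tokenized posts to `track`. Because accepted posts are blended into the topic, a word that first appears in a relevant post gains weight in the lexical vector and raises the score of later posts that use it.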