ET-TAG: A Tag Generation Model for the Sub-Topics of Public Opinion Events (面向舆情事件的子话题标签生成模型ET-TAG)
  • English title: ET-TAG: A Tag Generation Model for the Sub-Topics of Public Opinion Events
  • Authors: ZHOU Nan (周楠); DU Pan (杜攀); JIN Xiao-Long (靳小龙); LIU Yue (刘悦); CHENG Xue-Qi (程学旗)
  • Keywords (translated from Chinese): topic discovery; PLSA with Background Language; keyword clustering; topic tag generation
  • Keywords (English): sub-topic detection; PLSA with Background Language Model; keyword clustering; sub-topic tag generation
  • Journal code: JSJX
  • Journal (English name): Chinese Journal of Computers
  • Affiliations: CAS Key Laboratory of Network Data Science & Technology; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Publication date: 2017-10-03 16:07
  • Publisher: Chinese Journal of Computers (计算机学报)
  • Year: 2018
  • Issue: v.41; No.427
  • Funding: National Natural Science Foundation of China (61572473, 61472400); National Youth Science Foundation (61303156)
  • Language: Chinese
  • Page: JSJX201807004
  • Number of pages: 14
  • CN: 07
  • ISSN: 11-1826/TP
  • Classification code: 62-75
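Abstract
News data about public opinion events are complex and varied. Even news about the same event often covers different sub-topics (different aspects of the event). Generating tags that accurately describe the meaning of each sub-topic is therefore important for in-depth analysis of public opinion events, including identifying hot spots and monitoring how an event develops. Generating sub-topic tags usually involves two key steps: first discovering the sub-topics, then generating an effective tag for each sub-topic from its keywords or document content. Traditional methods mostly discover topics by clustering or classification, grouping documents on the same topic into one cluster. However, because documents belonging to the same event are highly similar, existing methods struggle to measure the distance between them and therefore cannot be applied to discovering sub-topics within an event. In addition, traditional methods usually generate sub-topic tags by extraction, and the accuracy of such tags cannot be guaranteed. To address this, the paper proposes ET-TAG, a model that discovers sub-topics within an event using PLSA with Background Language combined with keyword clustering, and then generates sub-topic tags from knowledge bases such as Wikipedia. Experiments on datasets covering several categories of public opinion events show that ET-TAG outperforms existing sub-topic discovery methods such as K-means and LDA; for tag generation, the tags produced by ET-TAG are also more accurate and more general than those of traditional methods. Finally, the paper applies the tags generated by ET-TAG to event comparison and tracking; the results show that sub-topic tags can reveal commonalities between events and reflect trends in the popularity of an event's sub-topics.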
        A public opinion system monitors trends in public opinion on the Web. Through such a system, we can understand hot spots on the Web and track how they evolve. Events are the focus of a public opinion system. News data about public opinion events are very complicated. Even data about the same event often contain different sub-topics (different perspectives on the event). The sub-topics of an event reflect its different aspects. For example, for an earthquake, the sub-topics include earthquake details, rescue work, post-disaster reconstruction, and so on. These sub-topics not only embody different aspects of the event, but also reflect the hot spots that the public may be concerned about. Tags of event sub-topics can be regarded as attributes of events, which help us describe and comprehensively understand them. Through sub-topics, we can compare the similarities and differences between events, and the sub-topic tags within a given period can reflect changes in public attention to different aspects of an event. Detecting the sub-topics of events and generating accurate sub-topic tags is therefore significant for a public opinion system. Generating tags for the sub-topics of a public opinion event usually involves two major steps: first discovering the sub-topics, and then generating effective tags for them based on their corresponding keywords and documents. Existing methods for discovering topics or sub-topics are usually based on clustering or classification, which put documents about the same topic into the same cluster. However, because documents about the same event are similar to each other, it is very difficult for existing methods to measure the distance between these documents, and thus they cannot effectively differentiate the sub-topics within the same event. Moreover, each document contains many high-frequency background words, so ensuring the diversity of sub-topics is a major problem. In addition, traditional methods often generate sub-topic tags by extraction, so neither the accuracy nor the intelligibility of the generated tags can be guaranteed. To overcome these problems, this paper proposes the ET-TAG model, which uses PLSA-BLM to discover sub-topic keywords, uses KL divergence to merge similar sub-topics, and then uses co-occurrence relations to update the sub-topic keywords. Based on the sub-topic keywords, an external knowledge base is used to generate the corresponding tag for each sub-topic. Experiments on the Sogou news corpus and on a multi-category corpus of public opinion events show that ET-TAG has clear advantages over traditional methods (including K-means and LDA) in sub-topic discovery. ET-TAG also achieves higher accuracy when generating sub-topic tags, and the generated tags are more accurate and summarize the sub-topics better. Finally, the tags generated by ET-TAG are used to compare and track events, which shows that sub-topic tags can help find the common points between different events and reflect the heat trends of the sub-topics of events.
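
The abstract states that KL divergence is used to merge similar sub-topics but gives no details. The following is a minimal sketch of one way such a step could work, under assumptions not taken from the paper: each sub-topic is a word-probability list over a shared vocabulary, similarity is measured by a symmetrized KL divergence, the closest pair is merged greedily by averaging, and the `threshold` parameter and all function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0); it is an illustrative smoothing choice,
    not something specified in the abstract.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the measure is symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def merge_similar_topics(topics, threshold):
    """Greedily merge the closest pair of topics until no pair is closer
    than `threshold` (hypothetical parameter).

    A merged topic is the element-wise average of the two originals,
    which is still a valid probability distribution.
    """
    topics = [list(t) for t in topics]
    while len(topics) > 1:
        # Find the pair of topics with the smallest symmetric KL divergence.
        best = None
        for i in range(len(topics)):
            for j in range(i + 1, len(topics)):
                d = symmetric_kl(topics[i], topics[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break  # remaining topics are sufficiently distinct
        merged = [(a + b) / 2 for a, b in zip(topics[i], topics[j])]
        topics = [t for k, t in enumerate(topics) if k not in (i, j)]
        topics.append(merged)
    return topics


if __name__ == "__main__":
    # Toy word distributions over a three-word vocabulary.
    toy_topics = [
        [0.70, 0.20, 0.10],
        [0.68, 0.22, 0.10],  # nearly identical to the first topic
        [0.10, 0.10, 0.80],
    ]
    for topic in merge_similar_topics(toy_topics, threshold=0.05):
        print(topic)
```

In this toy run the first two distributions are close enough to be merged, while the third stays separate. The actual merging criterion and threshold used by ET-TAG would have to be taken from the full paper.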
