Research on Microblog Rumor Identification Based on LDA and Random Forest: A Case Study of the 2016 Haze Rumors

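English title: Research on Microblog Rumor Identification Based on LDA and Random Forest
Authors: Zeng Ziming; Wang Jing
Keywords: Weibo; rumor identification; LDA; random forest; haze
Chinese journal title: QBXB
English journal title: Journal of the China Society for Scientific and Technical Information
Institutions: Center for the Study of Information Resources, Wuhan University; Laboratory Center for Library and Information Science, Wuhan University
Publication date: 2019-01-24
Publisher: Journal of the China Society for Scientific and Technical Information (情报学报)
Year: 2019
Issue: v.38
Funding: Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education, "Intelligent Management of Big Data Resources and Cross-Departmental Interaction: For the Public Security Field" (16JJD870003)
Language: Chinese
Pages: QBXB201901010
Page count: 8
CN: 01
ISSN: 11-2257/G3
Classification number: 93-100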

Abstract

The spread of online rumors has a fairly serious negative impact on everyday life and social stability. To support effective rumor control, this paper takes the 2016 haze rumors on the Sina Weibo microblogging platform as a case study. It defines user-credibility and microblog-influence feature variables from Weibo data and prior research, uses the LDA topic model to extract the topic distribution of each microblog text, and trains a random forest classifier on these feature variables to identify rumors. The experiments show that the document-topic distribution features extracted by LDA play an important role in rumor identification, and that the LDA-based random forest model effectively improves the accuracy of rumor identification.
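The pipeline summarized in the abstract (LDA document-topic features combined with user and influence variables, then a random forest classifier) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn and NumPy, and the toy posts, the three stand-in feature columns (verified flag, follower count, repost count), and the choice of three topics are all hypothetical. Real Weibo text would also need Chinese word segmentation before vectorization.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-tokenized posts (hypothetical data). Label 1 = rumor, 0 = non-rumor.
posts = [
    "haze causes cancer secret report share now",
    "haze mask useless doctors hide truth share",
    "shocking haze contains deadly bacteria forward now",
    "secret cure haze drink vinegar share friends",
    "weather bureau issues haze orange alert today",
    "city limits traffic during heavy haze days",
    "schools suspend outdoor classes because haze alert",
    "air quality index improves after cold front",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Hypothetical stand-ins for the user-credibility and microblog-influence
# variables: [verified account (0/1), log10 followers, log10 reposts].
user_and_influence = np.array([
    [0, 2.1, 3.0], [0, 1.8, 2.7], [0, 2.5, 3.2], [0, 2.0, 2.9],
    [1, 5.3, 2.2], [1, 4.9, 1.8], [1, 4.5, 1.5], [1, 5.0, 2.0],
])

# Step 1: bag-of-words counts, then LDA document-topic distributions.
counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per post, one column per topic

# Step 2: concatenate the topic features with the other feature variables.
features = np.hstack([user_and_influence, doc_topics])

# Step 3: train a random forest on the combined features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, labels)
predictions = forest.predict(features)

# Per-feature importances hint at how much the topic columns contribute.
importances = forest.feature_importances_
```

In a real experiment the model would be evaluated on held-out data rather than on the training posts, and the number of topics would be tuned rather than fixed in advance.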

