Research on Key Technologies of a Microblog Public Opinion Analysis System

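Author: Xie Qianlong (谢乾龙)
Degree level: Master's
Discipline: Signal and Information Processing
Chinese keywords: 微博 (microblog) ; 舆情 (public opinion) ; 文本检索 (text retrieval) ; 突发话题检测 (burst topic detection) ; 话题聚类 (topic clustering)
English keywords: microblog ; public opinion ; text retrieval ; burst event detection ; topic clustering
Degree year: 2013
Advisor: Xu Weiran (徐蔚然)
Discipline code: 081002
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2013-03-07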

Abstract

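With the rapid development of online public opinion, and especially with the rise of microblogs, the platforms on which the public openly voices influential opinions about public affairs are shifting quickly, and public opinion analysis on microblog platforms has become an active research direction.
     Taking a microblog public opinion analysis system as its starting point, this thesis focuses on the retrieval system and the burst topic detection algorithms for microblog platforms. The main contributions are as follows:
     1. The thesis first studies retrieval models and burst topic detection algorithms in depth, surveys the state of related research in China and abroad, and then introduces several classic retrieval models and burst topic detection algorithms.
     2. A short-text topic retrieval system is designed and implemented for microblog data. For a user-supplied topic, the system expands the query with a query expansion algorithm based on word activation force (WAF) and then uses a "two-pass retrieval" step to return microblog posts that are both highly relevant and timely. The algorithm was applied in the TREC 2011 Microblog Track, where it placed second.
     3. A burst feature detection algorithm based on a state automaton is proposed; the preprocessing pipeline and the automaton model parameters are tuned for microblog data, which is short, linguistically irregular, noisy, and large in volume. A burst topic clustering algorithm is also proposed: it improves the term-frequency vector representation of feature words and introduces WAF-based lexical features, which makes the clustering more accurate and the resulting burst topics more readable. Experiments confirm the feasibility of the algorithms.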
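To make the WAF-based query expansion in item 2 more concrete, the following is a minimal sketch, not the thesis implementation. It assumes the commonly cited form of word activation force, waf(a, b) = (f_ab / f_a) * (f_ab / f_b) / d_ab^2, where f_a and f_b are word frequencies, f_ab counts how often a precedes b within a window, and d_ab is their average distance. The function names, the window size, the toy corpus, and the rule of summing WAF over query terms are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def word_activation_force(docs, window=5):
    """Return {(a, b): waf} for ordered pairs where a precedes b within `window` tokens.

    Assumed formula: waf = (f_ab / f_a) * (f_ab / f_b) / d_ab**2.
    """
    freq = Counter()                # f_a: occurrences of each word
    co_freq = Counter()             # f_ab: ordered co-occurrence counts
    dist_sum = defaultdict(float)   # summed distances, used for the mean d_ab
    for tokens in docs:
        freq.update(tokens)
        for i, a in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                b = tokens[j]
                if a == b:
                    continue
                co_freq[(a, b)] += 1
                dist_sum[(a, b)] += j - i
    waf = {}
    for (a, b), f_ab in co_freq.items():
        d_ab = dist_sum[(a, b)] / f_ab
        waf[(a, b)] = (f_ab / freq[a]) * (f_ab / freq[b]) / (d_ab ** 2)
    return waf


def expand_query(query_terms, waf, top_k=3):
    """Append the top_k words most strongly linked to any query term (hypothetical rule)."""
    query = set(query_terms)
    scores = Counter()
    for (a, b), w in waf.items():
        if a in query and b not in query:
            scores[b] += w
        elif b in query and a not in query:
            scores[a] += w
    return list(query_terms) + [term for term, _ in scores.most_common(top_k)]


if __name__ == "__main__":
    # Toy corpus of already tokenized posts (illustrative data only).
    corpus = [
        ["earthquake", "japan", "tsunami"],
        ["earthquake", "japan", "relief"],
        ["football", "match"],
    ]
    forces = word_activation_force(corpus)
    print(expand_query(["earthquake"], forces, top_k=2))
```

In a two-pass setup of the kind the abstract describes, the expanded query would drive the second retrieval pass; how the thesis combines relevance with timeliness is not stated here, so that step is left out.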
With the rapid development of online public opinion, especially on microblogs, the dominant platform on which people express their opinions about public events is changing rapidly. Public opinion analysis has become a popular research direction.
     This thesis focuses on the retrieval system and the burst event detection system in a microblogging environment. The main research covers the following three aspects:
     1. The thesis first reviews current research on text retrieval and burst detection, and then introduces several classic retrieval models and burst detection algorithms.
     2. A short-text retrieval system is implemented on a microblog corpus. For a given topic, the system uses a query expansion algorithm based on word activation force (WAF) and a "two-pass retrieval" algorithm to return microblog posts that are both highly relevant and timely. The algorithm was used in the TREC 2011 Microblog Track and achieved good results.
     3. Several optimizations adapt the classical automaton-based burst detection algorithm to microblogs. They follow two main directions. First, because microblog posts differ from traditional web pages, the preprocessing and the model parameters are adjusted. Second, the topic clustering method is improved by refining the feature vectors and by introducing WAF-based lexical features into the similarity between words. Experiments verify the feasibility of the algorithm.
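As an illustration of the automaton-based burst detection mentioned in item 3, here is a minimal sketch of a two-state automaton decoded with the Viterbi algorithm, in the spirit of Kleinberg's burst model. It is an assumption that this is the classical algorithm the abstract refers to, and the thesis's tuned preprocessing and parameters are not reproduced. The function name, the scaling factor `s`, the flat transition penalty `gamma`, and the sample counts are all hypothetical; the binomial coefficient is omitted from the emission cost because it is identical for both states and does not change the best path.

```python
import math


def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Label each time window 0 (normal) or 1 (bursty) for one feature word.

    relevant[t]: posts containing the word in window t
    totals[t]:   all posts in window t
    s:           assumed ratio between the bursty and the baseline rate
    gamma:       assumed flat penalty for entering the bursty state
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)    # baseline occurrence rate
    if p0 <= 0.0 or p0 >= 1.0:
        return [0] * n                  # degenerate input: no burst can be scored
    p1 = min(0.9999, s * p0)            # elevated rate of the bursty state
    probs = (p0, p1)

    def emit_cost(state, r, d):
        # Negative log-likelihood of r hits in d posts (binomial coefficient dropped).
        p = probs[state]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    def trans_cost(prev, cur):
        # Moving up costs gamma; staying or moving down is free.
        return gamma if cur > prev else 0.0

    # The automaton is assumed to start in the normal state.
    cost = [emit_cost(0, relevant[0], totals[0]),
            gamma + emit_cost(1, relevant[0], totals[0])]
    back = []
    for t in range(1, n):
        new_cost, pointers = [], []
        for cur in (0, 1):
            prev = min((0, 1), key=lambda q: cost[q] + trans_cost(q, cur))
            new_cost.append(cost[prev] + trans_cost(prev, cur)
                            + emit_cost(cur, relevant[t], totals[t]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states


if __name__ == "__main__":
    # Hypothetical hourly counts for one word: a spike in the middle windows.
    hits = [1, 1, 1, 10, 12, 11, 1, 1]
    posts = [100] * 8
    print(detect_bursts(hits, posts))
```

Words whose state sequences contain bursty windows would then be the burst features passed on to clustering; a larger `gamma` makes the detector less sensitive to short spikes.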

