Hot Topic Detection Based on Microblog Data

Original title: 基于微博的热点话题发现
Author: 孙励
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Natural Language Processing; Microblog; New Word Identification; Topic Detection; Heat Evaluation
Degree year: 2013
Advisor: 王小捷
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2012-12-30

Abstract

As an emerging Internet medium, microblogging has gradually become a platform where users express opinions and share information. Millions of posts are published every day, and this volume makes it impossible for a user to browse them all. At the same time, microblog topics spread quickly and widely and carry considerable social influence, so extracting hot topics from microblog data and returning the important related posts helps users quickly grasp what the public is paying attention to, which is valuable for all kinds of microblog users who need key information fast. However, current microblog platforms are built around user relationships, so a user receives only the content tied to their own connections and cannot directly see the hot topics of the whole network; mining hot topics from the data and returning them to users therefore improves the user experience. Although platforms already offer applications such as hot topic lists, these involve heavy manual editing, so the resulting topics are not objective, and they judge topic heat mainly by discussion frequency, which makes it hard to reflect the real situation.
This thesis first reviews related work on topic detection and heat evaluation in China and abroad. Then, drawing on an analysis of microblog hot topics and a study of existing hot topic applications, it proposes a hot topic detection method based on the LDA model, which exploits the latent thematic information in the text and addresses shortcomings of existing methods without relying on traditional clustering. First, starting from the content features of microblogs, the method extracts repeated strings with an incremental N-gram model and filters out junk strings using statistical features (absolute term frequency, relative term frequency, mutual information, and adjacency information entropy) to identify new microblog words, which are then used to improve the accuracy of word segmentation. Next, the LDA model mines the thematic information in the microblog data; each theme is treated as a topic, which yields a list of candidate topics together with the topic-word and document-topic distributions. Finally, using the output of the GibbsLDA++ tool, each word and the topic it belongs to are treated as one unit, called a single-sense word unit (单义词单元); the weight (heat) of these units is computed to obtain the heat of each topic, and topics are ranked by heat to find the hot topics. Because the method starts from the temporal and content features of microblogs, it is well targeted, and because it excludes manual editing, the topics it finds are more objective. Experiments verify its effectiveness for both new word identification and topic detection.
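
The new word filtering step can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract names the statistical features but gives no formulas or thresholds, so the function names, the threshold values (`min_freq`, `min_mi`, `min_entropy`), and the use of the total character count as the probability denominator are all assumptions made for illustration. Relative term frequency is left out because its definition is not stated, and the incremental N-gram extraction is approximated by simply enumerating every character n-gram up to `max_n`.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(texts, max_n=4):
    """Count every character n-gram (n <= max_n) and its left/right neighbours."""
    counts = Counter()
    left = defaultdict(Counter)   # gram -> counts of the character just before it
    right = defaultdict(Counter)  # gram -> counts of the character just after it
    total_chars = 0
    for text in texts:
        total_chars += len(text)
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + n < len(text):
                    right[gram][text[i + n]] += 1
    return counts, left, right, total_chars


def entropy(counter):
    """Shannon entropy (bits) of a neighbour-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def min_pmi(gram, counts, total):
    """Lowest pointwise mutual information over all two-way splits of gram.

    Assumption: probabilities are approximated as count / total characters.
    """
    p = counts[gram] / total
    return min(
        math.log2(p / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def find_new_words(texts, min_freq=3, min_mi=2.0, min_entropy=1.0, max_n=4):
    """Keep repeated strings that are frequent, cohesive, and free in context.

    All three thresholds are hypothetical values, not taken from the thesis.
    """
    counts, left, right, total = candidate_stats(texts, max_n)
    words = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue  # absolute frequency filter
        if min_pmi(gram, counts, total) < min_mi:
            continue  # parts do not stick together strongly enough
        if min(entropy(left[gram]), entropy(right[gram])) < min_entropy:
            continue  # neighbours too predictable: likely a fragment of a longer word
        words.append(gram)
    return words
```

A string such as 给力 that recurs with many different characters on both sides passes all three filters, while a fragment that always appears inside the same longer string is rejected by the adjacency entropy check.
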
To give users a more complete picture of each hot topic, the thesis further proposes a method for returning topic-related microblogs, based on the relevance between microblog content and the topic and on the value of the publisher. It improves on the current platform practice of judging relevance by keyword matching alone. It also combines the direct factors that affect the value of a microblog (its comment and repost counts) with an indirect factor (the influence of its publisher) to assess that value and rank the returned posts, so that users can quickly learn about the events behind a hot topic and see representative user discussion at a small reading cost.
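
The ranking idea can be sketched as follows. This is only an illustration under stated assumptions, since the abstract does not give the scoring formula: the `Microblog` fields, the use of follower count as a stand-in for publisher influence, the log scaling, the 0.7/0.3 weights, and the `min_relevance` cutoff are all hypothetical. Relevance is approximated by summing per-word topic weights (for example, topic-word probabilities from a topic model) instead of a plain keyword match.

```python
import math
from dataclasses import dataclass


@dataclass
class Microblog:
    text: str
    comments: int
    reposts: int
    author_followers: int  # assumed proxy for publisher influence


def relevance(post, topic_words):
    """Sum the weights of the topic words that occur in the post text."""
    return sum(weight for word, weight in topic_words.items() if word in post.text)


def value(post, direct_weight=0.7, indirect_weight=0.3):
    """Combine direct factors (comments, reposts) with an indirect one (publisher).

    The weights and the log scaling are illustrative choices only.
    """
    direct = math.log1p(post.comments + post.reposts)
    indirect = math.log1p(post.author_followers)
    return direct_weight * direct + indirect_weight * indirect


def rank_posts(posts, topic_words, min_relevance=0.1):
    """Drop posts unrelated to the topic, then sort by relevance times value."""
    scored = []
    for post in posts:
        rel = relevance(post, topic_words)
        if rel >= min_relevance:
            scored.append((rel * value(post), post))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored]
```

With this scoring, a widely reposted post that does not mention the topic is filtered out, while an on-topic post from an influential publisher rises to the top of the returned list.
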

