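Research on Topic Detection in Blogosphere Based on Content Analysis

Original title (Chinese): 基于内容分析的Blog话题检测方法研究
Author: He Jinyan (何金艳)
Thesis level: Master's
Discipline: Computer Science and Technology
Keywords: blogosphere ; topic detection ; topic model ; special topic extraction
Degree year: 2010
Advisors: Huang Zhexue (黄哲学) ; Ye Yunming (叶允明)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology
Submission date: 2009-12-01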

Abstract

Topic detection is an information-processing technique that identifies previously unknown topics in a stream of text, and it is an important component of topic detection and tracking. It aims to expand an event that occurs at a specific time and place into a topic that covers a broader set of related content, and it has considerable practical value for information extraction and public-opinion monitoring. Most existing topic detection algorithms target news-site corpora, whose topics are bursty and persistent; algorithms designed specifically for the blogosphere are not yet mature. The reason is that blogs are a personal medium: compared with news corpora, blog data is far larger in volume and more varied in form.
     This thesis analyzes the structure of blog data in depth and identifies the main technical requirements for topic detection on it. To cope with the varied forms of blog data, it selects the essential characteristics and turns them into a new topic model built mainly around a topic centroid and a keyword sequence. Based on this model it designs a topic detection algorithm, a topic keyword extraction algorithm, and a special topic extraction algorithm. The main contributions are as follows:
     (1) A topic model suited to the characteristics of blog data. The model consists of several features: topic name, keyword sequence, topic centroid, post set, and topic start time. The model runs through the three core algorithms of the thesis: the topic detection and topic keyword extraction algorithms build topic models from blog posts, and the special topic extraction algorithm organizes topics further on the basis of those models.
     (2) A topic detection algorithm. After analyzing commonly used text clustering algorithms, the thesis selects incremental clustering as the basis of the algorithm and introduces three optimization strategies to improve detection quality: topic centroid updating, text filtering, and topic model selection. Comparative experiments demonstrate the effectiveness of the algorithm.
     (3) A topic keyword extraction algorithm that extracts a set of characteristic words for each topic. The algorithm is based on mutual information, as used in text feature selection, and adds an optimization that gives extra weight to words appearing in post titles. Experiments demonstrate its effectiveness.
     (4) A special topic extraction algorithm built on the topic model. The algorithm follows the idea of hierarchical clustering and uses three features of the topic model: the keyword set, the topic centroid, and the topic start time. A separate similarity formula is defined for each feature, and these formulas are used to compute the similarity between topic models. Experiments demonstrate the effectiveness of the algorithm.
     Building on these results, the thesis designs a blog topic detection system composed of five modules: a database module, a data preprocessing module, a topic detection module, a topic model feature extraction module, and a special topic extraction module. A prototype of the system has been implemented, laying a foundation for further research on blog topic detection.
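
     The incremental clustering step named in contribution (2) can be illustrated with a minimal single-pass sketch. This is not the thesis's implementation: the abstract gives no formulas, so the raw term-frequency vectors, the cosine similarity measure, the running-mean centroid update, the 0.3 threshold, and the names `cosine`, `Topic`, and `detect_topics` are all assumptions chosen for illustration. The text filtering and topic model selection strategies are omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Topic:
    """Hypothetical, reduced topic model: a post set and a centroid only."""

    def __init__(self, post_id, vector):
        self.posts = [post_id]
        self.centroid = dict(vector)

    def add(self, post_id, vector):
        # Centroid update (assumed form): running mean over member posts.
        n = len(self.posts)
        terms = set(self.centroid) | set(vector)
        self.centroid = {
            t: (self.centroid.get(t, 0.0) * n + vector.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        self.posts.append(post_id)


def detect_topics(posts, threshold=0.3):
    """Single-pass incremental clustering over (post_id, tokens) pairs.

    Each post joins the most similar existing topic if the similarity
    reaches the threshold; otherwise it starts a new topic.
    """
    topics = []
    for post_id, tokens in posts:
        vector = Counter(tokens)  # raw term frequencies (assumption)
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vector, topic.centroid)
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best.add(post_id, vector)
        else:
            topics.append(Topic(post_id, vector))
    return topics
```

     Because posts are processed one at a time and never revisited, a sketch like this can keep up with a continuous stream of new posts, which is presumably why an incremental method suits blog data. The threshold controls granularity: a higher value yields more, narrower topics.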
