Research on a Hot Topic Opinion Mining Model for Virtual Communities

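English title: Research on Model of Hot Topic Opinion Mining in Virtual Communities
Author: Mai Lin (麦林)
Thesis level: Master's
Discipline: Signal and Information Processing
Chinese keywords (translated): virtual community; topic extraction; topic hotness evaluation; topic opinion mining; topic relevancy based on structure information; multi-feature fusion classification
English keywords: Virtual Community; Topic Extraction; Hotspot Evaluation; Topic Opinion Mining; Topic Relevancy Algorithm Based on Structure Information; Multi-Feature Fusion Classification
Degree year: 2009
Advisor: Yu Nenghai (俞能海)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2009-05-17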

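Abstract

With the rapid growth of the Internet and the rise of Web 2.0 applications, users have become an indispensable part of the Internet, and user-generated content has become its most active, most watched, and most valuable resource. User-generated content originates in the real world, largely reflects users' real thoughts and feelings, and is therefore highly authentic. Virtual communities hold the largest volume of user-generated content, so mining virtual communities and their content has both theoretical significance and practical value.
     The main work and contributions of this thesis are as follows:
     1. Taking virtual communities as the object of study, the thesis discusses their characteristics, structure, and content organization, as well as the composition, structure, and features of the topics in them. It distinguishes the concepts of topic (话题) and subject (主题) and represents each topic as a tree, which lays the foundation for the later work.
     2. A topic relevancy algorithm based on structure information is proposed. The thesis studies the causes and characteristics of "off-topic" drift within topics and its effect on subject quality, and proposes topic relevancy as a measure of how consistent the discussion under a subject is with the original subject. It describes one topic relevancy algorithm based on text similarity and another based on the subject's structure information, and compares the two experimentally. The experimental results show that the proposed structure-based algorithm performs better.
     3. A multi-feature fusion classification method is proposed. The thesis studies the multi-feature nature of Internet text and takes full account of how differently each feature represents a text. It proposes a multi-feature fusion classification method based on the Naive Bayes classifier and applies it to blog post classification. The experimental results show that multi-feature fusion achieves higher accuracy.
     4. Building on the above, the thesis proposes methods for topic extraction, topic hotness evaluation, and topic opinion mining in virtual communities, and combines the three into a single hot topic opinion mining model for virtual communities. Topic extraction combines classification with clustering; hotness evaluation rates a subject's hotness from three aspects: attention, relevancy, and timeliness; topic opinion mining judges the subjectivity, opinion polarity, and opinion target of every post and derives the users' overall opinion on the topic. The experimental results show that the topic extraction method is fairly accurate, the hotness evaluation agrees reasonably well with reality, and the opinion mining results reflect, to some extent, users' overall attitude toward the topic. The proposed hot topic opinion mining model is therefore reasonable and effective.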
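
The multi-feature fusion idea in contribution 3 can be illustrated with a small sketch. The abstract does not say how the features are fused, so everything specific here is an assumption made for illustration: each feature channel (the hypothetical names `title` and `body`) gets its own multinomial Naive Bayes word model, and the per-channel log-likelihoods are combined with hand-picked weights.

```python
import math
from collections import Counter, defaultdict


class MultiFeatureNB:
    """Multinomial Naive Bayes with one word model per feature channel.

    Illustrative only: the weighted log-likelihood fusion is an assumption,
    not the method described in the thesis.
    """

    def __init__(self, weights, alpha=1.0):
        self.weights = weights  # hypothetical channel weights, e.g. {"title": 2.0, "body": 1.0}
        self.alpha = alpha      # Laplace smoothing constant
        self.class_counts = Counter()
        # counts[channel][label] is a Counter of token frequencies
        self.counts = {ch: defaultdict(Counter) for ch in weights}
        self.vocab = {ch: set() for ch in weights}

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.class_counts[label] += 1
            for ch in self.weights:
                tokens = doc.get(ch, [])
                self.counts[ch][label].update(tokens)
                self.vocab[ch].update(tokens)
        return self

    def log_scores(self, doc):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # class prior
            for ch, w in self.weights.items():
                freq = self.counts[ch][label]
                # The extra +1 reserves probability mass for unseen tokens.
                denom = sum(freq.values()) + self.alpha * (len(self.vocab[ch]) + 1)
                for tok in doc.get(ch, []):
                    score += w * math.log((freq[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)
```

A document is passed as a dictionary of token lists, one per channel. Raising the weight of a channel makes its evidence count more when channels disagree, which is one simple way to model the differing expressive power of features.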

Abstract (author's English version)

With the rapid development of the Internet and the rise of Web 2.0 applications, users have become increasingly important to the Internet. User-generated content (UGC) is the most active, most watched, and most valuable resource on the Web. UGC comes from the real world and reflects what users really think. Because virtual communities contain the largest amount of UGC, it is worthwhile to study virtual communities and to mine their content.
     The main work and contributions of this dissertation are as follows:
     1. The dissertation studies the features, structure, and content organization of virtual communities. It distinguishes the definitions of Subject and Topic, studies the components, structure, and features of topics, and proposes building a tree structure for each topic from the reply relations between posts.
     2. The dissertation studies the causes and features of “Topic Drift” and proposes the concept of Topic Relevancy to detect it. Because UGC text is not standardized, traditional algorithms based on text similarity perform poorly. The dissertation proposes a novel method that computes topic relevancy from the structure information of topics. The method achieves good results in practice.
     3. The dissertation studies the multiple features of Web documents and evaluates the importance of the different features. It proposes a novel text classification method, based on Naïve Bayes classification, that makes full use of these features. The method is applied to blog post classification and achieves good results in practice.
     4. Based on the work above, the dissertation proposes a topic extraction method, a hotspot evaluation method, and an opinion mining method for virtual communities. Together the three methods form the hot topic opinion mining model for virtual communities. The topic extraction method combines classification and clustering algorithms. The hotspot evaluation method rates how hot a topic is from its attention rate, relevancy, and timeliness. The topic opinion mining method derives the overall opinion on a topic by analyzing the subjectivity, opinion polarity, and opinion object of each post in the topic. In practice, the topic extraction method achieves high precision, the hotspot evaluation results agree with the actual situation, and the opinion mining results reflect the overall opinion of the users. The proposed hot topic opinion mining model for virtual communities is therefore reasonable and effective.
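
The hotspot evaluation step names three factors (attention, relevancy, timeliness) but the abstract gives no formula. The sketch below shows one way such a score could be assembled. The weighted sum, the log scaling of views and replies, the exponential decay, and all parameter names and default values are assumptions chosen for illustration, not the thesis's actual model.

```python
import math


def hotness(views, replies, relevancy, hours_since_last_reply,
            max_views, max_replies,
            half_life_hours=48.0, weights=(0.4, 0.3, 0.3)):
    """Combine attention, relevancy and timeliness into a score in [0, 1].

    relevancy is assumed to be a value in [0, 1], for example the share of
    replies judged on-topic. max_views and max_replies are the largest
    values observed in the community and are used for normalisation.
    """
    # Attention: log-scaled views and replies, normalised by the maxima.
    attention = 0.5 * (
        math.log1p(views) / math.log1p(max(max_views, 1))
        + math.log1p(replies) / math.log1p(max(max_replies, 1))
    )
    # Timeliness: exponential decay that halves every half_life_hours.
    timeliness = 0.5 ** (hours_since_last_reply / half_life_hours)
    w_attention, w_relevancy, w_timeliness = weights
    return (w_attention * attention
            + w_relevancy * relevancy
            + w_timeliness * timeliness)
```

Under these assumptions, a heavily viewed thread that has drifted off topic or gone quiet scores lower than an equally popular thread that is still active and on topic, which matches the intent of evaluating hotness from all three aspects together.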

