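Research on Text Orientation Analysis for Forum Replies

English title: Research and Analysis on Semantic Orientation of Forum Replies
Author: Lu Bin (陆彬)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 论坛回帖 ; 文本倾向性 ; 信息安全 ; 论坛楼层结构 ; 论坛用语
English keywords: forum replies ; semantic orientation ; information security ; forum floor structure ; forum language
Degree year: 2011
Advisors: Guo Jie (郭捷) ; Liu Gongshen (刘功申)
Discipline code: 081203
Degree-granting institution: Shanghai Jiao Tong University
Thesis submission date: 2011-12-01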

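Abstract

With the rapid development of the Internet, online forums have become an important part of the network era. Main posts matter, but most people express their views by replying to the main posts they care about, so forum replies often reflect public opinion more faithfully.
     Accurate sentiment orientation analysis of forum replies requires a grasp of what makes forums distinctive. This thesis therefore first analyzes the characteristics of forum replies, such as the hierarchical relationship between floors and the linguistic features of replies.
     Taking forum replies as its object of study, this thesis proposes an orientation analysis system that is based on the forum floor structure and incorporates the characteristics of forum replies. The system first extracts the source code of the forum pages to be analyzed and preprocesses it, yielding the hierarchical floor structure of the replies and the text of each floor.
     Next, the system identifies meaningless posts among the replies on each floor. Long posts are additionally checked for relevance to the main post and then classified with machine learning methods. Short posts undergo word segmentation and grammatical analysis, and their orientation is determined with a sentiment lexicon compiled in advance from the linguistic features of forum replies, together with other common lexicons.
     Finally, from the orientation of each individual reply and the floor hierarchy built earlier, the system derives and tallies the sentiment orientation of all replies under a main post. Experiments show that the new system achieves a classification accuracy of about 80% and has good application prospects.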
With the rapid development of the Internet, online forums have become a major part of the information age. In a forum the main post is important, but most people express their opinions by replying to the main posts that concern them, so forum replies better reflect the public's emotional orientation toward social events.
     To analyze the semantic orientation of forum replies accurately, it is necessary to grasp the features of forums. This paper first analyzes the features of forum replies, such as the floor structure and the characteristics of forum language.
     This paper presents a new system for predicting the semantic orientation of forum replies based on the forum floor structure and the features of forum language. First, the system extracts the source code of the required forum pages. By analyzing the HTML code of these pages, it builds the forum floor structure and saves the segmented text in floor order.
     Next, the system determines whether each reply is meaningless. Long replies are also checked for relevance to the main post and then classified with machine learning methods. Short replies undergo word segmentation and grammatical analysis, and their semantic orientation is analyzed with the help of several word libraries.
     Finally, the semantic orientation of all replies under the post is obtained from the orientation of each individual reply and the previously built forum floor structure. Experimental results demonstrate the effectiveness of the system.
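The floor structure mentioned here could be represented roughly as follows. This is a minimal sketch, not the thesis' implementation: it assumes the HTML has already been parsed into one record per reply, and the field names `floor`, `text` and `reply_to`, the `Floor` class and the `build_floor_tree` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Floor:
    """One reply ("floor") in a thread; floor 0 stands for the main post."""
    number: int
    text: str
    parent: Optional[int] = None          # floor this reply responds to
    children: List[int] = field(default_factory=list)


def build_floor_tree(replies: List[dict]) -> Dict[int, Floor]:
    """Link replies into a tree keyed by floor number.

    Each input dict is assumed to carry 'floor', 'text' and an optional
    'reply_to' (the quoted floor). Replies without a valid earlier
    'reply_to' are attached to the main post (floor 0).
    """
    floors: Dict[int, Floor] = {0: Floor(number=0, text="<main post>")}
    for r in sorted(replies, key=lambda r: r["floor"]):
        target = r.get("reply_to")
        parent = target if target in floors else 0
        floors[r["floor"]] = Floor(r["floor"], r["text"], parent)
        floors[parent].children.append(r["floor"])
    return floors


thread = build_floor_tree([
    {"floor": 1, "text": "Agree with the post."},
    {"floor": 2, "text": "No way.", "reply_to": 1},
    {"floor": 3, "text": "Well said."},
])
print(thread[0].children)  # floors replying directly to the main post
print(thread[2].parent)    # the floor that floor 2 answers
```

Keying the tree by floor number keeps the text in floor order while still recording which reply answers which.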
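A lexicon-based treatment of short replies might look like the sketch below. It is illustrative only: the word lists are toy examples rather than the thesis' word libraries, the input is assumed to be already segmented by an external tool, and the simple negation rule stands in for the grammatical analysis the abstract mentions. The name `short_reply_orientation` is hypothetical.

```python
from typing import Dict, List

# Toy lexicons for illustration only; the thesis' actual word libraries
# are not reproduced here.
SENTIMENT: Dict[str, int] = {"支持": 1, "好": 1, "顶": 1, "垃圾": -1, "差": -1}
NEGATORS = {"不", "没"}


def short_reply_orientation(tokens: List[str]) -> int:
    """Return +1 (positive), -1 (negative) or 0 (neutral/unknown).

    `tokens` is assumed to be the output of a word segmenter. A pending
    negator flips the polarity of the next sentiment word, then resets.
    """
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = not negate
        elif tok in SENTIMENT:
            score += -SENTIMENT[tok] if negate else SENTIMENT[tok]
            negate = False
    # Collapse the summed score to its sign.
    return (score > 0) - (score < 0)


print(short_reply_orientation(["楼主", "说", "得", "好"]))  # praise for the poster
print(short_reply_orientation(["不", "支持"]))              # negated support
```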
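One way to combine individual orientations with the floor structure is sketched below. The propagation rule is an assumption made for illustration, since the abstract does not state how the hierarchy is used: a reply's stance toward the main post is taken as its own polarity multiplied by the stance of the floor it answers, so that disagreeing with a critic counts as support. The function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def orientation_toward_post(
    parent: Dict[int, Optional[int]],
    local: Dict[int, int],
) -> Dict[int, int]:
    """Propagate per-reply polarity down the floor hierarchy.

    parent[f] is the floor that f replies to (0 or None = main post);
    local[f] is f's polarity toward that parent in {-1, 0, +1}.
    Assumed rule: disagreeing with a critic counts as support for the post.
    """
    resolved: Dict[int, int] = {}

    def resolve(f: int) -> int:
        if f not in resolved:
            p = parent[f]
            resolved[f] = local[f] if p in (0, None) else local[f] * resolve(p)
        return resolved[f]

    for f in local:
        resolve(f)
    return resolved


# Floor 2 disagrees with floor 1, which itself criticizes the main post.
parent = {1: 0, 2: 1, 3: 0, 4: 2}
local = {1: -1, 2: -1, 3: 1, 4: 1}
final = orientation_toward_post(parent, local)
print(final)                    # stance of each floor toward the main post
print(Counter(final.values()))  # tally of positive and negative replies
```

The tally at the end corresponds to the per-thread statistics described above; a neutral reply (polarity 0) also neutralizes every reply beneath it under this rule.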

