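Research on Text Sentiment Orientation Based on Text Classification Techniques

English title: Text Sentiment Analysis Based on Text Classification
Author: Guo Ming (郭明)
Degree level: Master's
Discipline: Computer System Architecture
Keywords: support vector machine; Bayes; K-nearest neighbor; feature selection; weight calculation
Degree year: 2010
Advisor: Zan Hongying (昝红英)
Discipline code: 081201
Degree-granting institution: Zhengzhou University (郑州大学)
Submission date: 2010-04-01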

Abstract

Text sentiment orientation analysis has become a focus of research in recent years, and its applications keep expanding, from monitoring public opinion to tracking product reputation. Building on traditional text classification techniques, this thesis proposes a sentiment analysis model that combines rules with statistical methods, and evaluates it on two representative corpora. Corpus 1 is a news corpus with a complex domain background and a highly imbalanced class distribution. Corpus 2 is a corpus of expert stock commentary drawn from a single domain.
     (1) News text. The goal is to analyze the sentiment orientation of news text so as to supply sentiment information for automatic news broadcasting. The thesis proposes a method for identifying the central sentence of a text and, working from the extracted central sentences, uses statistical methods to mine candidate rules that supplement a manually built rule base, making the rule base more complete and improving the sentiment analysis. In the experiments, support vector machine, Bayes, and K-nearest-neighbor classifiers are each combined with the rules, and several feature selection and feature weighting methods are compared. Because the news corpus is highly imbalanced, purely statistical methods perform poorly on rare classes. Combining rules with statistics does not fully solve this problem, but it does improve the results to some extent. The experiments show that the combined model clearly outperforms the purely statistical models, which suggests that the combined approach generalizes well.
     (2) Stock commentary. This work is built on a vertical search application for the stock domain, which needs to classify each expert comment on a given stock as bullish, neutral, bearish, or uncertain. The comments are short, highly domain-specific, and strongly colloquial, so general-purpose word segmentation software does not segment them well. The thesis proposes a simple method for locating feature words that meets the needs of the experiment and is very efficient, with O(n) time complexity. Because the domain is narrow, a fairly complete rule set is easy to extract. In these experiments the average accuracy of the rules is above 90% and is higher than that of the statistical methods.
     The combined rule-and-statistics model works well on the news corpus with its complex background: it clearly improves on purely statistical classification and effectively improves the classification of rare classes. On the single-domain stock corpus, however, it brings little improvement, which indicates that the rule-based method is better suited to corpora with a single domain background.
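
The rule-plus-statistics combination summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two components are combined, so the "rules first, classifier as fallback" order used here is an assumption, as are the rule phrases, the training sentences, and all names (`RULES`, `apply_rules`, `NaiveBayes`, `classify`). A small multinomial naive Bayes stands in for the thesis's Bayes classifier; it is not the author's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical rule base (phrase -> label); the thesis's real rules are not given.
RULES = {"record profit": "positive", "heavy losses": "negative"}


def apply_rules(text):
    """Return the label of the first matching rule, or None if no rule fires."""
    for phrase, label in RULES.items():
        if phrase in text:
            return label
    return None


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for word in text.split():
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify(text, model):
    """Assumed combination: a firing rule wins, otherwise use the classifier."""
    label = apply_rules(text)
    return label if label is not None else model.predict(text)


# Toy, deliberately imbalanced training set (three positive, one negative).
train_texts = [
    "shares rise on strong sales",
    "strong growth lifts shares",
    "strong sales growth",
    "shares fall on weak sales",
]
train_labels = ["positive", "positive", "positive", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)

print(model.predict("company reports heavy losses"))
print(classify("company reports heavy losses", model))
```

On this toy data the classifier alone has seen none of the words in "company reports heavy losses", so it falls back on the class prior and picks the majority class, while the rule for "heavy losses" assigns the rare class. This mirrors, in miniature, the rare-class effect the abstract reports for the imbalanced news corpus.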

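The abstract states that feature words in the stock comments are located in O(n) time without relying on a general-purpose word segmenter, but it does not describe the procedure. The sketch below shows one common way to get that behavior, a longest-match scan against a domain lexicon, and should be read as an assumption rather than the author's method. The lexicon entries, the voting rule, and the names `LEXICON`, `locate_feature_words`, and `classify_comment` are all hypothetical.

```python
# Hypothetical domain lexicon (feature word -> orientation); not from the thesis.
LEXICON = {
    "买入": "bullish",    # buy
    "增持": "bullish",    # increase holdings
    "看好": "bullish",    # optimistic
    "卖出": "bearish",    # sell
    "减持": "bearish",    # reduce holdings
    "不看好": "bearish",  # not optimistic
    "观望": "neutral",    # wait and see
    "持有": "neutral",    # hold
}
MAX_LEN = max(len(word) for word in LEXICON)


def locate_feature_words(text):
    """Scan left to right, taking the longest lexicon entry at each position.

    Each position tries at most MAX_LEN substrings, so for a fixed lexicon
    the scan is linear in the length of the text.
    """
    found = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON:
                found.append((i, candidate))
                i += length  # skip past the matched word
                break
        else:
            i += 1  # no entry starts here; move on one character
    return found


def classify_comment(text):
    """Toy rule: a unanimous orientation wins; conflicts or no hits are uncertain."""
    orientations = {LEXICON[word] for _, word in locate_feature_words(text)}
    if len(orientations) == 1:
        return orientations.pop()
    return "uncertain"


print(locate_feature_words("短线建议买入"))
print(classify_comment("短期不看好该股"))
```

Taking the longest match first means "不看好" is read as one bearish feature word instead of the bullish "看好" that it contains, which is the kind of case where a general-purpose segmenter might split the phrase differently.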
