Research on Chinese Phrase Sentiment Analysis Based on Chunking

Chinese title: 基于组块分析的中文短语情感倾向研究
Author: Sun Hui (孙慧)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Sentiment analysis; Disambiguation; Sentiment phrase; Sentiment classification
Degree year: 2010
Supervisor: Guan Yi (关毅)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology (哈尔滨工业大学)
Thesis submission date: 2010-06-01

Abstract

With the rapid development of the Internet, and in particular the rise of subjective media such as forums and blogs, the strict boundary between information publishers and receivers has been broken. Text is becoming one of the most important means of interaction, and the opinions it contains are drawing increasing attention from companies and governments. This change has also caused the volume of textual information online to grow explosively. Text sentiment analysis, as a means of automatically extracting this opinion information, has become a hot topic in natural language processing.
     Text sentiment analysis examines the attitude (or opinion, or emotion) of a speaker or writer, that is, the subjective information in a text. Word-level sentiment analysis is the foundation of text sentiment analysis and plays a pivotal role. Phrases bridge words and sentences: working at the phrase level enlarges the granularity of sentiment analysis, which matters for improving sentence-level and even document-level sentiment analysis systems.
     Lexicon-based methods for word sentiment analysis assign each sentiment word a fixed prior polarity, regardless of context. To address this, this thesis proposes a method for obtaining the context-dependent sentiment polarity of words. Because annotated resources with context-dependent sentiment labels are lacking, a corpus is built by combining maximum entropy cross-validation with manual correction. On this basis, a set of contextual features is constructed to predict the polarity of a sentiment word in its context. Experiments show that this method improves the F-score by 4.9% over the lexicon-based method.
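The corpus-construction step can be pictured as follows. This is a minimal sketch, assuming that "cross-validation plus manual correction" means: train on part of the provisionally labeled data, predict the held-out part, and send disagreements to a human annotator. The thesis uses a maximum entropy classifier; the back-off counting classifier here is only a stand-in to keep the example self-contained, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def train_backoff(examples):
    """Stand-in classifier (NOT maximum entropy): majority label for the
    (word, neighbor) pair, backing off to the word, then to all data."""
    by_pair, by_word, overall = defaultdict(Counter), defaultdict(Counter), Counter()
    for word, neighbor, label in examples:
        by_pair[(word, neighbor)][label] += 1
        by_word[word][label] += 1
        overall[label] += 1

    def predict(word, neighbor):
        for counts in (by_pair.get((word, neighbor)), by_word.get(word), overall):
            if counts:
                return counts.most_common(1)[0][0]
        raise ValueError("empty training set")

    return predict

def flag_for_review(examples, k=3, train=train_backoff):
    """Return indices whose held-out prediction disagrees with the current label."""
    flagged = []
    for fold in range(k):
        training = [ex for i, ex in enumerate(examples) if i % k != fold]
        predict = train(training)
        for i in range(fold, len(examples), k):  # held-out items of this fold
            word, neighbor, label = examples[i]
            if predict(word, neighbor) != label:
                flagged.append(i)
    return sorted(flagged)

# Toy data: "high" is positive next to "quality" and negative next to "price".
# Item 4 carries a deliberately wrong provisional label.
EXAMPLES = [
    ("high", "quality", "pos"), ("high", "price", "neg"), ("high", "quality", "pos"),
    ("high", "price", "neg"), ("high", "quality", "neg"), ("high", "price", "neg"),
    ("high", "quality", "pos"), ("high", "price", "neg"), ("high", "quality", "pos"),
]

print(flag_for_review(EXAMPLES))
```

Only the flagged items need manual inspection, which is what makes the combination cheaper than annotating every occurrence by hand.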
     For two-word phrases, a rule-based method is used. Feature templates are constructed, and pointwise mutual information (PMI) is used to compute the sentiment polarity of a chunk. The thesis also describes how degree adverbs and negation adverbs affect chunk polarity, and how these adverbs were collected.
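The abstract does not give the exact PMI formulation, so the following is a sketch under an assumption: polarity is scored as the difference between a phrase's PMI with positive seed words and its PMI with negative seed words, a widely used formulation. The modifier handling (negation flips the sign, a degree adverb scales the magnitude) is likewise one common convention rather than the thesis's stated rule. All counts are invented.

```python
import math

def pmi(joint, count_x, count_y, total):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), estimated from raw counts.
    Counts are assumed nonzero; a real system would need smoothing."""
    return math.log2(joint * total / (count_x * count_y))

def orientation(phrase_count, near_pos, near_neg, pos_count, neg_count, total):
    """Polarity score: PMI with positive seeds minus PMI with negative seeds."""
    return (pmi(near_pos, phrase_count, pos_count, total)
            - pmi(near_neg, phrase_count, neg_count, total))

def apply_modifiers(score, negated=False, degree=1.0):
    """Hypothetical rule: negation flips the sign, a degree adverb scales it."""
    if negated:
        score = -score
    return degree * score

# Toy counts: the phrase occurs 50 times in 10000 windows, 16 times near
# positive seeds (400 occurrences) and 2 times near negative seeds (200).
score = orientation(phrase_count=50, near_pos=16, near_neg=2,
                    pos_count=400, neg_count=200, total=10000)
print(score)                                            # PMI 3.0 minus PMI 1.0
print(apply_modifiers(score, negated=True, degree=1.5))
```

A positive score marks the phrase as positive and a negative score as negative; the degree weight of 1.5 is an arbitrary illustrative value.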
     For the more general problem of chunk sentiment analysis, a classification approach is used. Features include the sentiment polarity of the words in the phrase and the phrase type, among others. A maximum entropy model and a support vector machine are each applied to classify chunk polarity, and the results are compared with the traditional method that sums word polarities. The support vector machine achieves the best result.
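To illustrate why a classifier can beat the summation baseline, here is a small sketch. The baseline adds up word polarities and takes the sign; the feature function instead keeps the chunk type and the order of the polarities so that a trained model (maximum entropy or SVM in the thesis) could learn patterns the sum discards. The feature names and the toy polarity values are assumptions, since the abstract names only "word polarity and chunk type, etc."

```python
def summation_polarity(polarities):
    """Baseline: sign of the summed word polarities (+1, 0, or -1)."""
    total = sum(polarities)
    return (total > 0) - (total < 0)

def chunk_features(polarities, chunk_type):
    """Hypothetical feature dictionary for a chunk-level classifier."""
    return {
        "chunk_type=" + chunk_type: 1,
        "num_positive": sum(p > 0 for p in polarities),
        "num_negative": sum(p < 0 for p in polarities),
        "polarity_sequence=" + ",".join(str(p) for p in polarities): 1,
    }

# Toy chunk: a verb phrase such as "solve (the) problem", with assumed
# lexicon polarities +1 for "solve" and -1 for "problem".
polarities = [1, -1]
print(summation_polarity(polarities))    # the sum cancels out to neutral
print(chunk_features(polarities, "VP"))
```

In this toy case the baseline returns neutral even though the phrase reads as positive, while the feature dictionary still records that a positive verb precedes a negative noun inside a verb phrase.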
     Finally, sentence-level sentiment polarity is analyzed using words and using phrases, respectively. The results show that phrases enlarge the granularity of sentiment analysis and substantially improve sentence-level performance.
     Using the methods above, this thesis studies phrase sentiment analysis at two levels: contextual sentiment polarity disambiguation of Chinese words, and phrase-level sentiment analysis. The sentence-level results show that the system contributes positively to text sentiment analysis.

