Research on Text Sentiment Classification
Abstract
Text sentiment classification assigns a sentiment orientation to a text by mining and analyzing the subjective information it contains, such as standpoints, opinions, and emotions. As expressing opinions on the web becomes ever more common, research on text sentiment classification is becoming increasingly important.
     This paper presents and implements a text sentiment classification algorithm consisting of two stages: subjectivity classification and polarity classification.
     Subjectivity classification consists of a training procedure and a classification procedure. In the training procedure, feature representations of sentences are derived from a labeled training text set through text preprocessing, text representation, and feature selection; a subjectivity classification model is then trained on these representations. In the classification procedure, the input sentences pass through the same preprocessing, representation, and feature-selection steps; the subjectivity classification algorithm applies the trained model to obtain a preliminary split into objective and subjective sentences, and dynamic programming finally corrects the preliminary labels to yield the subjective text subset, as sketched below.
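     The thesis reports a Bayes classifier whose preliminary subjective/objective labels are corrected by dynamic programming, but the abstract does not spell out the recurrence; the following is a minimal Python sketch of one plausible Viterbi-style correction, assuming the classifier outputs a per-sentence probability for each of the two labels. The name dp_smooth and the switch_penalty parameter are illustrative, not taken from the thesis.

    import numpy as np

    def dp_smooth(probs, switch_penalty=1.5):
        # probs: (n_sentences, 2) array of classifier probabilities,
        # column 0 = objective, column 1 = subjective.
        # Returns the label sequence that maximizes the summed log-probabilities
        # minus a penalty each time the label changes between adjacent sentences.
        logp = np.log(np.clip(probs, 1e-12, None))
        n = len(logp)
        score = np.zeros((n, 2))
        back = np.zeros((n, 2), dtype=int)
        score[0] = logp[0]
        for t in range(1, n):
            for lab in (0, 1):
                stay = score[t - 1, lab]
                switch = score[t - 1, 1 - lab] - switch_penalty
                if stay >= switch:
                    score[t, lab] = stay + logp[t, lab]
                    back[t, lab] = lab
                else:
                    score[t, lab] = switch + logp[t, lab]
                    back[t, lab] = 1 - lab
        labels = [int(np.argmax(score[-1]))]
        for t in range(n - 1, 0, -1):
            labels.append(int(back[t, labels[-1]]))
        return labels[::-1]

     Sentences whose corrected label is 1 would form the subjective text subset passed on to the polarity stage.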
     In the training procedure of polarity classification, a labeled source-domain text set and an unlabeled target-domain text set together form the training data. Feature representations of the training sentences are obtained through text preprocessing, text representation, feature selection, and pivot-based SCL (structural correspondence learning) feature selection; a polarity classification model is then trained on these representations. In the classification procedure, the subjective sentences produced by the previous stage pass through the same steps, and the polarity classification algorithm applies the trained model to divide them into a positive sentence subset and a negative sentence subset. A sketch of pivot-based SCL feature construction follows.
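     In SCL with pivot features, the pivots are features that occur frequently in both the source and the target domain; a linear predictor of each pivot's occurrence is trained from the non-pivot features, and the top singular vectors of the stacked predictor weights define a low-dimensional projection that is appended to the original representation. The minimal sketch below uses dense matrices and a ridge closed form in place of the SGD-trained pivot predictors of the original SCL work; the function names and parameter defaults (num_pivots, dim, l2) are assumptions for illustration, not the thesis's settings.

    import numpy as np

    def select_pivots(X_src, X_tgt, num_pivots=100, min_count=5):
        # Pivots: features that occur in enough documents of BOTH domains.
        # X_src, X_tgt: dense (documents x features) count matrices.
        src_df = (X_src > 0).sum(axis=0)
        tgt_df = (X_tgt > 0).sum(axis=0)
        common = np.minimum(src_df, tgt_df)
        candidates = np.where(common >= min_count)[0]
        return candidates[np.argsort(-common[candidates])][:num_pivots]

    def scl_projection(X_all, pivots, dim=25, l2=1.0):
        # Train one linear predictor per pivot (ridge closed form here, standing
        # in for SGD-trained predictors) and keep the top right singular vectors
        # of the stacked weight matrix as the projection theta.
        nonpivot = np.setdiff1d(np.arange(X_all.shape[1]), pivots)
        Xn = X_all[:, nonpivot]
        A = Xn.T @ Xn + l2 * np.eye(len(nonpivot))    # shared Gram matrix
        W = np.zeros((len(pivots), len(nonpivot)))
        for i, p in enumerate(pivots):
            y = np.where(X_all[:, p] > 0, 1.0, -1.0)  # does the pivot occur?
            W[i] = np.linalg.solve(A, Xn.T @ y)
        _, _, Vt = np.linalg.svd(W, full_matrices=False)
        return Vt[:dim], nonpivot

    def augment(X, theta, nonpivot):
        # Append the low-dimensional SCL features to the original representation.
        return np.hstack([X, X[:, nonpivot] @ theta.T])

     The augmented source-domain features would train the polarity classifier, and the same projection would be applied to the target-domain sentences before classification.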
     Experiments show that the precision of the preliminary subjectivity classification is 94.7%, the precision of the Bayes classifier with dynamic-programming correction is 95.8%, and the logistic average misclassification percentage (LAMP) of polarity classification with pivot-based SCL feature selection is 0.16, lower than that of the standard SCL algorithm.
