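Research on Sentiment Text Classification of Chinese Customer Reviews Based on Support Vector Machines (基于支持向量机的中文客户评论情感文本分类研究)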

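English title: Research on Sentiment Text Classification of Chinese Customer Reviews Based on SVM
Author: Tao Min (陶敏)
Degree level: Master's
Discipline: Management Science and Engineering
Chinese keywords: 情感分类 ; 支持向量机 ; 停用词表 ; 客户评论
English keywords: sentiment classification ; support vector machine ; stop word list ; customer reviews
Degree year: 2011
Advisor: Xia Huosong (夏火松)
Discipline code: 1201
Degree-granting institution: Wuhan Textile University (武汉纺织大学)
Submission date: 2011-04-01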

Abstract

As online media have grown richer in form and content, more and more people post their opinions in forums and comment sections. Articles and remarks that carry personal sentiment now appear online in large numbers, and customer reviews in particular have an important influence on the purchase decisions of online consumers. How to automatically extract valuable information from massive volumes of customer review text has therefore become a pressing problem. This thesis studies how traditional topic-based text classification methods can be applied to sentiment text classification, using statistical methods. Drawing on traditional topic-based Chinese text classification techniques, it analyzes the key technical problems of Chinese sentiment text classification, with a focus on the procedures and methods that improve classification accuracy. It analyzes how different feature selection methods, feature representation methods, and classifier models affect the accuracy of Chinese sentiment text classification.
The thesis investigates the key techniques of sentiment text classification and settles on an effective classification model: it identifies relatively effective feature selection and feature representation methods and an effective sentiment text classifier for Chinese. It also proposes four different stop word lists built on the characteristics of Chinese sentiment text classification, analyzes experimentally how much each list contributes to classification, and recommends an effective stop word list. Finally, the classification model obtained from the experiments is applied to a practical problem to verify the findings. The SVM-based sentiment text classification model is applied to product recommendation: product review texts from a well-known Chinese shopping website are classified, effective features of consumers' product reviews are extracted, and the sentiment orientation of the customer reviews is determined. The results are analyzed and constructive suggestions are offered for applying sentiment text classification.
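The pipeline the abstract describes (stop word filtering, feature selection, feature representation, and a support vector machine classifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the thesis: the toy reviews, the three-word stop word list, chi-square scoring as the feature selection method, Boolean feature vectors, and a plain subgradient-descent trainer standing in for whatever SVM package was actually used. The reviews are assumed to be already segmented into space-separated words.

```python
from collections import Counter

# Hypothetical toy data: reviews already segmented into space-separated words,
# labelled +1 (positive) or -1 (negative). Not taken from the thesis.
TRAIN = [
    ("质量 很 好 非常 满意", 1),
    ("物流 很 快 值得 推荐", 1),
    ("非常 好 用 很 满意", 1),
    ("质量 很 差 非常 失望", -1),
    ("物流 太 慢 不 推荐", -1),
    ("很 差 不 好 用 失望", -1),
]
# Hypothetical stop word list. The negation word "不" is deliberately kept,
# because it carries sentiment.
STOP_WORDS = {"很", "非常", "太"}


def tokenize(text):
    """Split a pre-segmented review and drop stop words."""
    return [t for t in text.split() if t not in STOP_WORDS]


def chi_square_select(docs, labels, k):
    """Return the k terms with the highest chi-square score (two classes)."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            (df_pos if y == 1 else df_neg)[t] += 1
    scores = {}
    for t in set(df_pos) | set(df_neg):
        a, b = df_pos[t], df_neg[t]      # documents containing t, per class
        c, d = n_pos - a, n_neg - b      # documents without t, per class
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[t] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Ties are broken by the term itself so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def vectorize(tokens, vocab):
    """Boolean representation: 1.0 if the term occurs in the review."""
    present = set(tokens)
    return [1.0 if t in present else 0.0 for t in vocab]


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Linear SVM without bias: subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1 - eta * lam) for wj in w]      # regularization step
            if margin < 1:                               # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
VOCAB = chi_square_select(docs, labels, k=6)
WEIGHTS = train_linear_svm([vectorize(d, VOCAB) for d in docs], labels)


def classify(text):
    """Return +1 for a positive review and -1 for a negative one."""
    x = vectorize(tokenize(text), VOCAB)
    score = sum(w * v for w, v in zip(WEIGHTS, x))
    return 1 if score > 0 else -1


print(classify("非常 满意 值得 推荐"))
print(classify("质量 太 差 失望"))
```

Changing `STOP_WORDS` (for example, adding "不") and re-running the sketch is a small-scale way to see why a stop word list designed for topic classification may hurt sentiment classification, which is the kind of comparison the abstract reports across its four stop word lists.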

