Research on Sentiment Analysis Techniques for Chinese Web Reviews
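Abstract
With the rapid development of network technology, the Web has become an important source of information for a growing number of people and, at the same time, a platform on which they express their own opinions. Mining and analyzing the rapidly growing body of online text, especially the reviews that users post on their own initiative, to identify sentiment orientation and how it evolves makes it possible to better understand user behavior and to analyze trending public opinion. It can also give governments, enterprises, and other institutions an important basis for decision-making.
     This paper first introduces the research background and application prospects of sentiment analysis. Taking Chinese Web reviews as the object of study, it then describes their concept and characteristics. Following the sentiment analysis workflow for Web reviews, it goes on to study two aspects in depth: the acquisition and preprocessing of Web reviews, and methods for analyzing their sentiment. For the latter, the paper examines text sentiment analysis based on text classification techniques and based on a sentiment dictionary.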
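     The value of text sentiment analysis lies in drawing summary conclusions from the reviews on a given topic, which first requires obtaining a large volume of review data from the Web. Reviews on the same topic are usually concentrated on a few sites, and the pages within one site are highly structured. Exploiting this property, this paper designs a real-time Web page processing technique based on message-oriented middleware that downloads and preprocesses pages in parallel, producing review data ready for sentiment analysis.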
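The parallel download-and-preprocess idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: an in-process `queue.Queue` stands in for the message-oriented middleware, and the page contents, URLs, and the `fetch`, `preprocess`, and `run_pipeline` names are all hypothetical.

```python
import queue
import re
import threading

# Hypothetical canned pages standing in for downloaded HTML; a real
# crawler would fetch these over HTTP.
FAKE_PAGES = {
    "http://example.com/r/1": "<div class='review'>屏幕很好</div>",
    "http://example.com/r/2": "<div class='review'>电池 <b>太差</b></div>",
    "http://example.com/r/3": "<div class='review'>物流很快</div>",
}

TAG_RE = re.compile(r"<[^>]+>")


def fetch(url):
    """Placeholder download step (assumption: no real network access)."""
    return FAKE_PAGES[url]


def preprocess(html):
    """Strip tags and collapse whitespace to get the plain review text."""
    return " ".join(TAG_RE.sub(" ", html).split())


def worker(tasks, results):
    """Consume URLs from the task queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        if url is None:
            break
        results.put((url, preprocess(fetch(url))))


def run_pipeline(urls, n_workers=2):
    """Publish URLs to a queue and let several workers process them in parallel."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)          # one stop sentinel per worker
    for t in threads:
        t.join()
    out = {}
    while not results.empty():   # safe here: all workers have finished
        url, text = results.get()
        out[url] = text
    return out


reviews = run_pipeline(sorted(FAKE_PAGES))
```

With a real message broker, the producer and the workers could run in separate processes or on separate machines; the queue-based decoupling shown here is the part that carries over.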
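     Next, this paper applies two sentiment analysis methods built on different ideas. (1) Text classification: building on traditional feature selection methods, it first proposes a joint feature selection algorithm based on relevance and redundancy, which aims to remove redundant features and retain those that help classification, thereby improving text sentiment classification; a support vector machine classifier is then used to classify sentiment polarity. (2) Sentiment dictionary: a sentiment dictionary is built from HowNet (《知网》) and used to compute the sentiment orientation of Chinese words; the sentiment orientation values of phrases in the text are then computed according to phrase structure; finally, these values are summed to obtain the sentiment orientation value of the whole review.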
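The dictionary-based method (2) can be sketched roughly as follows. This is an illustrative toy, not the paper's algorithm: the lexicon values are made up rather than derived from HowNet, the input is assumed to be already word-segmented, and the phrase rule (negators flip the sign and degree adverbs scale the next sentiment word) is an assumed simplification of "phrase structure".

```python
# Toy lexicon: word -> orientation in [-1, 1]. These values are made up
# for illustration; they are not derived from HowNet.
LEXICON = {"好": 1.0, "喜欢": 0.8, "差": -1.0, "失望": -0.8}
NEGATORS = {"不", "没有"}                          # flip the sign of the next sentiment word
DEGREE = {"很": 1.5, "非常": 2.0, "有点": 0.5}      # scale the next sentiment word


def phrase_scores(tokens):
    """Yield one score per sentiment word, adjusted by preceding modifiers."""
    weight, sign = 1.0, 1.0
    for tok in tokens:
        if tok in NEGATORS:
            sign = -sign
        elif tok in DEGREE:
            weight *= DEGREE[tok]
        elif tok in LEXICON:
            yield sign * weight * LEXICON[tok]
            weight, sign = 1.0, 1.0
        else:
            weight, sign = 1.0, 1.0   # an unrelated word ends the phrase


def review_score(tokens):
    """Sum the phrase-level scores to get the review-level orientation."""
    return sum(phrase_scores(tokens))


def polarity(tokens):
    s = review_score(tokens)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"


tokens = ["屏幕", "很", "好", ",", "电池", "不", "好"]
print(review_score(tokens), polarity(tokens))   # 0.5 positive
```

Here "很 好" contributes 1.5 and "不 好" contributes -1.0, so the summed review score is 0.5 and the review is labeled positive.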
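     Finally, a public online review data set and a manually annotated data set collected in this project are used as test data to compare the two proposed methods. The experimental results show that both sentiment analysis methods are effective, and that the sentiment-dictionary-based method performs slightly better than the text-classification-based method.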
With the rapid development of Web technology, the Web has become an important source of information for more and more people. Meanwhile, it is becoming a significant platform for people to express their viewpoints. Mining and analyzing this rapidly expanding information on the Web, especially the sentiment of online reviews posted by users, can improve our understanding of the consumption habits and public opinions of various users. It also plays a crucial role in decision-making for institutions such as enterprises and the government.
     First, this paper introduces the background and prospects of sentiment analysis and describes the concept and features of Chinese Web reviews. Then, following the sentiment analysis process for Web reviews, it investigates how to gather and preprocess Web reviews and how to analyze their sentiment. For sentiment analysis, it studies two methods, one based on text classification and one based on a sentiment dictionary.
     The greatest value of sentiment analysis lies in generating summaries from many reviews that focus on the same topic, which raises the question of how to collect large numbers of reviews scattered across the Web. Generally, reviews on one topic are concentrated on a few Websites, and Web pages within the same Website are highly structured. This paper therefore designs a real-time Web page processing technique based on Message-Oriented Middleware that downloads and preprocesses Web pages in parallel to obtain review data for sentiment analysis.
     Then, this paper proposes two approaches to sentiment analysis. First, based on text classification technology, we propose a joint feature selection method based on relevance and redundancy that eliminates redundant features and finds features significant for classification, thereby improving the accuracy of text sentiment classification; the well-known support vector machine classifier is then used to classify sentiment polarity. Second, based on sentiment dictionary technology, we use HowNet to construct a sentiment dictionary, which is used to compute the sentiment orientation of words and phrases in the reviews. The sentiment orientations of the phrases are then summed to obtain the sentiment orientation of each review.
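To make the relevance-and-redundancy idea concrete, here is one possible sketch. The abstract does not specify the actual algorithm, so everything below is an assumption: relevance is measured as mutual information between a feature and the class label, redundancy as mutual information between two features, and selection is a simple greedy pass with a hypothetical `max_redundancy` threshold. The toy data are invented.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def select_features(columns, labels, k, max_redundancy=0.5):
    """Greedy selection: most relevant first, skipping redundant features.

    columns: dict mapping feature name -> list of 0/1 presence values,
             one per document (assumed representation).
    """
    relevance = {f: mutual_information(col, labels)
                 for f, col in columns.items()}
    ranked = sorted(columns, key=lambda f: (-relevance[f], f))
    selected = []
    for f in ranked:
        if len(selected) == k:
            break
        if relevance[f] <= 0:
            continue                      # carries no information about the class
        if all(mutual_information(columns[f], columns[g]) <= max_redundancy
               for g in selected):
            selected.append(f)            # relevant and not redundant
    return selected


# Invented toy data: 8 reviews, label 1 = positive, 0 = negative.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "好":   [1, 1, 1, 0, 0, 0, 0, 0],
    "不错": [1, 1, 1, 0, 0, 0, 0, 0],     # duplicates "好", so it is redundant
    "差":   [0, 0, 0, 0, 1, 1, 1, 0],
    "的":   [1, 0, 1, 0, 1, 0, 1, 0],     # unrelated to the label
}
selected = select_features(columns, labels, k=3)
print(selected)
```

On this toy data, only one of the two identical features "好" and "不错" survives, and "的" is dropped because it is unrelated to the label. The selected features would then be used to build the vectors passed to a classifier such as a support vector machine.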
     Finally, we use the two proposed methods to analyze the sentiment orientation of a public data set as well as the data sets collected in this research. The experimental results show that the feature selection method and the sentiment-dictionary-based sentiment analysis method proposed in this paper are effective, and that the sentiment-dictionary-based method outperforms the text-classification-based method.
