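Research and Application of Orientation Analysis of Online Shopping Reviews Based on Opinion Mining

English title: Opinion Mining Based Sentiment Analysis for Online Products Reviews Research and Application
Author: Fan Yingxiang (范英翔)
Degree level: Master's
Discipline: Computer Application Technology
Keywords (translated from Chinese): text orientation analysis; part-of-speech patterns; online shopping reviews; opinion mining
English keywords: text sentiment analysis; POS patterns; online product reviews; opinion mining
Degree year: 2012
Advisor: Song Hui (宋晖)
Discipline code: 081203
Degree-granting institution: Donghua University (东华大学)
Thesis submission date: 2011-12-01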

Abstract

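The rapid growth of the Internet has made online shopping increasingly popular and has greatly changed the way people shop. People's impressions of products and of the shopping process, once passed on by word of mouth, now spread through online shopping reviews. These reviews matter greatly to both ordinary buyers and product manufacturers. This thesis aims to analyze online shopping reviews and extract the orientation people express toward products, helping consumers choose suitable products and helping manufacturers improve product quality in a targeted way.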
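     Text orientation analysis based on opinion mining generally treats a document or sentence as a collection of words, phrases, or patterns. It identifies the key words, phrases, or patterns, computes an orientation value for each, and sums the results to obtain the orientation value of the document or sentence under analysis. The analysis is usually carried out in four steps: data collection, text preprocessing, orientation identification and judgment, and presentation of results.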
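     The accumulation step can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the lexicon, its scores, and the function names are hypothetical, and the input is assumed to be already segmented into words.

```python
# Hypothetical orientation lexicon: positive values are favorable,
# negative values unfavorable. Entries and scores are illustrative only.
LEXICON = {"good": 1.0, "fast": 0.5, "poor": -1.0, "slow": -0.5}

def sentence_orientation(tokens, lexicon=LEXICON):
    """Sum the orientation values of the recognized units in one sentence."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

def label(score):
    """Map an accumulated score to a coarse orientation label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = ["delivery", "fast", "screen", "good"]
score = sentence_orientation(review)
print(score, label(score))
```

     Words missing from the lexicon contribute nothing, so a sentence with no recognized unit is labeled neutral.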
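     This thesis studies existing text orientation analysis methods in depth and crawls online shopping review data from Jingdong Mall (京东商城). By analyzing and tallying the data, it summarizes the characteristics of online shopping reviews and proposes a part-of-speech (POS) pattern extraction and merging algorithm (the POSEM algorithm). The algorithm is applied to extract the effective POS patterns from the training data set, and pattern-matching rules are then designed according to the characteristics of those patterns. Finally, the rules are used to extract center words and opinion words from the test set and to judge the orientation of review sentences. Experimental results show that the proposed method achieves fairly high precision and recall.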
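     Precision and recall, the two measures reported here, can be computed as in the sketch below. The sentence ids are made-up data for illustration, not results from the thesis.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical ids of sentences judged subjective by a system vs. by annotators.
predicted = {1, 2, 3, 5}
gold = {2, 3, 4, 5, 6}
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```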
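     The main work of this thesis is as follows:
     (1) Drawing on existing theory of text orientation analysis, the thesis analyzes and tallies the collected review data in depth and summarizes the characteristics relevant to orientation analysis. In review sentences, adjectives contribute most to orientation judgment: the proportion of their occurrences that fall in subjective sentences is the largest, at 86.87%. Nouns and adverbs come next, at 71.64% and 70.79% respectively. Other parts of speech, such as verbs and prepositions, also play an important role in orientation analysis.
     (2) Based on this analysis of the review data, the thesis designs and implements the POS pattern extraction and merging algorithm (POSEM). The algorithm represents POS pattern information in the form "POS\T\O" and defines a length threshold, a frequency threshold, and upper and lower probability thresholds for, respectively, the length of a pattern, its frequency in the data set, and the probability that it occurs in a subjective sentence. Patterns that satisfy the lower probability threshold are used to negate the orientation of a review sentence. The extraction algorithm extracts, from the preprocessed training text, the POS patterns that satisfy all thresholds. For patterns that satisfy only the length threshold and the probability thresholds, the merging algorithm attempts to merge them, while preserving the center-word and opinion-word information in the patterns, to obtain fuzzy patterns that meet all threshold requirements. This design can improve the recall of orientation analysis to some extent.
     (3) Based on an analysis of the POS patterns extracted by POSEM, the thesis designs pattern-matching rules and uses them to identify center words and opinion words in the test text. The center words and opinion words extracted with high precision are then used to process the remaining unprocessed text, and finally the orientation of each review sentence is determined according to a set of summarized orientation-judgment rules. Analysis of the experimental results shows that the proposed method achieves fairly high precision and recall.
     (4) The thesis designs and implements a general-purpose framework for text orientation analysis whose components can be swapped flexibly to meet different experimental needs. In the preprocessing module, the system defines a unified format for POS tags; when a different word segmentation tool is substituted, its own tag format only needs to be converted to the system's format. In the text analysis module, the training, testing, and application components can be replaced easily. On top of this framework, and integrating open-source tools, the thesis designs and implements a prototype experimental platform for text analysis. The platform integrates a data collection module, a text preprocessing module, a text orientation analysis module, and a result presentation module.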
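     The threshold-based extraction in item (2) can be illustrated with the simplified sketch below. It is not the POSEM algorithm itself: patterns are assumed to be contiguous POS n-grams, the threshold values and the function name `extract_patterns` are hypothetical, and the merging step is omitted.

```python
from collections import Counter

def extract_patterns(sentences, max_len=3, min_freq=2, upper=0.8, lower=0.2):
    """Return {pattern: (frequency, p_subjective)} for patterns passing all thresholds.

    sentences: list of (pos_tags, is_subjective) pairs, pos_tags a list of strings.
    A pattern is kept when it is frequent enough and its probability of
    occurring in a subjective sentence is above `upper` or below `lower`.
    """
    total = Counter()       # number of sentences containing the pattern
    subjective = Counter()  # number of subjective sentences containing it
    for tags, is_subj in sentences:
        seen = set()
        for n in range(1, max_len + 1):          # length threshold
            for i in range(len(tags) - n + 1):
                seen.add(tuple(tags[i:i + n]))
        for pat in seen:
            total[pat] += 1
            if is_subj:
                subjective[pat] += 1
    kept = {}
    for pat, freq in total.items():
        prob = subjective[pat] / freq
        if freq >= min_freq and (prob >= upper or prob <= lower):
            kept[pat] = (freq, prob)
    return kept

# Toy training data: POS tag sequences with a subjectivity flag.
train = [
    (["n", "d", "a"], True),
    (["n", "d", "a"], True),
    (["n", "v", "n"], False),
    (["n", "v", "n"], False),
    (["n", "a"], True),
]
patterns = extract_patterns(train)
print(patterns[("n", "d", "a")])  # (2, 1.0)
```

     In this toy data the single tag `n` is dropped because its subjective probability (0.6) lies between the two probability thresholds, and `n a` is dropped because it occurs only once.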
The rapid development of the Internet has made online shopping increasingly popular, which has greatly changed how people consume. People's feelings about goods and the shopping process now spread not only by word of mouth but also through online reviews, so these reviews are important to both consumers and producers. This thesis analyzes online reviews to extract people's attitudes and emotions toward goods, in order to help consumers choose products and help producers improve product quality.
     Generally, opinion-mining-based sentiment analysis treats a text or sentence as a collection of words, phrases, or patterns. The sentiment values of those words, phrases, or patterns are calculated and accumulated to obtain the sentiment value of the text or sentence. Sentiment analysis has four steps: data collection, text preprocessing, sentiment identification, and result display.
     The thesis studies existing sentiment analysis methods in depth and crawls online reviews from Jingdong Mall. Through analysis and statistics of the data, it summarizes the characteristics of the data and presents an algorithm for the extraction and merging of POS patterns (POSEM). With this algorithm, the effective POS patterns are extracted from the training data set. According to the characteristics of the POS patterns, the thesis designs pattern-matching rules and uses them to extract the center words and opinion words from the test set, from which the sentiment of the online reviews is determined. The experiments show that the proposed method achieves high precision and recall.
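     Matching a POS pattern against a tagged sentence to pull out a center word and an opinion word might look like the sketch below. The abstract does not define its pattern notation, so the encoding here is an assumption: each pattern element pairs a POS tag with a role, "T" for a center word, "O" for an opinion word, or None. The tags, the sample sentence, and `match_pattern` are hypothetical.

```python
def match_pattern(tagged, pattern):
    """Find the first window of `tagged` whose POS tags match `pattern`.

    tagged:  list of (word, pos) pairs for one sentence.
    pattern: list of (pos, role) pairs; role is "T" (center word),
             "O" (opinion word) or None. This role encoding is assumed.
    Returns (center_words, opinion_words), or None when nothing matches.
    """
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(pos == p for (_, pos), (p, _) in zip(window, pattern)):
            centers = [w for (w, _), (_, role) in zip(window, pattern) if role == "T"]
            opinions = [w for (w, _), (_, role) in zip(window, pattern) if role == "O"]
            return centers, opinions
    return None

# Pattern: noun (center word), adverb, adjective (opinion word).
pattern = [("n", "T"), ("d", None), ("a", "O")]
sentence = [("this", "r"), ("screen", "n"), ("very", "d"), ("clear", "a")]
print(match_pattern(sentence, pattern))  # (['screen'], ['clear'])
```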
     The main work of this thesis is as follows:
     1. Building on existing theory of text sentiment analysis, the thesis conducts in-depth analysis and statistics on the review data and summarizes the characteristics relevant to sentiment analysis. In review sentences, adjectives contribute the most to sentiment analysis: the ratio of their number in subjective sentences to their total number is 86.87%. Nouns and adverbs follow, with ratios of 71.64% and 70.79% respectively. Other parts of speech, such as verbs and prepositions, also play an important role in sentiment analysis.
     2. Based on the analysis of the product review data, the thesis designs the algorithm for extraction and merging of POS patterns (POSEM). The algorithm marks a POS pattern with "POS\T\O" and sets a length threshold, a frequency threshold, and upper and lower probability thresholds on, respectively, the length of a POS pattern, its frequency, and its probability of occurring in subjective sentences. POS patterns that meet the lower probability threshold are used to negate the sentiment of a review sentence. The extraction algorithm extracts the POS patterns that meet all of the thresholds. POS patterns that meet only the length threshold and the probability thresholds are merged, where possible, into patterns that meet all of the thresholds. This design can improve the recall of sentiment analysis to some extent.
     3. Based on the analysis of the POS patterns extracted by the POSEM algorithm, the thesis designs pattern-matching rules, with which the center words and opinion words are extracted from the test set. The words extracted with high precision are then used to identify those in the remaining text. The experimental results show that the proposed method reaches high precision and recall.
     4. A generic framework is designed and implemented whose components can be replaced flexibly to meet the needs of different experiments. In the preprocessing module, the system defines a uniform format for POS tags; when a different word segmentation tool is substituted, its POS format only needs to be transformed to the uniform one. In the analysis module, the training, testing, and application components can be replaced easily. Based on the framework and combining open-source tools, the thesis designs and implements a prototype experimental platform for text analysis. The system integrates a data collection module, a text preprocessing module, a text sentiment analysis module, and a result display module.
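     The uniform POS format mentioned in item 4 amounts to a per-tool mapping table. The sketch below shows one way to do it; the unified tag names, both mapping tables, and `normalize` are hypothetical and do not describe the platform's actual format.

```python
# Hypothetical mapping tables from two segmenters' tag sets to a unified
# tag set (noun, verb, adj, adv, prep, other).
TOOL_A_MAP = {"n": "noun", "v": "verb", "a": "adj", "d": "adv", "p": "prep"}
TOOL_B_MAP = {"NN": "noun", "VV": "verb", "VA": "adj", "AD": "adv", "P": "prep"}

def normalize(tagged, tag_map):
    """Convert (word, tool_tag) pairs to (word, unified_tag) pairs.

    Tags missing from the mapping table fall back to "other".
    """
    return [(word, tag_map.get(tag, "other")) for word, tag in tagged]

from_tool_a = normalize([("screen", "n"), ("clear", "a")], TOOL_A_MAP)
from_tool_b = normalize([("screen", "NN"), ("clear", "VA")], TOOL_B_MAP)
print(from_tool_a == from_tool_b)  # True
```

     With this design, swapping the segmentation tool only requires supplying a new mapping table; the downstream analysis components see the same tags.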

