Research on Product Opinion Mining Algorithms for Web Text
Abstract
With the widespread use of the Internet, large numbers of customer reviews of products and services now appear on Web sites such as blogs, BBS forums, and wikis. This thesis targets such Web review text: it extracts product attribute (feature) words and evaluative sentiment (opinion) words from the text, and then determines the polarity of the opinions that customers hold. Experiments support the applicability of each method used, and the system built on these methods is reported to be robust. The main work is as follows:
     1. Drawing on Web resources, we first use HTML-tag-based pattern matching to extract product attribute words from specific Web pages and build a basic dictionary of evaluation targets. We then use a search engine to collect review text, extract sentiment words from it, and compute the orientation of these words with HowNet, producing a sentiment lexicon that reflects colloquial usage.
     2. Using Chinese dependency parsing combined with other semantic features, we extract further attribute words to expand the attribute dictionary. A bipartite graph model is then used to train the attribute words and sentiment words against each other repeatedly. Finally, the newly learned attribute words and sentiment words are written to their respective dictionaries, and each matched attribute word and sentiment word is written to a text file as a pair (2-tuple).
     3. We manually build lists of negation words, transition (adversative) words, and degree words, and define a scoring model for review sentiment words. Each extracted sentiment word is scored and its polarity is then determined, which gives the customer's opinion of, or attitude toward, the product attribute.
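The mutual training of attribute words and sentiment words in item 2 can be pictured with a small sketch. The abstract does not specify the bipartite graph model, so the code below swaps in a simple bootstrapping rule as an assumption: a candidate word is accepted once it co-occurs with at least `min_links` already-accepted words on the other side of the graph. The function name `mutual_bootstrap`, the `min_links` parameter, and the sample words are all hypothetical.

```python
from collections import defaultdict


def mutual_bootstrap(pairs, seed_attributes, seed_sentiments, min_links=1):
    """Grow both word sets by propagating over an attribute-sentiment bipartite graph.

    pairs: iterable of (attribute_candidate, sentiment_candidate) co-occurrences,
    for example produced by a dependency parser (not shown here).
    This acceptance rule is an illustrative assumption, not the thesis's model.
    """
    pairs = list(pairs)
    attr_to_sent = defaultdict(set)
    sent_to_attr = defaultdict(set)
    for attr, sent in pairs:
        attr_to_sent[attr].add(sent)
        sent_to_attr[sent].add(attr)

    attributes = set(seed_attributes)
    sentiments = set(seed_sentiments)
    changed = True
    while changed:
        changed = False
        # Accept a candidate attribute linked to enough known sentiment words.
        for attr, sents in attr_to_sent.items():
            if attr not in attributes and len(sents & sentiments) >= min_links:
                attributes.add(attr)
                changed = True
        # Symmetric step: accept a sentiment candidate linked to known attributes.
        for sent, attrs in sent_to_attr.items():
            if sent not in sentiments and len(attrs & attributes) >= min_links:
                sentiments.add(sent)
                changed = True

    # Matched (attribute, sentiment) pairs, as would be written out as 2-tuples.
    matched = sorted((a, s) for a, s in set(pairs) if a in attributes and s in sentiments)
    return attributes, sentiments, matched


if __name__ == "__main__":
    sample_pairs = [("屏幕", "清晰"), ("电池", "耐用"), ("屏幕", "耐用"), ("价格", "便宜")]
    print(mutual_bootstrap(sample_pairs, {"屏幕"}, set()))
```

Starting from the single seed attribute 屏幕 (screen), the loop picks up 清晰 (clear) and 耐用 (durable) as sentiment words and then 电池 (battery) as an attribute; 价格 (price) and 便宜 (cheap) stay out because neither is linked to an accepted word.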
     Through this work, the thesis implements opinion mining on Web text, that is, the extraction of attribute words and sentiment words together with positive/negative analysis of the opinions, and builds the associated resources. Finally, it explores how the approach can be extended across domains; the results indicate, to some extent, that the method is feasible.
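The positive/negative analysis rests on scoring each sentiment word with the help of negation and degree word lists. The abstract gives no formula, so the following is only a minimal sketch under stated assumptions: the lexicon entries, the weights, the two-token look-back window, and the names `score_sentiment_word` and `polarity` are all hypothetical, and transition words are not handled.

```python
# Hypothetical lexicons: every entry and weight here is an illustrative assumption.
SENTIMENT = {"好": 1.0, "清晰": 1.0, "差": -1.0, "模糊": -1.0}
NEGATIONS = {"不", "没有"}
DEGREES = {"很": 1.5, "非常": 2.0, "有点": 0.5}


def score_sentiment_word(tokens, index, window=2):
    """Score the sentiment word at tokens[index] from up to `window` preceding tokens."""
    score = SENTIMENT[tokens[index]]
    for token in tokens[max(0, index - window):index]:
        if token in NEGATIONS:
            score = -score           # each negation word flips the sign
        elif token in DEGREES:
            score *= DEGREES[token]  # degree words scale the magnitude
    return score


def polarity(score):
    """Map a numeric score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in (["屏幕", "很", "清晰"], ["电池", "不", "好"]):
        s = score_sentiment_word(sentence, 2)
        print(sentence, s, polarity(s))
```

With these assumed weights, 很 清晰 ("very clear") scores 1.5 and is labelled positive, while 不 好 ("not good") scores -1.0 and is labelled negative. A real model would also need to decide how transition words such as 但是 ("but") shift weight between clauses.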
