Research on Cross-Domain Adaptive Sentiment Classification Methods
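Abstract
With the rapid growth and spread of the Internet, more and more subjective opinions are appearing online. Traditional topic-based text classification methods can no longer meet the need to analyze and mine these subjective texts. As a result, researchers have turned their attention to sentiment classification of such texts.
     Sentiment classification is a domain-dependent problem: a classification model trained in one domain is usually hard to apply to another. Training a separate model for every domain requires large amounts of labeled data, and obtaining labeled data takes a great deal of time and effort, so the cost is very high. Research on domain-adaptive sentiment classification therefore has significant practical value.
     The main work and contributions of this thesis on domain-adaptive sentiment classification are as follows:
     (1) To address the differences in the statistical distribution of features across domains, a new cross-domain feature selection method that incorporates feature similarity computation is proposed. It selects sentiment features that have similar statistical distributions in both domains, which improves classification performance.
     (2) A cross-domain sentiment classification method based on centroid transfer is proposed. The method uses labeled source-domain texts to classify a large number of unlabeled target-domain texts, adds a subset of high-confidence texts to the training set, and at the same time removes source-domain texts that lie far from the centroid of the target-domain test set. Through iteration it gradually shrinks the centroid distance between the two domains and reduces the cross-domain difference. Experiments show that the method significantly improves classification performance.
     (3) Because texts within one domain may have different features while texts from different domains may share certain similar features, this thesis proposes clustering the texts of both domains and classifying the test texts within each cluster separately. This method likewise reduces the differences between domains and improves classification performance.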
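The idea behind contribution (1) can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so both scores below are assumptions chosen for illustration: cross-domain similarity is taken as the ratio of the smaller to the larger document frequency, and sentiment strength as the gap between positive and negative document frequencies in the source domain. The function names and toy data are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Fraction of documents in which each token appears."""
    counts = Counter(tok for doc in docs for tok in set(doc))
    return {tok: c / len(docs) for tok, c in counts.items()}


def select_features(src_docs, src_labels, tgt_docs, k):
    """Rank source features by sentiment strength times cross-domain similarity.

    Both scores are illustrative assumptions, not the thesis's formulas.
    Labels: 1 = positive, 0 = negative.
    """
    pos = [d for d, y in zip(src_docs, src_labels) if y == 1]
    neg = [d for d, y in zip(src_docs, src_labels) if y == 0]
    df_pos, df_neg = doc_freq(pos), doc_freq(neg)
    df_src, df_tgt = doc_freq(src_docs), doc_freq(tgt_docs)

    scores = {}
    for tok, p_src in df_src.items():
        p_tgt = df_tgt.get(tok, 0.0)
        # 1.0 when the token is equally frequent in both domains,
        # 0.0 when it never occurs in the target domain.
        similarity = min(p_src, p_tgt) / max(p_src, p_tgt)
        # How unevenly the token is spread over the two sentiment classes.
        polarity = abs(df_pos.get(tok, 0.0) - df_neg.get(tok, 0.0))
        scores[tok] = polarity * similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data: movie reviews as the source domain, gadget reviews as the target.
src_docs = [["great", "plot"], ["great", "acting"],
            ["boring", "plot"], ["boring", "acting"]]
src_labels = [1, 1, 0, 0]
tgt_docs = [["great", "battery"], ["boring", "screen"], ["great", "screen"]]

print(select_features(src_docs, src_labels, tgt_docs, k=2))
# -> ['great', 'boring']
```

Domain-specific words such as "plot" and "acting" score zero here because they never occur in the target domain, while the shared sentiment words are kept.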
With the rapid development and popularization of the Internet, more and more subjective remarks are available online. Traditional topic-based text classification methods cannot meet the need to analyze these remarks and identify their semantic orientation. Sentiment classification has therefore attracted growing attention from researchers.
     Sentiment classification is a highly domain-specific problem: classifiers trained in one domain usually perform poorly in others. Training a classification model for every domain would require a large annotated corpus. Because labeling data is time-consuming and expensive, domain adaptation approaches to sentiment classification are valuable for handling cross-domain classification problems.
     In this study, we focus on domain adaptation for sentiment classification. Our main work and contributions include:
     (1) To reduce the difference in the statistical distribution of features between domains, we propose a novel feature selection approach that incorporates feature similarity. This approach selects sentiment features that have similar statistical distributions in the two domains, which improves classification performance.
     (2) We propose a novel domain adaptation approach for sentiment classification based on centroid transfer. The approach uses labeled documents in the source domain to label the target domain's unlabeled documents and adds a subset of the most confidently labeled documents to the training set. At the same time, it removes source-domain documents that are far from the centroid of the target test set. Iterating these steps gradually narrows the centroid distance between the two domains and reduces the differences between them. The experimental results indicate that the proposed approach significantly improves the performance of cross-domain sentiment analysis.
     (3) Based on the observation that documents in the same domain may have different features, while documents in different domains may share certain similar features, we propose a new classification approach. Specifically, the documents of the two domains are first clustered together, and classification is then performed separately on the test documents in each cluster. This approach also reduces the differences between the domains and thus improves the classification results.
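The iteration described in contribution (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: a nearest-class-centroid classifier stands in for whatever base classifier the thesis uses, confidence is taken as the margin between the two closest class centroids, and the parameters `rounds`, `add_n`, and `drop_n` as well as the toy data are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(train, x):
    """Nearest-class-centroid prediction; returns (label, confidence margin)."""
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    d = sorted((dist(centroid(vs), x), label) for label, vs in by_label.items())
    margin = d[1][0] - d[0][0] if len(d) > 1 else 0.0
    return d[0][1], margin


def centroid_transfer(source, target, rounds=3, add_n=2, drop_n=1):
    """source: list of (vector, label); target: list of unlabeled vectors."""
    src = list(source)        # labeled source documents still in use
    pseudo = []               # target documents with pseudo-labels
    unlabeled = list(target)
    tgt_centroid = centroid(target)
    for _ in range(rounds):
        if not unlabeled:
            break
        train = src + pseudo
        # Pseudo-label the remaining target documents, most confident first.
        scored = sorted(((x, *predict(train, x)) for x in unlabeled),
                        key=lambda item: -item[2])
        pseudo += [(x, label) for x, label, _ in scored[:add_n]]
        unlabeled = [x for x, _, _ in scored[add_n:]]
        # Drop the source documents farthest from the target centroid.
        src.sort(key=lambda item: dist(item[0], tgt_centroid))
        if len(src) > drop_n:
            src = src[:len(src) - drop_n]
    train = src + pseudo
    return [predict(train, x)[0] for x in target]


# Toy 2-D data: the target domain is the source domain shifted by (0.5, 0.5).
source = [((1, 1), 1), ((2, 1), 1), ((1, 2), 1),
          ((-1, -1), 0), ((-2, -1), 0), ((-1, -2), 0)]
target = [(1.5, 1.5), (2.5, 1.5), (1.5, 2.5),
          (-0.5, -0.5), (-1.5, -0.5), (-0.5, -1.5)]

print(centroid_transfer(source, target))
```

Each round moves the training set toward the target domain: confident target documents enter with pseudo-labels while the least target-like source documents leave, so the class centroids drift toward the target distribution.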
