Research on an Active-Learning-Based Method for Automatic Corpus Annotation
Abstract
Opinion mining aims to automatically extract useful opinion information and knowledge from subjective texts. Research on Chinese opinion mining requires the support of an annotated corpus of Chinese opinionated subjective texts.
     Because such a corpus carries a large amount of information, including word segmentation, part-of-speech tags, dependency relations, semantics, word concepts, and opinions, the finished annotation is usually complex. To lighten the annotators' workload, improve the efficiency and accuracy of annotation, and reduce annotation errors, an automatic annotation tool is needed to assist annotators in their work.
     This paper implements an active-learning-based annotation tool for Chinese opinion elements that automatically identifies the topic, sentiment, and opinion holder in a sentence. Compared with classical learning algorithms, active learning requires fewer training examples, is less affected by imbalanced training data, and achieves better classification performance. Experiments demonstrate the effectiveness of active learning for opinion element identification. The paper also proposes a formula that measures overall system performance by combining three factors: the F-measure of the active learning classifier, the training time, and the number of training examples.
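The abstract does not say which query strategy or classifier the tool uses, so the following is only a minimal sketch of one common form of active learning: pool-based uncertainty sampling. It assumes a small logistic-regression model as a stand-in classifier, a hypothetical two-dimensional toy pool in place of real opinion-element features, and an `ask_annotator` callback that plays the role of the human annotator. All function names are illustrative and are not taken from the paper. The sketch also computes the F-measure, one of the three quantities the abstract says its evaluation formula combines; the formula itself is not reproduced here because the abstract does not state it.

```python
import math


def predict_proba(w, b, x):
    """Probability that feature vector x belongs to the positive class."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression stand-in classifier by stochastic gradient descent."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict_proba(w, b, x) - y
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b


def f_measure(gold, pred):
    """Balanced F-measure (F1) for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def active_learning(pool, ask_annotator, seed_indices, rounds):
    """Pool-based active learning with uncertainty sampling.

    Each round retrains on the labeled examples, then asks the annotator
    to label the pool item whose predicted probability is closest to 0.5.
    """
    labels = {i: ask_annotator(i) for i in seed_indices}
    unlabeled = [i for i in range(len(pool)) if i not in labels]
    for _ in range(rounds):
        idx = list(labels)
        w, b = train([pool[i] for i in idx], [labels[i] for i in idx])
        query = min(unlabeled, key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))
        labels[query] = ask_annotator(query)
        unlabeled.remove(query)
    idx = list(labels)
    w, b = train([pool[i] for i in idx], [labels[i] for i in idx])
    return w, b, labels


def make_toy_pool():
    """Hypothetical 2-D data: label 1 if the coordinates sum to a positive value."""
    cells = [(i, j) for i in range(-5, 6) for j in range(-5, 6) if i + j != 0]
    pool = [(i / 5.0, j / 5.0) for i, j in cells]
    gold = [1 if i + j > 0 else 0 for i, j in cells]
    return pool, gold


if __name__ == "__main__":
    pool, gold = make_toy_pool()
    seeds = [gold.index(0), gold.index(1)]  # one seed example per class
    w, b, labels = active_learning(pool, lambda i: gold[i], seeds, rounds=10)
    pred = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in pool]
    print("labeled examples:", len(labels))
    print("F-measure on the pool:", round(f_measure(gold, pred), 3))
```

In this sketch the annotator is consulted only for the examples the current model is least sure about, which is the mechanism behind the abstract's claim that active learning needs fewer training examples. A real tool would replace the toy pool with feature vectors derived from the annotated corpus and the lambda with an interactive labeling step.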
