Novel method of using statistical information to construct feature representation in sentiment classification
  • Title (EN): Novel method of using statistical information to construct feature representation in sentiment classification
  • Authors: Han Tonghui; Yang Dongqiang; Ma Hongwei
  • Affiliation: School of Computer Science & Technology, Shandong Jianzhu University
  • Keywords: data representation; statistical features; sentiment classification
  • Journal: Application Research of Computers (计算机应用研究); journal code: JSYJ
  • Online date: 2018-04-12
  • Year/Volume/Issue: 2019, Vol. 36, No. 333, Issue 07
  • Pages: 173-178 (6 pages)
  • CN: 51-1196/TP
  • Funding: National Social Science Fund of China project (17BYY19); Ministry of Education Humanities and Social Sciences Fund project (15YJA740054)
  • Language: Chinese
  • Record ID: JSYJ201907038
Abstract
        Data representation is closely related to the performance of text classification. Three representations are common: lexicon-based co-occurrence frequency, latent semantic space built with latent semantic analysis or singular value decomposition (LSA/SVD), and neural network language models. This paper introduces a method that constructs the feature space for text classification from word-level statistical information alone. The method first collects seven common statistical features per word, then selects relatively independent ones through correlation analysis to construct the word feature space. This effectively reduces the dimensionality of the text vector space and lowers the computational complexity of deriving the latent semantic space. Sentiment classification experiments show that, compared with existing word representation methods, the proposed method significantly improves the accuracy and recall of different classifiers.
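The abstract's pipeline — gather per-word statistical features, then keep only relatively independent ones via correlation analysis — can be sketched as follows. This is a minimal illustration assuming Pearson correlation and a greedy keep-first filter; the paper's actual seven statistics and its exact correlation criterion are not listed in this record, so the feature columns below are placeholders.

```python
import numpy as np

def select_independent_features(X, threshold=0.9):
    """Greedy correlation filter: keep a feature column only if its
    absolute Pearson correlation with every already-kept column
    stays below `threshold`."""
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
    kept = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy data: rows are words, columns are word-level statistical features.
# The last three columns are scaled copies of the first three plus small
# noise, standing in for redundant statistics that the filter should drop.
rng = np.random.default_rng(0)
base = rng.random((200, 4))
X = np.hstack([base, 2.0 * base[:, :3] + 0.01 * rng.random((200, 3))])

kept = select_independent_features(X)
print(kept)  # the three redundant columns are filtered out
```

The surviving columns would then form each word's feature vector for a downstream sentiment classifier; lowering the threshold prunes more aggressively, trading dimensionality against retained information.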
