Research on Scale Adaptation of Text Sentiment Analysis Algorithm in Big Data Environment: Using Twitter as Data Source

Original title: 大数据环境下文本情感分析算法的规模适配研究:以Twitter为数据源
Authors: Yu Chuanming (余传明); Yuan Sai (原赛); Wang Feng (王峰); An Lu (安璐)
Keywords: scale adaptation; big data; massive text; sentiment analysis; machine learning algorithm
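Journal: Library and Information Service (图书情报工作; journal code TSQB)
Institutions: School of Information and Safety Engineering, Zhongnan University of Economics and Law; School of Statistics and Mathematics, Zhongnan University of Economics and Law; School of Information Management, Wuhan University
Publication date: 2019-02-20 10:20
Publisher: Library and Information Service (图书情报工作)
Year: 2019
Issue: v.63; No.617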
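Funding: One of the research outputs of the National Natural Science Foundation of China General Program project "Research on Opinion Retrieval Based on Domain Knowledge Acquisition and Alignment in the Big Data Environment" (No. 71373286) and the Ministry of Education Major Project of Philosophy and Social Sciences Research "Research on Countermeasures for Improving Counter-Terrorism Intelligence and Information Work Capability" (No. 17JZD034)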
Language: Chinese
File ID: TSQB201904025
Page count: 11
Issue (in year): 04
CN: 11-1541/G2
Pages: 102-112

Abstract

[Purpose/significance] This paper studies the scale adaptation problem for one specific task, text sentiment analysis in a big data environment, and offers a reference for information science researchers who need to balance efficiency against cost when analyzing data at that scale. [Method/process] Using Stanford University's Sentiment140 dataset and building on an analysis of traditional sentiment analysis algorithms, we propose five text sentiment analysis algorithms for big data, test how well each adapts to different environments and data sizes, and compare them empirically in terms of accuracy, scalability, and efficiency. [Result/conclusion] The experiments show that the cluster built for this study runs efficiently, produces correct results, and scales well. The Spark cluster is more efficient at processing massive text sentiment analysis data, and its advantage grows with the data size. Regarding resource utilization, the overall running efficiency of the cluster changes significantly as the number of nodes and cores increases. A configuration of five slave nodes, each with 4 cores and 4 GB of memory, completes the classification task efficiently while saving resource costs.
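The efficiency-versus-cost trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or their evaluation procedure. The names `Run`, `speedup`, `efficiency`, and `fastest_efficient`, the 0.75 threshold, and every timing below are hypothetical, chosen only to show how speedup and parallel efficiency might be compared across cluster sizes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One timed run of the same classification job (hypothetical values)."""
    nodes: int           # number of slave nodes
    cores_per_node: int  # cores per slave node
    seconds: float       # wall-clock time of the job


def speedup(baseline: Run, run: Run) -> float:
    """How many times faster `run` finished than the baseline."""
    return baseline.seconds / run.seconds


def efficiency(baseline: Run, run: Run) -> float:
    """Speedup divided by the growth in total cores (1.0 means perfect scaling)."""
    base_cores = baseline.nodes * baseline.cores_per_node
    run_cores = run.nodes * run.cores_per_node
    return speedup(baseline, run) / (run_cores / base_cores)


def fastest_efficient(runs: List[Run], baseline: Run, min_efficiency: float) -> Run:
    """Fastest run whose parallel efficiency is at least `min_efficiency`.

    Raises ValueError if no run meets the threshold.
    """
    candidates = [r for r in runs if efficiency(baseline, r) >= min_efficiency]
    return min(candidates, key=lambda r: r.seconds)


# Invented timings for illustration only; these are NOT results from the paper.
HYPOTHETICAL_RUNS = [
    Run(nodes=1, cores_per_node=4, seconds=1000.0),
    Run(nodes=3, cores_per_node=4, seconds=400.0),
    Run(nodes=5, cores_per_node=4, seconds=250.0),
    Run(nodes=7, cores_per_node=4, seconds=230.0),
]

best = fastest_efficient(HYPOTHETICAL_RUNS, HYPOTHETICAL_RUNS[0], min_efficiency=0.75)
print(best.nodes, "nodes,", best.seconds, "seconds")
```

With these invented numbers, adding nodes beyond five still shortens the run slightly but drops efficiency below the threshold, which is the kind of diminishing return a cost-aware configuration choice has to weigh.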
