Review Spam Detection Based on a Two-Level Stacking Classification Model

English title: Review spam detection based on the two-level stacking classification model
Authors: 廖祥文; 徐阳; 魏晶晶; 杨定达; 陈国龙
Authors (English): LIAO Xiang-wen; XU Yang; WEI Jing-jing; YANG Ding-da; CHEN Guo-long; College of Mathematics and Computer Science, Fuzhou University; Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University; Digital Fujian Institute of Financial Big Data; College of Electronics and Information Science, Fujian Jiangxia University
Keywords: spammer detection; feature fusion; ensemble learning; principal component analysis
Keywords (English): review detection; feature fusion; ensemble learning; principal component analysis
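Journal (Chinese abbreviation): SDDX
Journal (English): Journal of Shandong University (Natural Science)
Affiliations: College of Mathematics and Computer Science, Fuzhou University; Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University; Digital Fujian Institute of Financial Big Data; College of Electronics and Information Science, Fujian Jiangxia University
Publication date: 2019-05-20 16:50
Publisher: Journal of Shandong University (Natural Science)
Year: 2019
Issue: v.54
Funding: National Natural Science Foundation of China (61772135, U1605251); Natural Science Foundation of Fujian Province (2017J01755); Open Fund of the CAS Key Laboratory of Network Data Science and Technology (CASNDST201708, CASNDST201606); Director's Fund of the Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education (2017KF01)
Language: Chinese
Pages: SDDX201907008
Page count: 11
CN: 07
ISSN: 37-1389/N
Classification number: 61-71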

Abstract

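For review spam detection, existing methods incur excessive complexity when extracting user behavior relationships and when extracting features with neural networks. In addition, online reviews are short texts, and their non-standard writing makes text features hard to extract during training; existing methods also give insufficient consideration to imbalanced dataset distributions. To address these problems, a review spam detection method based on a two-level stacking classification model is proposed. First, a matrix built from triplets represents the relationships between users, and principal component analysis yields a low-dimensional representation of these relationships, which characterizes differences in user behavior in the review data and reduces computational complexity. Next, the paragraph-vector representation of each review and computed discrete features (including text similarity and information entropy) address the difficulty of extracting text features. Finally, the three are concatenated into an overall feature representation that fuses text and behavioral features. A two-level stacking classification model built with ensemble learning then classifies the reviews, improving detection performance on imbalanced datasets. Experiments on the Yelp2013 review dataset show that, compared with the best current baseline method, the F_1 value improves by 1.7% to 5.2%, with especially pronounced gains on imbalanced datasets.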
For review spam detection, existing methods have high time and space complexity when extracting user behavior relationships and training neural networks. Moreover, the non-standard writing of e-commerce reviews makes contextual features indistinct, and most experiments do not consider the effect of data imbalance. Therefore, we propose a method for review spam detection based on a two-level stacking classification model. In the method, the relationship between users and products is represented by triplets. To characterize user behavior and reduce complexity, low-dimensional feature representations are obtained by principal component analysis. Then, paragraph vector representations are extracted, and information entropy and text similarity are computed as discrete features to compensate for indistinct contextual features. Finally, the three are concatenated into overall features combining text and behavioral features. These features serve as the input of the two-level stacking classification model in order to improve performance on unbalanced datasets. We conducted experiments on the Yelp 2013 dataset. Experimental results show that the F_1 value of the proposed method is 1.7% to 5.2% better than that of the state-of-the-art method. Moreover, classification performance improves significantly on the unbalanced dataset.
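The abstract does not spell out the classifier, so the following is only a minimal sketch of what a two-level stacking scheme generally looks like, not the authors' implementation. It assumes two stand-in base learners (a nearest-centroid scorer and a logistic regression) and a logistic-regression meta-learner. All class and function names, the toy two-dimensional features (standing in for the fused text and behavior representation), and the 80/20 class split are hypothetical. Level 1 produces out-of-fold scores so that the level-2 learner never sees a score computed on a sample the base learner was trained on.

```python
import math
import random


class CentroidClassifier:
    """Hypothetical base learner: scores a sample by relative distance to class centroids."""

    def fit(self, X, y):
        self.centroids = []
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            if not rows:
                raise ValueError("each training fold needs both classes")
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, X):
        scores = []
        for x in X:
            d0, d1 = (math.dist(x, c) for c in self.centroids)
            # Far from centroid 0 and close to centroid 1 gives a score near 1.
            scores.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return scores


class LogisticRegression:
    """Plain SGD logistic regression, used as a base learner and as the meta-learner."""

    def __init__(self, lr=0.1, epochs=200):
        self.lr, self.epochs = lr, epochs

    def _prob(self, x):
        z = sum(w * v for w, v in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clamp to avoid overflow

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            for x, t in zip(X, y):
                err = self._prob(x) - t
                self.w = [w - self.lr * err * v for w, v in zip(self.w, x)]
                self.b -= self.lr * err
        return self

    def predict_proba(self, X):
        return [self._prob(x) for x in X]


def make_base_learners():
    return [CentroidClassifier(), LogisticRegression()]


def fit_stacking(X, y, k=5):
    """Level 1: out-of-fold scores per base learner. Level 2: meta-learner on those scores."""
    n = len(X)
    meta_X = [[0.0] * len(make_base_learners()) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        for j, learner in enumerate(make_base_learners()):
            learner.fit([X[i] for i in train], [y[i] for i in train])
            for i, p in zip(held_out, learner.predict_proba([X[i] for i in held_out])):
                meta_X[i][j] = p
    base = [learner.fit(X, y) for learner in make_base_learners()]  # refit on all data
    meta = LogisticRegression().fit(meta_X, y)
    return base, meta


def predict_stacking(base, meta, X):
    level1 = list(zip(*(learner.predict_proba(X) for learner in base)))
    return [1 if p >= 0.5 else 0 for p in meta.predict_proba(level1)]


# Toy imbalanced data: 80 "genuine" samples near (0, 0), 20 "spam" samples near (4, 4).
random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(80)]
data += [([random.gauss(4, 1), random.gauss(4, 1)], 1) for _ in range(20)]
random.shuffle(data)
X, y = [x for x, _ in data], [t for _, t in data]

base, meta = fit_stacking(X, y)
predictions = predict_stacking(base, meta, X)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

In a real pipeline the toy features would be replaced by the concatenated representation the abstract describes (PCA-reduced user relations, paragraph vectors, and discrete features), and evaluation would use held-out data and F_1 rather than training accuracy.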

References

[1] OTT M, CHOI Y, CARDIE C, et al. Finding deceptive opinion spam by any stretch of the imagination[C]// Proceedings of the Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACM, 2011: 309-319.
[2] KIM S, CHANG H, LEE S, et al. Deep semantic frame-based deceptive opinion spam analysis[C]// Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. New York: ACM, 2015: 1131-1140.
[3] KO M C, CHEN H H. Analysis of cyber army's behaviours on web forum for elect campaign[C]// Proceedings of the Asia Information Retrieval Symposium. Switzerland: Springer, Cham, 2015: 394-399.
[4] LI Huayi, FEI Geli, SHAO Weixiang, et al. Bimodal distribution and co-bursting in review spam detection[C]// Proceedings of the International Conference on World Wide Web. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2017: 1063-1072.
[5] REN Yafeng, ZHANG Yue. Deceptive opinion spam detection using neural network[C]// Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka: The COLING 2016 Organizing Committee, 2016: 140-150.
[6] WANG Xuepeng, LIU Kang, ZHAO Jun. Handling cold-start problem in review spam detection by jointly embedding texts and behaviors[C]// Proceedings of the Meeting of the Association for Computational Linguistics. Vancouver: ACM, 2017: 366-376.
[7] KIM Y. Convolutional neural networks for sentence classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha: EMNLP, 2014: 1746-1751.
[8] SANTOSH K C, MAITY S K, MUKHERJEE A. ENWalk: learning network features for spam detection in twitter[C]// Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Switzerland: Springer, Cham, 2017: 90-101.
[9] RAYANA S, AKOGLU L. Collective opinion spam detection: bridging review networks and metadata[C]// Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2015: 985-994.
[10] WANG Xuepeng, LIU Kang, HE Shizhu, et al. Learning to represent review with tensor decomposition for spam detection[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. Austin: EMNLP, 2016: 866-875.
[11] WANG Yalin, SUN Kenan, YUAN Xiaofeng, et al. A novel sliding window PCA-IPF based steady-state detection framework and its industrial application[J]. IEEE Access, 2018, 6: 20995-21004.
[12] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the International Conference on Machine Learning. Beijing: JMLR, 2014: 1188-1196.
[13] CHEN Yijun, WONG Manleung. Optimizing stacking ensemble by an ant colony optimization approach[C]// Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation. New York: ACM, 2011: 7-8.
[14] SANTOSH K C, MUKHERJEE A. On the temporal dynamics of opinion spamming: case studies on yelp[C]// Proceedings of the 25th International Conference on World Wide Web. Republic and Canton of Geneva, Switzerland: WWW, 2016: 369-379.
[15] MUKHERJEE A, VENKATARAMAN V, LIU B, et al. What yelp fake review filter might be doing[C]// Proceedings of the International AAAI Conference on Web and Social Media. Menlo Park: AAAI, 2013: 409-418.
[16] HAI Zeng, ZHAO Peilin, CHENG Peng, et al. Deceptive review spam detection via exploiting task relatedness and unlabeled data[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. Austin: EMNLP, 2016: 1817-1826.
[17] FAKHRAEI S, SHASHANKA M. Collective spammer detection in evolving multi-relational social networks[C]// Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2015: 1769-1778.
(1) http://liu.cs.uic.edu/download/yelp_filter/
