CNNs Text E-mail Classification Model Based on Skip-gram

English title: CNNs-Highway Text Message Classification Model Based on Skip-gram
Authors: HUANG He; JING Xiao-yuan; DONG Xi-wei; WU Fei
Affiliations: School of Computer, Nanjing University of Posts and Telecommunications; School of Automation, Nanjing University of Posts and Telecommunications
Keywords: natural language processing; word embedding; mail classification; convolutional neural network; deep learning
Journal (Chinese abbreviation): WJFZ
Journal (English name): Computer Technology and Development
Publication date: 2019-03-06 10:25
Publisher: 计算机技术与发展 (Computer Technology and Development)
Year: 2019
Volume/issue: v.29; No.266
Funding: National Natural Science Foundation of China (61702280)
Language: Chinese
Record ID: WJFZ201906030
Page count: 5
Issue number: 06
CN: 61-1450/TP
Pages: 149-153

Abstract

With the development of Internet advertising technology and the widespread use of e-mail, spam advertisements increasingly flood daily life, and how to filter spam efficiently has become an active research topic. Natural language has strong sequential dependencies, and converting Chinese e-mail text directly into vectors yields excessively high dimensionality, which degrades classification accuracy. To address this, the e-mail text is first segmented into words, and a skip-gram model is trained to obtain a word embedding for each word in the data set; the word embeddings convert the e-mail text into low-dimensional feature vectors. The word embeddings of each e-mail are then combined into a two-dimensional feature matrix that serves as the network input, and in every training iteration these input features are updated as parameters as well. Finally, the feature matrix is fed into the proposed CNN-HIGHWAY hybrid model for classification. The hybrid model is evaluated on the CCERT Chinese e-mail sample set and compared with traditional machine learning methods and a standard convolutional neural network model; the results show that it not only resolves the high-dimensionality problem but also improves the accuracy of e-mail classification.
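
The pipeline summarized in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no layer sizes, filter widths, or training details, so every name and value here (`skipgram_pairs`, `conv_max_pool`, `highway`, `spam_probability`, `init_params`, the dimensions, the random weights) is a hypothetical choice made for illustration. The highway layer uses the commonly cited form `t * H(x) + (1 - t) * x`, which is assumed rather than confirmed by the record. Only the forward pass is shown; training, in which the embedding table would also be updated, is omitted.

```python
import math
import random

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs of the kind a skip-gram model is trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conv_max_pool(matrix, filters, biases):
    """Slide each filter over windows of consecutive word vectors,
    apply ReLU, then keep the maximum response per filter.
    Assumes the text has at least as many words as the filter width."""
    pooled = []
    for filt, bias in zip(filters, biases):
        width = len(filt)
        scores = []
        for start in range(len(matrix) - width + 1):
            window = matrix[start:start + width]
            z = bias + sum(dot(row, frow) for row, frow in zip(window, filt))
            scores.append(max(z, 0.0))
        pooled.append(max(scores))
    return pooled

def highway(x, w_h, b_h, w_t, b_t):
    """Assumed highway form: the gate t mixes a transformed value with the raw input."""
    out = []
    for i, x_i in enumerate(x):
        h = max(dot(w_h[i], x) + b_h[i], 0.0)   # transform branch (ReLU)
        t = sigmoid(dot(w_t[i], x) + b_t[i])    # transform gate in (0, 1)
        out.append(t * h + (1.0 - t) * x_i)     # carry the input with weight 1 - t
    return out

def spam_probability(token_ids, p):
    # One embedding row per word gives the two-dimensional input matrix.
    matrix = [p["table"][i] for i in token_ids]
    feats = conv_max_pool(matrix, p["filters"], p["conv_bias"])
    feats = highway(feats, p["w_h"], p["b_h"], p["w_t"], p["b_t"])
    return sigmoid(dot(p["w_out"], feats) + p["b_out"])

def init_params(vocab=20, dim=4, n_filters=3, width=2, seed=0):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "table": mat(vocab, dim),  # would be trained by skip-gram, then fine-tuned
        "filters": [mat(width, dim) for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "w_h": mat(n_filters, n_filters), "b_h": [0.0] * n_filters,
        "w_t": mat(n_filters, n_filters), "b_t": [-1.0] * n_filters,
        "w_out": mat(1, n_filters)[0], "b_out": 0.0,
    }
```

Calling `spam_probability([1, 2, 3, 4], init_params())` returns a value between 0 and 1 for a four-word message of hypothetical token ids. With untrained weights that number carries no meaning; the sketch only shows how the embedding matrix, convolution, pooling, and highway gate fit together.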
