A Feature Extraction Method for Terrorism-Related Short Texts Based on a Sparse Auto-Encoder

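English title: Feature Extraction and Clustering of Terrorism Short Text Based on Sparse Auto-Encoder
Authors: Huang Wei (黄炜); Huang Jianqiao (黄建桥); Li Yuefeng (李岳峰)
Keywords: terrorism-related text; sparse auto-encoder; feature extraction; LDA topic clustering
Journal (Chinese abbreviation): QBZZ
Journal (English): Journal of Intelligence
Institution: School of Economics and Management, Hubei University of Technology
Publication date: 2018-12-14 14:25
Publisher: 情报杂志 (Journal of Intelligence)
Year: 2019
Issue: v.38
Funding: One of the research outputs of the National Natural Science Foundation of China projects "Research on Multi-Kernel Methods for Real-Time Active Sensing of Online Public Opinion Events in the Microblog Environment" (No. 71303075) and "Research on Unsupervised Text Classification Methods Based on Feature Ontology Learning in the Big Data Environment" (No. 71571064)
Language: Chinese
Page: QBZZ201903031
Page count: 6
CN: 03
ISSN: 61-1167/G3
Classification number: 190+207-211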

Abstract

[Purpose/Significance] The sparse auto-encoder is a relatively efficient text feature extraction method in deep learning. It helps address the high dimensionality and sparsity that make large-scale terrorism-related short texts difficult to process. [Method/Process] The texts are first reduced in dimension by a sparse auto-encoder trained without supervision, which extracts the latent features of the data. The LDA topic clustering algorithm is then applied to cluster the texts, and the effectiveness and efficiency of the method are verified by comparing its experimental results with those of traditional feature extraction algorithms. [Result/Conclusion] The experimental results show that using the text features extracted by the sparse auto-encoder for LDA topic clustering effectively addresses the high dimensionality, sparsity, and noise of terrorism-related short texts, and significantly improves the accuracy of the clustering results.
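The feature extraction step described in the abstract can be illustrated with a minimal NumPy sketch of a single-hidden-layer sparse auto-encoder with a KL-divergence sparsity penalty. This record does not give the paper's architecture, hyperparameters, or preprocessing, so everything below is an assumption made for illustration: the toy term-occurrence matrix `X`, the hidden size, the sparsity target `rho`, the penalty weight `beta`, and the learning rate are all hypothetical. The subsequent LDA topic clustering step is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden=4, rho=0.05, beta=3.0, lam=1e-4,
                             lr=0.3, epochs=300, seed=0):
    """Full-batch gradient descent on reconstruction error
    + beta * KL sparsity penalty + L2 weight decay (all values assumed)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: encode to hidden activations, decode back to inputs.
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # Mean activation of each hidden unit over the batch.
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        recon = 0.5 * np.sum((Y - X) ** 2) / n
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
        losses.append(recon + beta * kl + decay)
        # Backward pass: output delta, then hidden delta with sparsity term.
        dY = (Y - X) * Y * (1 - Y) / n
        sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        dH = (dY @ W2.T + sparse_grad) * H * (1 - H)
        W2 -= lr * (H.T @ dY + lam * W2)
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (X.T @ dH + lam * W1)
        b1 -= lr * dH.sum(axis=0)
    return (W1, b1), losses

def encode(X, W1, b1):
    """Map documents to their low-dimensional hidden features."""
    return sigmoid(X @ W1 + b1)

# Hypothetical toy data: 6 short texts over an 8-term vocabulary
# (1 = term occurs in the text). Real short-text matrices are far
# larger and sparser.
X = np.array([
    [1, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 0],
], dtype=float)

(W1, b1), losses = train_sparse_autoencoder(X)
features = encode(X, W1, b1)  # 6 texts x 4 latent features
```

The rows of `features` are the compressed representations that a downstream clustering step would consume in place of the original high-dimensional term vectors.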

References

[1] 孙晓,彭晓琪,胡敏,等.基于多维扩展特征与深度学习的微博短文本情感分析[J].电子与信息学报,2017,39(9):2048-2055.
    [2] 杜永萍,陈守钦,赵晓静.基于特征扩展与深度学习的短文本情感判定方法[J].计算机科学,2017,44(10):283-288.
    [3] 张绮绮,张树群,雷兆宜.基于改进的卷积神经网络的中文情感分类[J].计算机工程与应用,2017,53(22):111-115.
    [4] 梁军,柴玉梅,原慧斌,等.基于深度学习的微博情感分析[J].中文信息学报,2014,28(5):155-161.
    [5] Wang W, Jiang Y, Wang D, et al. Through wall human detection under small samples based on deep learning algorithm[J]. Pattern Recognition,2017,72.
    [6] 孙紫阳,顾君忠,杨静.基于深度学习的中文实体关系抽取方法[J].计算机工程,2017:1-8.
    [7] 柳长源,毕晓君,韦琦.基于向量机学习算法的多模式分类器的研究及改进[J].电机与控制学报,2013,17(1):114-118.
    [8] 李东洁,李君祥,张越,等.基于POS改进的BP神经网络数据手套手势识别[J].电机与控制学报,2014,18(8):87-93.
    [9] 仲伟峰,马丽霞,何小溪.PCA和改进BP神经网络的大米外观品质识别[J].哈尔滨理工大学学报,2015,20(4):76-81.
    [10] 刘铭,昝红英,原慧斌.基于SVM与RNN的文本情感关键句判定与抽取[J].山东大学学报(理学版),2014,49(11):68-73.
    [11] Bengio Y,Lamblin P,Popovici D,et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems,2007(19):153-160.
    [12] Vincent P,Larochelle H,Lajoie I,et al. Stacked de-noising auto-encoders:learning useful representations in a deep network with a local de-noising criterion. Journal of Machine Learning Research,2010,11(12):3371-3408.
    [13] Bengio Y,Yao L,Alain G,et al. Generalized de-noising autoencoders as generative models//Advances in Neural Information Processing Systems. Lake Tahoe,2013:899-907.
    [14]秦胜君,卢志平.稀疏自编码器在文本分类中的应用研究[J].科学技术与工程,2013,13(31):9422-9426.
    [15]邬美银,陈黎.基于深度学习的监控视频树叶遮挡检测[J].武汉科技大学学报,2016,39(1):69-74.
    [16]孙菲菲,林平,曹卓.基于旋转森林集成学习的涉恐实体挖掘研究[J].情报杂志,2015,34(5):190-195.
    [17]郭璇,吴文辉,肖治庭,等.基于深度学习和公开来源信息的反恐情报挖掘[J].情报理论与实践,2017,40(9):135-139.
    [18] Hinton G E,Salakhutdinov R R. Reducing the dimensionality of data with neural networks[J]. Science,2006,313(5786):504.
    [19] Hinton G E. Products of experts by minimizing contrastive divergence[J]. Neural Computation,2002(14):1771-1800.
    [20]龚萍,王娜娜,罗举建.基于稀疏自编码神经网络的肺结节特征提取及良恶性分类[J].医疗卫生装备,2015,36(12):7-10.
    [21]赵瑞娟,官金安,谢国栋.稀疏降噪自编码器在IR-BCI的应用研究[J].计算机工程与应用,2017,53(11):167-171.
    [22]刘勘,袁蕴英.基于自动编码器的短文本特征提取及聚类研究[J].北京大学学报(自然科学版),2015,51(2):282-288.
