Video Semantic Analysis Based on a CNN Pre-Trained with Topographic Sparse Coding (基于拓扑稀疏编码预训练CNN的视频语义分析)

English title: Video Semantic Analysis Based on Topographic Sparse Pre-Training CNN
Authors: Cheng Xiaoyang (程晓阳); Zhan Yongzhao (詹永照); Mao Qirong (毛启容); Zhan Zhicai (詹智财)
Author affiliation: School of Computer Science and Telecommunication Engineering, Jiangsu University
Keywords: video semantics; convolutional neural network (CNN); deep learning; topographic sparse encoder; pre-training
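Chinese journal name: JFYZ
English journal name: Journal of Computer Research and Development
Institution: School of Computer Science and Telecommunication Engineering, Jiangsu University
Publication date: 2018-12-15
Publisher: Journal of Computer Research and Development (计算机研究与发展)
Year: 2018
Issue: v.55
Funding: National Natural Science Foundation of China (61672268); Key Research and Development Program of Jiangsu Province (BE2015137)
Language: Chinese
Page: JFYZ201812013
Page count: 12
CN: 12
ISSN: 11-1777/TP
Classification number: 121-132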

Abstract

Deep learning of video features has become a research hotspot in video semantic analysis tasks such as video object detection, action recognition, and video event detection. The topographic information of video images plays an important role in describing the associations within image content, and optimizing feature learning on labeled videos while taking the characteristics of video sequences into account helps improve the discriminability of the video feature representation. Based on these considerations, this paper proposes a video feature learning method based on a CNN pre-trained with topographic sparse coding and applies it to video semantic analysis. The method divides video feature learning into two stages: semi-supervised video image feature learning and supervised optimization learning of video sequence features. 1) In the semi-supervised video image feature learning stage, a new topographic sparse encoder is constructed and used to pre-train the parameters of each network layer, so that the feature representation of a video image reflects the topographic information of the image; the network parameters are then fine-tuned by logistic regression on labeled video concept categories at the fully connected layer for image feature learning. 2) In the supervised video sequence feature optimization stage, a fully connected layer for video feature learning is constructed, the key-frame features of labeled video sequences are combined, and a logistic regression constraint is established to fine-tune the network parameters, yielding video features that are more discriminative across categories. Video semantic concept detection experiments with related methods were carried out on typical video datasets. The results show that the proposed method represents video features more discriminatively and effectively improves the detection rate of video semantic concepts.
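
The abstract does not state the objective of the topographic sparse encoder. As a rough illustration only, the sketch below uses a group-sparsity penalty that is common in topographic sparse coding: hidden units are laid out on a 2-D grid, squared activations are pooled over each unit's neighborhood, and the square root of each pooled value is penalized. This penalty, the tied-weight autoencoder, and every name and parameter here (`grouping_matrix`, `topographic_penalty`, `autoencoder_loss`, `radius`, `lam`, `eps`) are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def grouping_matrix(side, radius=1):
    """Return a 0/1 matrix V of shape (side*side, side*side).

    Row g marks the hidden units inside the square neighborhood of unit g
    on a side x side grid with wrap-around borders (hypothetical layout).
    """
    n = side * side
    V = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            g = r * side + c
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = (r + dr) % side
                    cc = (c + dc) % side
                    V[g, rr * side + cc] = 1.0
    return V

def topographic_penalty(H, V, eps=1e-6):
    """Group-sparsity penalty on activations H of shape (batch, n_hidden).

    Squared activations are pooled per neighborhood, then the square roots
    are summed over groups and averaged over the batch.
    """
    pooled = (H ** 2) @ V.T          # (batch, n_groups)
    return np.sqrt(pooled + eps).sum(axis=1).mean()

def autoencoder_loss(X, W, b, c, V, lam=0.1):
    """Reconstruction error plus topographic penalty (assumed objective)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid encoder
    X_hat = H @ W.T + c                       # tied-weight linear decoder
    recon = 0.5 * ((X_hat - X) ** 2).sum(axis=1).mean()
    return recon + lam * topographic_penalty(H, V)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = grouping_matrix(side=6, radius=1)
    X = rng.normal(size=(8, 20))              # 8 toy input patches
    W = rng.normal(size=(20, 36)) * 0.1       # 36 hidden units on a 6x6 grid
    print(autoencoder_loss(X, W, np.zeros(36), np.zeros(20), V))
```

Under this penalty, activations concentrated in neighboring grid units cost less than the same activations scattered across the grid, which is what encourages nearby hidden units to respond to related image structure. In a pre-training setting such a loss would be minimized layer by layer before the supervised fine-tuning described in the abstract.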

