Abstract
Existing human action recognition methods based on convolutional neural networks (CNNs) typically exploit only local spatial or temporal features. To address this limitation, this paper proposes a two-stream CNN action recognition model that fuses global temporal and spatial features of human actions. The spatial stream performs deep learning on action images, using multi-frame fusion to improve accuracy, while the global temporal stream performs deep learning on the energy motion history image (EMHI); the outputs of the two streams are then fused to recognize the action. To mitigate the shortage of training samples, the networks are pre-trained on existing large-scale datasets. Experiments on the UCF101 dataset and on a small-sample dataset collected for this project demonstrate the effectiveness of the method.
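The pipeline the abstract describes — a temporal descriptor derived from a motion history image, plus late fusion of two streams' class scores — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classic MHI update of Bobick and Davis is used as a stand-in for the EMHI, and the parameters `tau` (history length), `xi` (motion threshold), and the fusion weight `w` are hypothetical choices.

```python
import numpy as np

def motion_history_image(frames, tau=10, xi=20):
    """Classic MHI update (Bobick & Davis): a pixel that moved in the
    current frame is set to tau; otherwise its value decays by 1 per
    frame. Recent motion is bright, older motion fades. `tau` and `xi`
    are illustrative values, not the paper's settings."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.float32)
    for frame in frames[1:]:
        cur = frame.astype(np.float32)
        moving = np.abs(cur - prev) > xi          # simple frame differencing
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = cur
    return mhi / tau                              # normalise to [0, 1] for the CNN input

def late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Weighted average of per-class scores from the two streams;
    the weight w=0.5 is an assumed, not reported, value."""
    s = np.asarray(spatial_scores, dtype=np.float32)
    t = np.asarray(temporal_scores, dtype=np.float32)
    return w * s + (1.0 - w) * t
```

In a full system, `motion_history_image` would run over a sliding window of grayscale frames and its output would be fed to the temporal-stream CNN, while sampled RGB frames feed the spatial stream; `late_fusion` then combines the two streams' softmax scores before taking the argmax.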