Abstract
Existing human action recognition methods based on convolutional neural networks (CNNs) typically exploit only local spatial or temporal features. To address this limitation, this paper proposes a two-stream CNN action recognition model that fuses global temporal and spatial features of human actions. The spatial stream performs deep learning on action images, using multi-frame fusion to improve accuracy, while the global temporal stream performs deep learning on the energy motion history image (EMHI); the outputs of the two streams are then fused to recognize the action. To mitigate the shortage of training samples, the networks are pre-trained on existing large-scale datasets. Experiments on the UCF101 dataset and on a small-sample dataset collected for this project demonstrate the effectiveness of the method.
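The pipeline the abstract describes — a temporal descriptor derived from a motion history image, plus late fusion of two streams' class scores — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classic MHI update of Bobick and Davis is used as a stand-in for the EMHI, and the parameters `tau` (history length), `xi` (motion threshold), and the fusion weight `w` are hypothetical choices.

```python
import numpy as np

def motion_history_image(frames, tau=10, xi=20):
    """Classic MHI update (Bobick & Davis): a pixel that moved in the
    current frame is set to tau; otherwise its value decays by 1 per
    frame. Recent motion is bright, older motion fades. `tau` and `xi`
    are illustrative values, not the paper's settings."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.float32)
    for frame in frames[1:]:
        cur = frame.astype(np.float32)
        moving = np.abs(cur - prev) > xi          # simple frame differencing
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = cur
    return mhi / tau                              # normalise to [0, 1] for the CNN input

def late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Weighted average of per-class scores from the two streams;
    the weight w=0.5 is an assumed, not reported, value."""
    s = np.asarray(spatial_scores, dtype=np.float32)
    t = np.asarray(temporal_scores, dtype=np.float32)
    return w * s + (1.0 - w) * t
```

In a full system, `motion_history_image` would run over a sliding window of grayscale frames and its output would be fed to the temporal-stream CNN, while sampled RGB frames feed the spatial stream; `late_fusion` then combines the two streams' softmax scores before taking the argmax.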