The Progress of Human Action Recognition in Videos Based on Deep Learning: A Review

Authors: LUO Hui-lan; TONG Kang; KONG Fan-sheng
Affiliations: School of Information Engineering, Jiangxi University of Science and Technology; School of Computer Science and Technology, Zhejiang University
Keywords: action recognition; review; convolutional neural network; deep learning
Journal code: DZXU
Journal: Acta Electronica Sinica
Publication date: 2019-05-15
Publisher: Acta Electronica Sinica
Year: 2019
Issue: v.47; No.435
Funding: National Natural Science Foundation of China (No.61462035, No.61862031); Natural Science Foundation of Jiangxi Province (No.20171BAB202014); Jiangxi Province Young Scientist Training Program (No.20153BCB23010)
Language: Chinese
Page: DZXU201905025
Page count: 12
CN: 05
ISSN: 11-2087/TN
Classification number: 188-199

Abstract

Human action recognition in videos is a challenging topic in computer vision, with wide application in video information retrieval, daily-life security, public video surveillance, human-computer interaction, and scientific cognition. This paper first briefly introduces the research background, significance, and difficulties of action recognition. It then reviews deep learning based action recognition methods in detail along three dimensions: the type and number of input signals, whether traditional feature extraction methods are combined with the model, and model pre-training. It also compares and analyzes their recognition performance on the UCF101 and HMDB51 datasets. Finally, possible future directions for action recognition are discussed from three perspectives: video preprocessing, representation of human motion information in videos, and model learning and training.

