RGB-D Action Recognition: Recent Advances and Future Perspectives

Original title (Chinese): RGB-D行为识别研究进展及展望
Authors: 胡建芳; 王熊辉; 郑伟诗; 赖剑煌
English authors: HU Jian-Fang; WANG Xiong-Hui; ZHENG Wei-Shi; LAI Jian-Huang
English affiliations: School of Data and Computer Science, Sun Yat-sen University; Guangdong Province Key Laboratory of Computational Science; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education; School of Electronics and Information Technology, Sun Yat-sen University
Keywords: RGB-D; action recognition; skeleton; deep learning
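Chinese journal name: MOTO
English journal name: Acta Automatica Sinica
Institutions: School of Data and Computer Science, Sun Yat-sen University; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education; School of Electronics and Information Technology, Sun Yat-sen University
Publication date: 2019-01-15 09:18
Publisher: Acta Automatica Sinica
Year: 2019
Issue: v.45
Funding: Supported by the National Natural Science Foundation of China (61702567, 61876104); the Guangdong Province Major Project (2018B010109007); the Open Project Fund of the Guangdong Province Key Laboratory of Information Security Technology (2017B030314131)
Language: Chinese
Page: MOTO201905001
Page count: 12
CN: 05
ISSN: 11-2109/TP
Classification number: 3-14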

Abstract

Action recognition is an important research topic in computer vision, with key applications in security monitoring, robot design, autonomous driving, and smart home systems. Conventional action recognition methods based on RGB video alone are easily affected by action-irrelevant factors such as background clutter and illumination variation, which limits their recognition accuracy. The emergence of low-cost RGB-D cameras has opened a new avenue for addressing the problem. By aggregating the RGB, depth, and skeleton modalities, each of which describes an action from a different perspective, RGB-D action recognition can fuse complementary information and overcome the drawbacks of single-modality RGB methods, and it has therefore become an active research area in recent years. This paper systematically reviews recent advances in RGB-D action recognition and discusses future directions. We first briefly introduce the public datasets commonly used in the field. We then review representative models and the latest progress in multimodal RGB-D action recognition, including the application of deep learning techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Finally, we compare and analyze the strengths and weaknesses of existing methods on three public RGB-D action datasets and outline open problems for future research.

