Human Action Recognition Based on Visual Attention
  • English title: Human Action Recognition Based on Visual Attention
  • Authors: KONG Yan; LIANG Hong; ZHANG Qian
  • Institution: College of Computer & Communication Engineering, China University of Petroleum (East China)
  • Keywords: action recognition; two-stream architecture; Convolutional Neural Network (CNN); video representation; visual attention
  • Journal: Computer Systems & Applications (XTYY)
  • Publication date: 2019-05-15
  • Year: 2019
  • Volume/Issue: Vol. 28, No. 05
  • Pages: 44-50 (7 pages)
  • CN: 11-2854/TP
  • Fund: Special Project on Innovation Methods of the Ministry of Science and Technology of China (2015IM010300)
  • Language: Chinese
  • Record ID: XTYY201905006
Abstract
Recognition of human actions in videos has become an important research area in computer vision in recent years, but existing methods represent videos inadequately and cannot focus on the salient regions within a frame. We propose a deep convolutional neural network based on visual attention, which effectively attaches a weight to the video representation features, attends to the informative regions within them, and achieves more accurate action recognition. We conducted experiments on the HMDB51 dataset and our self-built Oilfield-7 dataset to verify the effectiveness of the proposed network for human actions at oilfield sites. The experimental results show that the proposed method has certain advantages over the two-stream architectures that have achieved excellent performance.
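The record does not give the paper's exact attention formulation, but the core idea it describes — weighting the spatial locations of a CNN feature map so that informative regions contribute more to the video representation — can be sketched as softmax attention pooling. The shapes and the scoring projection `w` below are hypothetical illustrations, not the paper's actual parameters:

```python
import numpy as np

def spatial_attention_pool(features, w):
    """Attention-weighted pooling over the spatial locations of a conv feature map.

    features: (H*W, C) array, one frame's conv features flattened over space
              (hypothetical layout for illustration).
    w: (C,) scoring vector that rates how informative each location is
       (stands in for a learned attention projection).
    Returns a (C,) feature vector in which salient locations dominate.
    """
    scores = features @ w                    # (H*W,) relevance score per location
    scores = scores - scores.max()           # stabilize the softmax numerically
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    return alpha @ features                  # weighted sum over spatial locations

# Toy example: a 7x7 feature map with 4 channels
rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 4))
w = rng.standard_normal(4)
pooled = spatial_attention_pool(feats, w)
print(pooled.shape)  # (4,)
```

In a two-stream setting, such a weighted pooling would be applied to the appearance and motion streams' feature maps before classification, replacing uniform average pooling that treats every region equally.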
References
1. He KM, Zhang XY, Ren SQ, et al. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA. 2016. 770-778.
2. He KM, Zhang XY, Ren SQ, et al. Identity mappings in deep residual networks. European Conference on Computer Vision. Springer. The Netherlands. 2016. 630-645.
3. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
4. Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261, 2017.
5. Nguyen TV, Song Z, Yan SC. STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25(1): 77-86. [doi:10.1109/TCSVT.2014.2333151]
6. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199, 2014.
7. Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile. 2015. 4489-4497.
8. Ji SW, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231. [doi:10.1109/TPAMI.2012.59]
9. Girdhar R, Ramanan D, Gupta A, et al. ActionVLAD: Learning spatio-temporal aggregation for action classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 3165-3174.
10. Laptev I, Marszalek M, Schmid C, et al. Learning realistic human actions from movies. 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA. 2008. 1-8.
11. Wang H, Ullah MM, Kläser A, et al. Evaluation of local spatio-temporal features for action recognition. BMVC 2009 - British Machine Vision Conference. London, UK. 2009. 124.1-124.11.
12. Wang H, Kläser A, Schmid C, et al. Action recognition by dense trajectories. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2011). Providence, RI, USA. 2011. 3169-3176.
13. Wang H, Schmid C. Action recognition with improved trajectories. Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia. 2013. 3551-3558.
14. Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA. 2014. 1725-1732.
15. Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 1933-1941.
16. Ng JYH, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 4694-4702.
17. Wang LM, Xiong YJ, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision. The Netherlands. 2016. 20-36.
18. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. 2017. 4724-4733.
19. Hou XD, Zhang LQ. Dynamic visual attention: Searching for coding length increments. Advances in Neural Information Processing Systems. 2009. 681-688.
20. Mathe S, Sminchisescu C. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In: Fitzgibbon A, Lazebnik S, Perona P, et al., eds. Computer Vision - ECCV 2012. Berlin, Heidelberg: Springer, 2012. 842-856.
21. Wang LM, Xiong YJ, Wang Z, et al. Towards good practices for very deep two-stream ConvNets. arXiv:1507.02159, 2015.
22. Idrees H, Zamir AR, Jiang YG, et al. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 2017, 155: 1-23. [doi:10.1016/j.cviu.2016.10.018]
23. Kuehne H, Jhuang H, Garrote E, et al. HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision. Barcelona, Spain. 2011. 2556-2563.
24. Soomro K, Zamir AR, Shah M. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv:1212.0402, 2012.
25. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
26. Wedel A, Pock T, Zach C, et al. An improved algorithm for TV-L1 optical flow. In: Cremers D, Rosenhahn B, Yuille AL, et al., eds. Statistical and Geometrical Approaches to Visual Motion Analysis. Berlin, Heidelberg: Springer, 2009. 23-45.
27. Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA. 2009. 248-255.
