Research on Action Recognition Based on 3D Convolution and Bidirectional LSTM

Original title: 基于三维卷积与双向LSTM的行为识别研究
Authors: WANG Yi (王毅); MA Cuihong (马翠红); MAO Zhiqiang (毛志强)
Affiliation: College of Electrical Engineering, North China University of Science and Technology (华北理工大学电气工程学院)
Keywords: behavior recognition; 3D convolution; bidirectional LSTM; double center loss; joint training; computer vision
Journal: Modern Electronics Technique (现代电子技术)
Journal code: XDDJ
Publication date: 2019-07-15
Publisher: Modern Electronics Technique
Year: 2019
Volume and serial number: v.42; No.541
Funding: National Natural Science Foundation of China (61171058)
Language: Chinese
Record ID: XDDJ201914018
Page count: 5
Issue: 14
CN: 61-1224/TN
Pages: 86-90

Abstract

Accurately identifying video content is a direction for future Internet applications, and action recognition in video is a research focus in computer vision. To make full use of the information in video and improve recognition accuracy, this paper proposes an action recognition algorithm based on 3D convolution and bidirectional LSTM. A spatial attention module based on 3D convolution is designed to focus on the salient features of spatial regions. To better handle long videos, a new temporal attention module based on bidirectional LSTM (long short-term memory network) is introduced, which aims to attend to key video clips rather than key frames of a given video. A double center loss (a loss function) is then used to optimize the network and jointly train the two-stage strategy, so that the network can explore correlations in the spatial and temporal domains simultaneously. Tests on the HMDB-51 and UCF-101 datasets show that the proposed algorithm can accurately distinguish similar actions in video, improves action recognition accuracy, and achieves notable recognition performance.
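The abstract names two mechanisms without giving formulas: temporal attention over clips and a center-style loss. The following is a minimal sketch of the generic versions of both, not the authors' implementation. All names (`attention_pool`, `center_loss`, `clip_scores`) are hypothetical. The clip scores are assumed to come from some upstream scorer such as a bidirectional LSTM, and the loss shown is the standard single center-loss term. How the paper combines terms into a "double" center loss is not stated in this record.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(clip_features, clip_scores):
    """Weight each clip feature by its softmax attention weight and sum.

    clip_features: list of equal-length feature vectors, one per clip.
    clip_scores: one raw relevance score per clip (assumed to be
    produced upstream, for example by a bidirectional LSTM).
    """
    weights = softmax(clip_scores)
    dim = len(clip_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, clip_features))
        for d in range(dim)
    ]
    return pooled, weights


def center_loss(features, labels, centers):
    """Standard center loss: half the mean squared distance between
    each feature vector and the center of its class."""
    total = 0.0
    for f, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(f, centers[y]))
    return 0.5 * total / len(features)
```

With equal scores, every clip receives the same weight and the pooled feature is the plain average. Raising one clip's score shifts the pooled feature toward that clip, which is the sense in which attention "focuses" on key clips.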

