面向智能监控的行为识别 (Action recognition for intelligent monitoring)
  • Authors: Ma Yuxi (马钰锡); Tan Li (谭励); Dong Xu (董旭); Yu Chongchong (于重重)
  • Affiliation: Beijing Key Laboratory of Big Data Technology for Food Safety, College of Computer & Information Engineering, Beijing Technology & Business University
  • Keywords: action recognition; target detection; deep learning; convolutional neural network; recurrent neural network
  • Journal: Journal of Image and Graphics (中国图象图形学报; journal code ZGTB)
  • Publication date: 2019-02-16
  • Volume/Issue: Vol. 24, Issue 02, 2019 (cumulative No. 274)
  • Funding: National Natural Science Foundation of China (61702020); Beijing Natural Science Foundation (4172013)
  • Language: Chinese
  • Pages: 128-136 (9 pages)
  • CN: 11-3758/TB
  • Database record ID: ZGTB201902012
Abstract
Objective: To further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, this paper proposes a human action recognition algorithm, LC-YOLO (LSTM and CNN based on YOLO), which builds on YOLO (you only look once: unified, real-time object detection) combined with LSTM (long short-term memory) and CNN (convolutional neural network). Method: Exploiting the real-time performance of YOLO target detection, specific actions in the surveillance video are first detected as they occur, and deep features are extracted after the target's size, position, and other information are obtained; noise data from irrelevant regions of the image are then removed; finally, an LSTM models the time series and makes the final judgment on the action sequence in the surveillance video. Result: Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms; the method performs well for action recognition in intelligent monitoring. Conclusion: The proposed action recognition algorithm effectively improves the real-time performance and accuracy of action recognition, and it adapts well to intelligent monitoring with high real-time requirements and complex scenes, where it has broad application prospects.
        Objective Mainstream action recognition methods still face two main challenges: the extraction of target features, and the speed and real-time performance of the overall recognition pipeline. At present, most state-of-the-art methods use a CNN (convolutional neural network) to extract depth features. However, CNNs are computationally expensive, and most regions in a video stream do not contain the target, so extracting features from an entire image is certainly wasteful. Classical target detection approaches, such as the optical flow method, are not real-time; they are unstable, susceptible to external conditions such as illumination, camera angle, and distance, and they increase the amount of computation and reduce time efficiency. Therefore, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO), which is based on YOLO (you only look once: unified, real-time object detection) combined with LSTM (long short-term memory) and CNN, is proposed to improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios. Method The LC-YOLO algorithm consists of three parts, namely, target detection, feature extraction, and action recognition. YOLO target detection is added as an aid to the mainstream CNN + LSTM framework. The fast, real-time nature of YOLO target detection is exploited to detect specific actions in surveillance video as they occur; target size, position, and other information are obtained; features are extracted; and noise data from unrelated areas of the image are efficiently removed. Combined with LSTM modeling of the time series, a final action judgment is made for the sequence of actions in the surveillance video. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category. The single-action recognition process of the LC-YOLO algorithm can be described as follows. 1) YOLO, which runs at 45 frames/s, detects specific action frames in the surveillance video in real time and extracts the position and confidence information (x, y, w, h, c); trained on a large number of samples, YOLO's action detection accuracy can exceed 90%. 2) On the basis of target detection, the image content within the target range is acquired and retained, and the noise interference from the remaining background is removed, which yields complete and accurate target features. A 4 096-dimensional depth feature vector is extracted with a VGGNet-16 model and passed to the recognition module together with the target size and position information (x, y, w, h, c) predicted by YOLO. 3) In contrast to a standard RNN, the LSTM architecture uses memory cells to store and output information; with an LSTM unit as the recognition module, the temporal relationship among multiple target actions is determined, and the action category of the entire action sequence is output.
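The three steps above map naturally onto a small neural-network pipeline. The following is a minimal PyTorch sketch of that pipeline, assuming the YOLO detector is a black box that returns integer box coordinates and a confidence (x, y, w, h, c) per frame; the module names, the 224 × 224 crop size, and the LSTM hidden width are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the LC-YOLO pipeline described above:
# YOLO-style detection -> crop the target region -> 4 096-d VGG-16
# feature -> LSTM over the frame sequence -> action category.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class TargetFeatureExtractor(nn.Module):
    """Crop the detected target and extract a 4 096-d VGG-16 depth feature."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None)  # pretrained weights would be loaded in practice
        self.backbone = vgg.features
        self.avgpool = vgg.avgpool
        # keep the classifier up to the second 4 096-d fully connected layer
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-3])

    def forward(self, frame: torch.Tensor, box: tuple) -> torch.Tensor:
        x, y, w, h = box                   # target position/size from the detector
        crop = frame[:, y:y + h, x:x + w]  # keep only the target, drop background noise
        crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                             mode="bilinear", align_corners=False)
        feat = torch.flatten(self.avgpool(self.backbone(crop)), 1)
        return self.fc(feat).squeeze(0)    # 4 096-d depth feature vector


class LCYOLOHead(nn.Module):
    """LSTM recognition module over per-frame [feature, x, y, w, h, c] vectors."""

    def __init__(self, num_actions: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4096 + 5, hidden_size=hidden,
                            batch_first=True)
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (T, 4101) sequence of per-frame vectors for one action clip
        _, (h_n, _) = self.lstm(seq.unsqueeze(0))
        return self.classifier(h_n[-1]).squeeze(0)  # action-class logits


@torch.no_grad()
def recognize(frames, detections, extractor, head):
    """frames: list of (3, H, W) tensors; detections: list of (x, y, w, h, c)."""
    steps = []
    for frame, (x, y, w, h, c) in zip(frames, detections):
        feat = extractor(frame, (x, y, w, h))
        meta = torch.tensor([float(x), float(y), float(w), float(h), float(c)])
        steps.append(torch.cat([feat, meta]))  # fuse feature with box information
    logits = head(torch.stack(steps))
    return int(logits.argmax())                # index of the predicted action class
```

In practice the extractor would load pretrained ImageNet weights and the whole network would be fine-tuned end to end on clips from datasets such as KTH and MSR, with the detector supplying one (x, y, w, h, c) tuple per frame.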
Compared with prior work, the contributions of the proposed algorithm are as follows. 1) Instead of motion foreground extraction, R-CNN, and other target detection methods, this study uses the YOLO algorithm, which is faster and more efficient. 2) The target size and position information are obtained once the target area is locked, and interference from unrelated areas of the picture is removed, so the CNN can be used effectively to extract depth features; both the accuracy of feature extraction and the overall time efficiency of action recognition are improved. Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms; the proposed method therefore performs well for action recognition in intelligent monitoring. Conclusion This study presents a human action recognition algorithm called LC-YOLO, which is based on YOLO combined with LSTM and CNN. The fast, real-time nature of YOLO target detection is exploited to detect specific actions in surveillance video as they occur; target size, position, and other information are obtained; features are extracted; and noise data from unrelated regions of the image are efficiently removed, which reduces the computational complexity of feature extraction and the time complexity of action recognition. Experimental results on the public action recognition datasets KTH and MSR show that the algorithm has good adaptability and broad application prospects for intelligent monitoring with high real-time requirements and complex scenes.
