An Improved VLAD Coding Method Based on Fusion Feature in Action Recognition
  • Authors: LUO Hui-lan (罗会兰); WANG Chan-juan (王婵娟)
  • Affiliation: School of Information Engineering, Jiangxi University of Science and Technology
  • Keywords: action recognition; position information; concatenation; representation vector
  • Journal: Acta Electronica Sinica (电子学报)
  • Publication date: 2019-01-15
  • Year: 2019
  • Volume/Issue: Vol. 47, No. 1 (cumulative No. 431)
  • Pages: 51-60 (10 pages)
  • CN: 11-2087/TN
  • Record ID: DZXU201901007
  • Funding: National Natural Science Foundation of China (No. 61862031, No. 61462035); Natural Science Foundation of Jiangxi Province project "Research on Self Deep Learning Models for Visual Feature Representation" (No. 20171BAB202014)
  • Language: Chinese
Abstract
This paper proposes IVLAD (Improved Vector of Locally Aggregated Descriptors), a new coding method based on fused features, and applies it to action recognition, where it yields a clear performance gain. Because a single feature descriptor cannot adequately describe the spatial information of a video, position information is mapped into the feature space and jointly encoded to obtain the representation vector. To overcome the limitation of traditional VLAD, which considers only the distances between features and cluster centers, the encoding stage additionally computes the difference between each cluster center and its most similar feature. To further improve recognition accuracy, the representation vector is concatenated with itself to raise its dimensionality. The paper also studies how the visual dictionary size, the position dictionary size, and the normalization method affect recognition accuracy. Experiments on two large datasets, UCF101 and HMDB51, show that the proposed method clearly outperforms the traditional VLAD method.
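To make the encoding steps named in the abstract concrete, the sketch below walks through them in NumPy. It is a minimal reconstruction from the abstract alone, not the authors' implementation: fusing position by concatenating normalized (x, y, t) coordinates onto each descriptor, adding the extra center-to-nearest-feature difference into each cluster's residual block, and applying power followed by L2 normalization are all assumptions, and ivlad_encode is a hypothetical name. The centers would come from k-means run on position-augmented training descriptors.

    import numpy as np

    def ivlad_encode(descriptors, positions, centers):
        """Sketch of the IVLAD idea from the abstract (not the authors' code).

        descriptors: (N, d) local descriptors extracted from one video
        positions:   (N, 3) normalized (x, y, t) locations of those descriptors
        centers:     (K, d + 3) k-means centers learned on position-augmented
                     descriptors (assumed fusion scheme)
        """
        # (1) Fuse position information into the feature space
        #     (assumption: plain concatenation).
        feats = np.hstack([descriptors, positions])                  # (N, d + 3)

        # Hard-assign each augmented feature to its nearest center.
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                                # (N,)

        K, D = centers.shape
        v = np.zeros((K, D))
        for k in range(K):
            members = feats[assign == k]
            if len(members) > 0:
                # Classic VLAD residual: sum of (feature - center).
                v[k] = (members - centers[k]).sum(axis=0)
            # (2) Extra term from the abstract: difference between center k
            #     and its single most similar feature in the video
            #     (assumed to be added into the cluster's residual block).
            nearest = feats[dists[:, k].argmin()]
            v[k] += nearest - centers[k]

        v = v.ravel()
        # (3) Self-concatenation to raise the dimension, as the abstract states.
        v = np.concatenate([v, v])

        # (4) Power normalization then L2 normalization (a common VLAD choice).
        v = np.sign(v) * np.sqrt(np.abs(v))
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v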
