Dynamic texture and scene classification by transferring deep image features
Abstract
Dynamic texture and scene classification are two fundamental problems in understanding natural video content. Extracting robust and effective features is a crucial step toward solving these problems. However, existing approaches suffer from sensitivity to varying illumination, viewpoint changes, or camera motion, and/or from a lack of spatial information. Inspired by the success of deep structures in image classification, we attempt to leverage a deep structure to extract features for dynamic texture and scene classification. To tackle the challenges of training a deep structure, we propose to transfer prior knowledge from the image domain to the video domain. More specifically, we apply a well-trained Convolutional Neural Network (ConvNet) as a feature extractor to extract mid-level features from each frame, and then form the video-level representation by concatenating the first- and second-order statistics over the mid-level features. We term this two-level feature extraction scheme a Transferred ConvNet Feature (TCoF). Moreover, we explore two different implementations of the TCoF scheme, i.e., the spatial TCoF and the temporal TCoF. In the spatial TCoF, the mean-removed frames are used as the inputs to the ConvNet, whereas in the temporal TCoF, the differences between adjacent frames are used as the inputs. We systematically evaluate the proposed spatial and temporal TCoF schemes on three benchmark data sets, DynTex, YUPENN, and Maryland, and demonstrate that the proposed approach yields superior performance.
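Below is a minimal sketch of the two-level TCoF pipeline described in the abstract. The abstract does not specify the backbone, the feature layer, or the exact form of the second-order statistic, so several choices here are assumptions: a torchvision VGG-16 pretrained on ImageNet as the "well-trained ConvNet", its penultimate fully connected layer as the mid-level feature, the per-dimension variance as a simple diagonal stand-in for the second-order statistic, and the function name `extract_tcof` is hypothetical.

```python
import torch
import torchvision.models as models

def extract_tcof(frames, mode="spatial"):
    """Sketch of the TCoF descriptor for one video.

    frames: tensor of shape (T, 3, 224, 224), values in [0, 1].
    mode:   'spatial'  -> mean-removed frames as ConvNet inputs;
            'temporal' -> adjacent-frame differences as ConvNet inputs,
            following the two variants described in the abstract.
    """
    if mode == "spatial":
        # Subtract the per-video mean frame from every frame.
        inputs = frames - frames.mean(dim=0, keepdim=True)
    else:
        # Differences between adjacent frames.
        inputs = frames[1:] - frames[:-1]

    # Pretrained ConvNet used as a fixed feature extractor (assumption:
    # VGG-16; the final classification layer is dropped so the output is
    # the 4096-d penultimate fully connected activation per frame).
    backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    extractor = torch.nn.Sequential(
        backbone.features,
        backbone.avgpool,
        torch.nn.Flatten(),
        *list(backbone.classifier.children())[:-1],
    ).eval()

    with torch.no_grad():
        feats = extractor(inputs)  # (T or T-1, 4096) mid-level features

    # Video-level representation: concatenate first-order (mean) and
    # second-order (here, variance) statistics over the frame features.
    mu = feats.mean(dim=0)
    var = feats.var(dim=0)
    return torch.cat([mu, var])  # 8192-d TCoF descriptor
```

The resulting fixed-length descriptor can then be fed to any standard classifier (e.g., a linear SVM); the abstract does not state which classifier the authors use.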
