Specific video identification via joint learning of latent semantic concept, scene and temporal structure
Abstract
In this paper, based on three typical characteristics of specific videos, i.e., theme, scene and temporal structure, a novel data-driven identification architecture for specific videos is proposed. Concretely, at the frame level, semantic features and scene features are extracted from two independent Convolutional Neural Networks (CNNs). At the video level, the Vector of Locally Aggregated Descriptors (VLAD) is first adopted to encode the spatial representation, and then multi-layer Long Short-Term Memory (LSTM) networks are introduced to represent temporal information. Additionally, a large-scale specific video dataset (SVD) is built for evaluation. The experimental results show that our method obtains an impressive 98% mAP. Moreover, to validate the generalization capability of the proposed architecture, extensive experiments are conducted on two public datasets, Columbia Consumer Videos (CCV) and Unstructured Social Activity Attribute (USAA). The comparison results indicate that our approach outperforms state-of-the-art methods on USAA and achieves comparable results on CCV.
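The VLAD encoding step mentioned in the abstract aggregates frame-level descriptors into a single video-level vector. A minimal NumPy sketch of standard VLAD encoding is shown below; it is not the paper's exact implementation, and assumes the visual-word centroids have already been computed (e.g. by k-means over frame features):

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Encode local descriptors into a VLAD vector.

    descriptors: (N, D) frame-level features
    centers:     (K, D) visual-word centroids (e.g. from k-means)
    returns:     (K * D,) normalized VLAD vector
    """
    # Assign each descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            # Accumulate residuals between members and their centroid
            vlad[k] = (members - centers[k]).sum(axis=0)

    vlad = vlad.reshape(-1)
    # Power normalization (signed square root), then L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting fixed-length vector (dimension K·D regardless of the number of frames) is what makes VLAD convenient as a spatial summary before a sequence model such as an LSTM is applied over time.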