A two-stage framework for multimodal video classification is proposed.
The model is built on stacked contractive autoencoders.
The first stage is single-modal pre-training.
The second stage is multimodal fine-tuning.
The objective functions are optimized by stochastic gradient descent.
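
As a rough illustration of the kind of objective involved, the following minimal sketch pre-trains a single contractive autoencoder layer on one modality with per-sample stochastic gradient descent. It is not the authors' implementation: the sigmoid encoder, the tied-weight linear decoder, the squared-error reconstruction term, the penalty weight `lam`, the learning rate, and the function names `cae_loss_and_grads` and `pretrain_layer` are all assumptions chosen for illustration. The penalty is the standard contractive term, the squared Frobenius norm of the Jacobian of the hidden code with respect to the input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cae_loss_and_grads(W, b, c, x, lam):
    """Loss and gradients of one contractive autoencoder layer on one sample x.

    Assumed form (hypothetical, for illustration only):
      h = sigmoid(W x + b), r = W^T h + c,
      loss = 0.5 * ||r - x||^2 + lam * ||dh/dx||_F^2
    """
    h = sigmoid(W @ x + b)          # hidden code, shape (k,)
    r = W.T @ h + c                 # tied-weight linear reconstruction, shape (d,)
    e = r - x                       # reconstruction error
    g = h * (1.0 - h)               # derivative of the sigmoid at each hidden unit
    s = np.sum(W ** 2, axis=1)      # squared norm of each row of W
    # For a sigmoid encoder, ||dh/dx||_F^2 = sum_j g_j^2 * s_j.
    loss = 0.5 * (e @ e) + lam * np.sum(g ** 2 * s)
    # Gradient w.r.t. the pre-activation a = W x + b (both loss terms).
    da = (W @ e) * g + lam * 2.0 * g ** 2 * s * (1.0 - 2.0 * h)
    # W appears in the encoder, the tied decoder, and the penalty directly.
    dW = np.outer(da, x) + np.outer(h, e) + lam * 2.0 * (g ** 2)[:, None] * W
    return loss, dW, da, e          # loss, then gradients w.r.t. W, b, c

def pretrain_layer(X, k, lam=0.1, lr=0.05, epochs=20, seed=0):
    """Pre-train one layer on single-modality features X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.1 * rng.standard_normal((k, d))
    b = np.zeros(k)
    c = np.zeros(d)
    history = []
    for _ in range(epochs):
        total = 0.0
        for i in rng.permutation(n):    # one randomly ordered sample per update
            loss, dW, db, dc = cae_loss_and_grads(W, b, c, X[i], lam)
            W -= lr * dW
            b -= lr * db
            c -= lr * dc
            total += loss
        history.append(total / n)       # mean per-sample loss over the epoch
    return (W, b, c), history
```

Under these assumptions, stacking would amount to feeding the hidden codes of a trained layer as input to the next call of `pretrain_layer`, once per modality, before any joint fine-tuning across modalities. The fine-tuning objective is not specified in the highlights, so it is left out of the sketch.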