Abstract
How to mine the latent semantic correlations between data of different modalities is the core problem of cross-modal retrieval. Previous work has shown that models which fuse representation learning and correlation learning into a single process are well suited to the cross-modal retrieval task, but existing models of this kind capture only 1-1 correspondences between the abstraction layers of different modalities. Because heterogeneous multimodal data differ in abstraction granularity, their correlations are likely to span several abstraction layers at once rather than occur only at one designated layer. This paper therefore proposes a cross-modal retrieval model that fuses multiple semantic layers. Exploiting the bidirectional structure of the deep Boltzmann machine, an undirected graphical model, the proposed model associates each semantic layer of the text modality with multiple semantic layers of the image modality simultaneously, thereby mining the inherent N-M correlations between the abstraction layers of different modalities more thoroughly. Experiments on three public datasets show that the model outperforms previous comparable cross-modal retrieval models and achieves higher retrieval accuracy.
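The core idea, fusing several abstraction layers of each modality into one shared code rather than joining only the top layers, can be illustrated with a minimal sketch. This is not the authors' implementation: all dimensions, weight names, and the random initialisation below are hypothetical, and a single bottom-up mean-field pass stands in for full DBM inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: image/text visible units, two hidden
# (abstraction) layers per modality, and a shared joint layer.
D_img, D_txt, H1, H2, J = 64, 32, 48, 24, 16

# Randomly initialised weights stand in for trained DBM parameters.
W_img1 = rng.normal(0, 0.01, (D_img, H1))  # image visible -> image h1
W_img2 = rng.normal(0, 0.01, (H1, H2))     # image h1 -> image h2
W_txt1 = rng.normal(0, 0.01, (D_txt, H1))  # text visible -> text h1
W_txt2 = rng.normal(0, 0.01, (H1, H2))     # text h1 -> text h2

# N-M cross-modal association: the joint layer is driven by BOTH
# abstraction levels of BOTH modalities, not just the top layers.
W_j_img1 = rng.normal(0, 0.01, (H1, J))
W_j_img2 = rng.normal(0, 0.01, (H2, J))
W_j_txt1 = rng.normal(0, 0.01, (H1, J))
W_j_txt2 = rng.normal(0, 0.01, (H2, J))

def mean_field_up(v, W1, W2):
    """One bottom-up mean-field pass through a modality pathway."""
    h1 = sigmoid(v @ W1)
    h2 = sigmoid(h1 @ W2)
    return h1, h2

def joint_representation(v_img, v_txt):
    """Fuse multiple semantic layers of each modality into one code."""
    i1, i2 = mean_field_up(v_img, W_img1, W_img2)
    t1, t2 = mean_field_up(v_txt, W_txt1, W_txt2)
    return sigmoid(i1 @ W_j_img1 + i2 @ W_j_img2 +
                   t1 @ W_j_txt1 + t2 @ W_j_txt2)

v_img = rng.random(D_img)
v_txt = rng.random(D_txt)
code = joint_representation(v_img, v_txt)
print(code.shape)  # (16,)
```

For retrieval, such a joint code (or the modality-specific layers projected into it) would be compared across items with a distance measure; in a trained model the cross-layer weights would come from DBM learning rather than random initialisation.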