Feedback Attention Model for Image Captioning (面向图像自动语句标注的注意力反馈模型)
  • Authors: Lyu Fan (吕凡); Hu Fuyuan (胡伏原); Zhang Yanning (张艳宁); Xia Zhenping (夏振平); Victor S. Sheng (盛胜利)
  • Keywords: image captioning; attention mechanism; attention feedback
  • Journal: Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报)
  • CNKI Journal Code: JSJF
  • Affiliations: School of Electronic & Information Engineering, Suzhou University of Science and Technology; College of Intelligence and Computing, Tianjin University; Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou University of Science and Technology; School of Computer Science and Engineering, Northwestern Polytechnical University; Department of Computer Science, University of Central Arkansas; Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency
  • Publication Date: 2019-07-15
  • Year: 2019
  • Volume: 31
  • Issue: 07
  • Pages: 64-71 (8 pages)
  • CN: 11-2925/TP
  • Record ID: JSJF201907008
  • Funding: National Natural Science Foundation of China (61876121, 61472267, 61728205, 61502329); Jiangsu Provincial Key Research and Development Program (BE2017663)
  • Language: Chinese
Abstract
Image captioning uses a computer to automatically generate sentences that describe the content of an image, and has wide applications in areas such as service robotics. Many attention-based captioning algorithms have been proposed, but the problem of attention distraction, and the sentence disorder it causes, has not yet been well solved. This paper extends the conventional attention mechanism with an attention feedback mechanism: the attended image features guide text generation, and the attended information in the generated text is in turn used to refine the attended regions of the image. This loop continually reinforces the matching of key information between image and text and thereby improves the generated sentence. Experimental results on the benchmark datasets Flickr8k, Flickr30k, and MSCOCO show that the model alleviates attention distraction and sentence disorder to a certain extent, locates attended regions more accurately than other attention-based methods, and generates more fluent sentences.
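The abstract describes a closed loop: attention over regional image features guides word generation, and attended information from the generated words is fed back to revise the image attention map. As a rough illustration only, the following PyTorch sketch shows what one decoding step of such a loop can look like; it is built on standard soft attention in the style of Xu et al. [2], it is not the authors' published implementation, and every name in it (FeedbackAttentionStep, att_word, and so on) is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackAttentionStep(nn.Module):
    # One decoding step in which the previously generated word is fed
    # back into the attention scoring (illustrative, not the paper's code).
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # image-feature branch
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # decoder-state branch
        self.att_word = nn.Linear(embed_dim, hidden_dim)   # textual feedback branch
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.logit = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_word, state):
        # feats: (B, L, feat_dim) regional CNN features; prev_word: (B,) word ids
        h, c = state
        w = self.embed(prev_word)                              # (B, E)
        # Scores depend on the decoder state AND the word fed back from the
        # previous step, so the text can re-weight the image attention map.
        e = self.att_score(torch.tanh(
                self.att_feat(feats)                           # (B, L, H)
                + self.att_hid(h).unsqueeze(1)                 # (B, 1, H)
                + self.att_word(w).unsqueeze(1)                # (B, 1, H)
            )).squeeze(-1)                                     # (B, L)
        alpha = F.softmax(e, dim=-1)                           # attention map
        ctx = (alpha.unsqueeze(-1) * feats).sum(dim=1)         # (B, feat_dim) context
        h, c = self.lstm(torch.cat([w, ctx], dim=-1), (h, c))
        return self.logit(h), alpha, (h, c)                    # next-word logits

# Example usage (B=2 images, L=49 regions of 512-d features):
# step = FeedbackAttentionStep(feat_dim=512, embed_dim=256, hidden_dim=512, vocab_size=10000)
# logits, alpha, state = step(torch.randn(2, 49, 512), torch.tensor([1, 1]),
#                             (torch.zeros(2, 512), torch.zeros(2, 512)))

Here the att_word branch is the feedback path: because the previous word enters the attention scoring, a word attached to the wrong region shifts alpha at the next step, which is the intuition behind the image-text matching loop the abstract describes.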
References
[1] Vinyals O, Toshev A, Bengio S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2015: 3156-3164
[2] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of International Conference on Machine Learning. Madison: Omnipress, 2015: 2048-2057
[3] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. New York: Curran Associates, 2012, 1: 1097-1105
[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[OL]. [2018-08-27]. https://arxiv.org/abs/1409.1556
[5] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2016: 770-778
[6] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[OL]. [2018-08-27]. https://arxiv.org/abs/1409.0473
[7] Sun Feng, Qin Kaihuai, Sun Wei, et al. Image saliency detection based on region merging[J]. Journal of Computer-Aided Design & Computer Graphics, 2016, 28(10): 1679-1687 (in Chinese) (孙丰, 秦开怀, 孙伟, 等. 基于区域合并的图像显著性检测[J]. 计算机辅助设计与图形学学报, 2016, 28(10): 1679-1687)
[8] Gao Sihan, Zhang Lei, Li Chenglong, et al. Image saliency detection via graph representation with fusing low-level and high-level features[J]. Journal of Computer-Aided Design & Computer Graphics, 2016, 28(3): 420-426 (in Chinese) (高思晗, 张雷, 李成龙, 等. 融合低层和高层特征图表示的图像显著性检测算法[J]. 计算机辅助设计与图形学学报, 2016, 28(3): 420-426)
[9] You Q Z, Jin H L, Wang Z W, et al. Image captioning with semantic attention[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2016: 4651-4659
[10] Gu J X, Cai J F, Wang G, et al. Stack-captioning: coarse-to-fine learning for image captioning[C]//Proceedings of AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 6837-6844
[11] Gan Z, Gan C, He X D, et al. Semantic compositional networks for visual captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2017: 5630-5639
[12] Mao J H, Xu W, Yang Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN)[OL]. [2018-08-27]. https://arxiv.org/abs/1412.6632
[13] Wu Q, Shen C H, Liu L Q, et al. What value do explicit high level concepts have in vision to language problems?[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2016: 203-212
[14] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2015: 3128-3137
[15] Johnson J, Karpathy A, Li F F. DenseCap: fully convolutional localization networks for dense captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2016: 4565-4574
[16] Rensink R A. The dynamic representation of scenes[J]. Visual Cognition, 2000, 7(1-3): 17-42
[17] Liu C X, Mao J H, Sha F, et al. Attention correctness in neural image captioning[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 4176-4182
[18] Liu C, Sun F C, Wang C H, et al. MAT: a multimodal attentive translator for image captioning[OL]. [2018-08-27]. https://arxiv.org/abs/1702.05658
[19] Li L H, Tang S, Deng L X, et al. Image caption with global-local attention[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 4133-4139
[20] Yang Z L, Yuan Y, Wu Y X, et al. Review networks for caption generation[C]//Proceedings of Advances in Neural Information Processing Systems. New York: Curran Associates, 2016: 2369-2377
[21] Cavana R Y. Modeling the environment: an introduction to system dynamics models of environmental systems[J]. System Dynamics Review, 2003, 19(2): 171-173
[22] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780
[23] Zaremba W, Sutskever I. Learning to execute[OL]. [2018-08-27]. https://arxiv.org/abs/1410.4615
[24] Young P, Lai A, Hodosh M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78
[25] Hodosh M, Young P, Hockenmaier J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899
[26] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[C]//Proceedings of European Conference on Computer Vision. Heidelberg: Springer, 2014: 740-755
[27] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL Press, 2002: 311-318
[28] Brown P F, Desouza P V, Mercer R L, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467-479
[29] Lavie A, Agarwal A. METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments[C]//Proceedings of the 2nd Workshop on Statistical Machine Translation. Stroudsburg: ACL Press, 2007: 228-231
[30] Chen X L, Zitnick C L. Mind's eye: a recurrent visual representation for image caption generation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2015: 2422-2431
[31] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2015, 1: 3128-3137
[32] Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models[C]//Proceedings of the 31st International Conference on Machine Learning. Madison: Omnipress, 2014, 32: II-595-II-603
[33] Wang M S, Song L, Yang X K, et al. A parallel-fusion RNN-LSTM architecture for image caption generation[C]//Proceedings of the IEEE International Conference on Image Processing. Los Alamitos: IEEE Computer Society Press, 2016: 4448-4452
[34] Wang C, Yang H J, Bartz C, et al. Image captioning with deep bidirectional LSTMs[C]//Proceedings of the 24th ACM International Conference on Multimedia. New York: ACM Press, 2016: 988-997
[35] Tan Y H, Chan C S. phi-LSTM: a phrase-based hierarchical LSTM model for image captioning[C]//Proceedings of Asian Conference on Computer Vision. Heidelberg: Springer, 2016: 101-117
[36] Fu K, Jin J Q, Cui R P, et al. Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39: 2321-2334