Improved Algorithm for Image Attention Annotation Combined with Spatial Features
  • Authors: Xu Shoukun; Zhou Jia; Li Ning; Shi Lin
  • Affiliations: School of Information Science & Engineering, Changzhou University; School of Mathematics & Physics, Changzhou University; Fujian Provincial Key Laboratory of Information Processing & Intelligent Control (Minjiang University)
  • Keywords: visual attention; image annotation; spatial feature
  • Journal: Application Research of Computers (计算机应用研究)
  • Journal code: JSYJ
  • Online publication date: 2018-02-08
  • Year: 2019
  • Issue: v.36, No.327 (No.01)
  • Fund: Open project of the Fujian Provincial Key Laboratory of Information Processing & Intelligent Control, Minjiang University (MJUKF201740)
  • Language: Chinese
  • Pages: 294-297+321 (5 pages)
  • CN: 51-1196/TP
  • Record ID: JSYJ201901067
Abstract
Aiming at insufficient feature selection when combining image annotation with the attention mechanism, and at the insufficient weight given to spatial features during prediction, this paper proposes an attention-based image annotation method that incorporates spatial features. First, image features are extracted with a convolutional neural network and the feature regions are matched to the text annotation sequence. Then the attention mechanism weights the annotation words, and a loss function that incorporates spatial features yields image annotation based on spatial-feature attention. Finally, the method is validated on the Flickr30k and MS-COCO datasets, and visualizations show how the model automatically learns salient regions and generates the corresponding output word sequences. Experimental results show that the method extracts attention regions well, produces annotations for them, and achieves better annotation results than the comparison models.
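The attention step described in the abstract can be illustrated with a minimal sketch of soft (additive) attention over CNN region features: a score per region is computed from the region feature and the decoder state, normalized into weights, and used to form the context vector that drives word prediction. This is an illustrative assumption, not the paper's actual implementation; all names, dimensions, and weight shapes are hypothetical.

```python
import numpy as np

# Illustrative soft-attention step (assumed, not the paper's code):
# CNN region features are scored against the decoder hidden state,
# the scores are normalized into attention weights, and the weighted
# sum of region features forms the context vector for word prediction.

rng = np.random.default_rng(0)

L, D, H = 196, 512, 256   # 14x14 = 196 regions, feature dim, hidden dim (assumed)

def softmax(x):
    x = x - x.max()       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def soft_attention(features, h, W_a, W_h, w):
    """features: (L, D) region features; h: (H,) decoder hidden state.
    Returns attention weights (L,) and context vector (D,)."""
    # Additive (Bahdanau-style) attention score, one scalar per region.
    scores = np.tanh(features @ W_a + h @ W_h) @ w   # (L,)
    alpha = softmax(scores)                          # attention weights, sum to 1
    context = alpha @ features                       # (D,) weighted sum of regions
    return alpha, context

features = rng.standard_normal((L, D))
h = rng.standard_normal(H)
W_a = rng.standard_normal((D, 64)) * 0.1
W_h = rng.standard_normal((H, 64)) * 0.1
w = rng.standard_normal(64) * 0.1

alpha, context = soft_attention(features, h, W_a, W_h, w)
print(alpha.sum(), context.shape)   # weights sum to 1; context has shape (512,)
```

In the full model, the context vector would be fed into the recurrent decoder at each time step; visualizing `alpha` over the 14x14 grid is what produces the salient-region maps mentioned in the abstract.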
References
[1] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[J]. Computer Science, 2014, 40(12): 4751-4759.
[2] Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models[C]//Proc of the 31st International Conference on Learning Representations. Piscataway, NJ: IEEE Press, 2014: 595-603.
[3] Vinyals O, Toshev A, Bengio S, et al. Show and tell: a neural image caption generator[J]. Computer Science, 2015, 36(7): 3156-3164.
[4] Karpathy A, Li Feifei. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Trans on Pattern Analysis & Machine Intelligence, 2014, 39(4): 664-676.
[5] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention[J]. Computer Science, 2015, 58(12): 2048-2057.
[6] Yang Zhilin, Yuan Ye, Wu Yuexin, et al. Review networks for caption generation[C]//Advances in Neural Information Processing Systems. Vancouver: NIPS, 2016: 2361-2369.
[7] You Quanzeng, Jin Hailin, Wang Zhaowen, et al. Image captioning with semantic attention[J]. Computer Science, 2016, 42(13): 4651-4659.
[8] Wu Qi, Shen Chunhua, Liu Lingqiao, et al. What value do explicit high level concepts have in vision to language problems?[J]. Computer Science, 2016, 12(1): 1640-1649.
[9] Lu Jiasen, Xiong Caiming, Parikh D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[J]. International Journal of Computer Vision, 2016, 115(3): 211-252.
[10] Zhou Luowei, Xu Chenliang, Koch P, et al. Watch what you just said: image captioning with text-conditional attention[J]. IEEE Trans on Image Processing, 2016, 25(8): 3919-3930.
[11] Zhang Chong. Text classification based on attention-based LSTM model[D]. Nanjing: Nanjing University, 2016. (in Chinese)
[12] Yang Gelan, Deng Xiaojun, Liu Cong. Facial expression recognition model based on deep spatiotemporal convolutional neural networks[J]. Journal of Central South University: Science and Technology Edition, 2016, 47(7): 2311-2319. (in Chinese)
[13] Fu Kun, Jin Junqi, Cui Runpeng, et al. Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts[J]. IEEE Trans on Pattern Analysis & Machine Intelligence, 2015, 39(12): 2321-2334.
[14] Cho K, Merrienboer B V, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computer Science, 2014, 45(18): 4913-4921.
[15] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems. Vancouver: NIPS, 2014: 3104-3112.
[16] Liu Chenxi, Mao Junhua, Sha Fei, et al. Attention correctness in neural image captioning[C]//Proc of the AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2017: 4176-4182.
[17] Ke Xiao, Li Shaozi, Cao Donglin. Automatic image annotation based on relevant visual keywords[J]. Journal of Computer Research and Development, 2012, 49(4): 846-855. (in Chinese)
[18] Denkowski M, Lavie A. METEOR universal: language specific translation evaluation for any target language[C]//Proc of Workshop on Statistical Machine Translation. Piscataway, NJ: IEEE Press, 2014: 376-380.
[19] Yao Ting, Pan Yingwei, Li Yehao, et al. Boosting image captioning with attributes[J]. ACM Trans on Graphics, 2016, 27(3): 1423-1436.
[20] Vedantam R, Zitnick C L, Parikh D. CIDEr: consensus-based image description evaluation[J]. Computer Science, 2014, 9(4): 4566-4575.
[21] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proc of Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2002: 311-318.
[22] Zhang Jianming, Lin Zhe, Brandt J, et al. Top-down neural attention by excitation backprop[C]//Proc of European Conference on Computer Vision. Berlin: Springer International Publishing, 2016: 543-559.
[23] Li Jing. Image annotation based on multi-feature[D]. Wuhan: Wuhan University of Technology, 2013. (in Chinese)
[24] Teng Fei, Zheng Chaomei, Li Wen. Multidimensional topic model for oriented sentiment analysis based on long short-term memory[J]. Journal of Computer Applications, 2016, 36(8): 2252-2256. (in Chinese)
[25] Liu Jie. The implementation of the LSTM neural network on the Android platform[D]. Tianjin: Nankai University, 2015. (in Chinese)
[26] Mao Junhua, Xu Wei, Yang Yi, et al. Deep captioning with multimodal recurrent neural networks (m-RNN)[EB/OL]. (2014-12-20)[2015-06-11]. https://arxiv.org/abs/1412.6632.
