Feature map slice for semantic segmentation (结合特征图切分的图像语义分割)
  • Authors: 曹峰梅 (Cao Fengmei); 田海杰 (Tian Haijie); 付君 (Fu Jun); 刘静 (Liu Jing)
  • Affiliations: School of Optics and Photonics, Beijing Institute of Technology; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
  • Keywords: deep learning; fully convolutional neural networks; semantic segmentation; scene parsing; feature slice; multiple scales; feature reuse
  • Journal: Journal of Image and Graphics (中国图象图形学报); journal code: ZGTB
  • Publication date: 2019-03-16
  • Year: 2019
  • Issue: Vol. 24, No. 3 (cumulative No. 275)
  • Pages: 144-153 (10 pages)
  • CN: 11-3758/TB
  • Record ID: ZGTB201903013
  • Funding: National Natural Science Foundation of China (61472422)
  • Language: Chinese
Abstract
Objective: Image semantic segmentation based on fully convolutional neural networks has become the mainstream research direction in this field. However, repeated downsampling of feature maps in this framework progressively lowers the image resolution, so small targets are lost, edges become coarse, and segmentation quality suffers. To address or mitigate this problem, we propose an image semantic segmentation method based on feature map slicing. Method: The method comprises two operations: slicing the middle-layer feature map and a corresponding feature extraction step. The slicing module divides the middle-layer feature map into several equal parts and upsamples each part to the size of the original feature map, increasing the resolution of every sliced region. Each sliced feature map then passes through a weight-shared feature extraction module, whose multi-scale convolutions and attention mechanism effectively exploit the contextual and discriminative information of each slice, directing attention to small objects in local regions and improving their discriminability. The extracted features are then fused with the network's original output, so middle-layer features are reused more efficiently, with clear improvements in small-object recognition and localization, segmentation edge refinement, and the network's semantic discriminability. Result: Validation experiments on two urban road datasets, CamVid and GATECH, demonstrate the effectiveness of the method: the mean intersection-over-union reaches 66.3% on CamVid and 52.6% on GATECH. Conclusion: The slicing-based segmentation method makes better use of the spatial distribution of image regions, strengthens the network's ability to determine semantic categories at different spatial positions and its attention to small objects, provides more effective contextual and global information, improves the network's discrimination of small targets, and enhances overall segmentation performance.
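The slicing step can be illustrated with a minimal PyTorch-style sketch; the 2×2 grid, bilinear interpolation, and the helper name `slice_and_upsample` are illustrative assumptions, as the abstract does not specify them:

```python
import torch
import torch.nn.functional as F

def slice_and_upsample(feat: torch.Tensor, grid: int = 2) -> list:
    """Split a feature map into grid x grid equal tiles, then upsample each
    tile back to the original spatial size so every subregion gains resolution."""
    n, c, h, w = feat.shape
    th, tw = h // grid, w // grid                      # tile height / width
    tiles = []
    for i in range(grid):
        for j in range(grid):
            tile = feat[:, :, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            tiles.append(F.interpolate(tile, size=(h, w),
                                       mode='bilinear', align_corners=False))
    return tiles                                       # grid*grid tensors, each (n, c, h, w)
```

After this step, a small object occupying a few pixels of the original feature map spans a proportionally larger area of its tile, which is what lets the shared extraction module attend to it.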
Objective: Deep convolutional neural networks have recently shown outstanding performance in object recognition and have become the first choice for dense classification problems such as semantic segmentation. Fully convolutional network (FCN) based methods are now the main research direction in image semantic segmentation. However, repeated downsampling operations in these methods, such as pooling or strided convolution, significantly reduce the initial image resolution, resulting in poor object delineation, loss of small targets, and weak segmentation output. Although several studies have addressed this problem in recent years, how to handle it effectively remains an open question and deserves further attention. This study proposes a feature map slice module for semantic segmentation to address this problem.

Method: The proposed method consists of two parts: slicing of the middle-layer feature map and a corresponding feature extraction network. The slice module divides the middle-layer feature map into several small cubes, and each cube is upsampled to the resolution of the original feature map, which enlarges the small targets in the local area. Each cube corresponds to a subregion of the original feature map; after upsampling, the objects in these subregions are enlarged, so small objects that are difficult to detect in the entire feature map can be treated as relatively large objects during feature extraction. A weight-shared feature extraction network is designed for the sliced feature maps. It adopts multiple convolution operations with different kernel sizes to extract feature information at different scales. For each input, the channel dimension is reduced by half to save memory, and dilated convolution is adopted to enlarge the receptive field. The different feature maps (obtained by the different convolution operations) are then concatenated, followed by a channel-attention operation. By combining multi-scale convolution with the attention mechanism, the network extracts the semantic category information of each subregion and effectively provides contextual, global, and discriminative information for each slice; accordingly, it can focus on small objects in local areas and improve their discriminability. After every cube passes through the feature extraction network, the extracted features are assembled at their corresponding positions into a mosaic feature map. The network's original output is upsampled and fused with the mosaic feature map by an element-wise max operation, so that middle-layer features are reused efficiently. To exploit middle-layer feature information further, this module is introduced at multiple scales, which strengthens the extraction of small-target characteristics and spatial information in local areas, utilizes semantic information at different scales, and yields clear improvements in small-target feature extraction, segmentation edge refinement, and network discrimination.
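The abstract names the key ingredients of the extraction branch (halved channel dimension, dilated convolutions with different kernel sizes, concatenation, channel attention) and the element-wise max fusion with the original output. A hedged sketch follows; the specific kernel sizes (3 and 5), dilation rate (2), and SE-style attention are assumptions, since the exact values are not given in this record:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceFeatureExtractor(nn.Module):
    """Weight-shared branch applied to every upsampled tile."""
    def __init__(self, in_ch: int):
        super().__init__()
        mid = in_ch // 2                               # channels halved to save memory
        self.reduce = nn.Conv2d(in_ch, mid, kernel_size=1)
        self.conv3 = nn.Conv2d(mid, mid, kernel_size=3, padding=2, dilation=2)
        self.conv5 = nn.Conv2d(mid, mid, kernel_size=5, padding=4, dilation=2)
        self.attn = nn.Sequential(                     # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * mid, 2 * mid, kernel_size=1),
            nn.Sigmoid())
        self.project = nn.Conv2d(2 * mid, in_ch, kernel_size=1)

    def forward(self, tile: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.reduce(tile))
        x = torch.cat([F.relu(self.conv3(x)),          # multi-scale dilated convolutions
                       F.relu(self.conv5(x))], dim=1)
        x = x * self.attn(x)                           # reweight channels
        return self.project(x)

def mosaic_and_fuse(tile_feats: list, original: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Reassemble processed tiles at their original positions, then fuse the
    mosaic with the network's original output by element-wise max."""
    n, c, h, w = original.shape
    th, tw = h // grid, w // grid
    mosaic = original.new_zeros(n, c, h, w)
    for idx, t in enumerate(tile_feats):
        i, j = divmod(idx, grid)
        mosaic[:, :, i * th:(i + 1) * th, j * tw:(j + 1) * tw] = \
            F.interpolate(t, size=(th, tw), mode='bilinear', align_corners=False)
    return torch.maximum(mosaic, original)
```

Because the extractor's weights are shared across tiles, the parameter count is independent of the grid size; only the forward cost grows with the number of slices.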
Result: The proposed method is verified on two urban scene-understanding datasets, CamVid and GATECH. Both datasets contain common urban scene objects, such as buildings, cars, and cyclists. Several ablation experiments are conducted on the two datasets and strong performance is achieved; in particular, mean intersection-over-union scores of 66.3% and 52.6% are obtained on CamVid and GATECH, respectively.

Conclusion: The proposed method exploits the spatial distribution information of images, enhances the network's ability to determine the semantic categories of different spatial locations, pays greater attention to small target objects, and provides effective contextual and global information. Because different resolutions provide rich scale information, the module is applied at multiple resolutions of the network. The method thus reuses middle-layer feature information, improves the network's ability to discriminate small target objects, and enhances the overall segmentation performance of the network.
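For reference, the mean intersection-over-union metric behind the 66.3% and 52.6% figures is conventionally computed from a confusion matrix; the sketch below follows that common practice, with the ignore-label convention as an assumption since this record does not state the paper's exact evaluation protocol:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_label: int = 255) -> float:
    """Mean IoU over classes, from a confusion matrix accumulated over all
    labeled pixels (classes absent from both maps are skipped)."""
    mask = gt != ignore_label
    hist = np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())
```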
