Multi-image object semantic segmentation by fusing segmentation priors
  • Authors: Liao Xuan; Miao Jun; Chu Jun; Zhang Guimei (廖旋; 缪君; 储珺; 张桂梅)
  • Affiliations: Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University; School of Aeronautical Manufacturing Engineering, Nanchang Hangkong University; Key Laboratory of Lunar and Deep Space Exploration, National Astronomical Observatories, Chinese Academy of Sciences
  • Keywords: multi-image; object segmentation; deep learning; convolutional neural networks (CNN); segmentation prior; conditional random field (CRF)
  • Journal: Journal of Image and Graphics (中国图象图形学报; journal code ZGTB)
  • Publication date: 2019-06-16
  • Year: 2019
  • Volume/Issue: Vol. 24, No. 278 (Issue 06)
  • Pages: 48-59 (12 pages)
  • Article ID: ZGTB201906005
  • CN: 11-3758/TB
  • Funding: National Natural Science Foundation of China (61661036, 61663031, 61462065); Open Fund of the Key Laboratory of Lunar and Deep Space Exploration, Chinese Academy of Sciences (LDSE201705); Open Fund of the Key Laboratory of Nondestructive Testing, Ministry of Education (ZD201529003)
  • Language: Chinese
Abstract
Objective In object segmentation of sequential or multi-view images, traditional co-segmentation algorithms are not robust for complex multi-image segmentation, while existing deep learning algorithms tend to produce segmentation errors and inconsistent results when the foreground and background are strongly ambiguous. We therefore propose a deep-feature-based multi-image segmentation algorithm that fuses segmentation priors. Method First, to help the model learn the detail features of multi-view images of complex scenes, the PSPNet-50 model is improved by fusing the high-resolution detail features of the shallow layers, reducing the effect that the loss of spatial information with increasing network depth has on segmentation edge details. Then, segmentation priors for one or two images are obtained with an interactive segmentation algorithm, and these few priors are fused into the new model; re-training the network resolves foreground/background ambiguity and enforces segmentation consistency across the images. Finally, a fully connected conditional random field is constructed to couple the recognition ability of the deep convolutional neural network with the localization accuracy of fully connected CRF optimization, handling boundary localization better. Result We ran segmentation tests on multi-image sets from public datasets. The results show that the algorithm not only segments object classes pre-trained on large amounts of data more accurately, but also effectively avoids ambiguous region segmentation for classes that were never pre-trained. On both simpler image sets, where foreground and background differ clearly, and more complex sets, where foreground and background colors are similar, the average pixel accuracy (PA) and intersection over union (IOU) both exceed 95%. Conclusion The algorithm is robust for multi-image segmentation across various scenes; by fusing a small number of priors, the model distinguishes objects from background more effectively and achieves consistent segmentation of the target.
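For reference, the PA and IOU figures quoted above can be computed as in the following minimal numpy sketch; the binary foreground/background setting and the array names pred and gt are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def intersection_over_union(pred: np.ndarray, gt: np.ndarray, cls: int = 1) -> float:
    """IOU for one class: |pred AND gt| / |pred OR gt| over that class's pixels."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 1.0
```

The paper reports these metrics averaged over each image set; averaging the per-image values of the two functions above would correspond to that protocol.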
Objective Object segmentation from multiple images involves locating the positions and extents of common target objects in a scene, as captured in a sequential image set or in multi-view images. It supports various computer vision tasks, such as object detection and tracking, scene understanding, and 3D reconstruction. Early approaches treated object segmentation as histogram matching of color values and applied only to image pairs containing the same or similar objects. Later, object co-segmentation methods were introduced. Most of them take the MRF model as the basic framework and build a cost function consisting of the energy within each image and the energy between images, computed from features based on the gray or color values of pixels; minimizing this cost function yields a consistent segmentation. However, when the foreground and background colors in the images are similar, co-segmentation struggles to produce object segmentations with consistent regions. In recent years, with the development of deep learning, methods based on various deep models have been proposed. Some, such as the fully convolutional network, use convolutional neural networks to extract high-level semantic image features and achieve end-to-end, pixel-level classification, obtaining better precision than traditional methods. Compared with traditional methods, deep learning methods learn appropriate features for individual classes automatically, without manual feature selection and tuning. Exact segmentation of a single image requires combining spatial-domain information at multiple levels; multi-image segmentation therefore demands not only the fine-grained local accuracy of single-image segmentation but also a balance of local and global information across the images. When the foreground and background contain ambiguous regions, or when insufficient prior information about the objects is available, most deep learning methods tend to produce erroneous and inconsistent segmentations on sequential image sets or multi-view images.

Method In this study, we propose a multi-image segmentation method based on deep feature extraction. The method builds on the PSPNet-50 model, in which a residual network extracts features through the first 50 layers; these features are passed to a pyramid pooling module whose pooling layers use differently sized filters, the features of the different levels are fused, and, after a convolutional layer and an up-convolutional layer, the initial end-to-end outputs are obtained. To make the model learn the detail features of multi-view images of complex scenes more thoroughly, we fuse the output features of the first and fifth parts of the network. The PSPNet-50 model is thus improved by integrating the high-resolution details of the shallow layers, which also reduces the effect that the loss of spatial information with increasing depth has on segmentation edge details (a minimal sketch of this fusion follows below). In the training phase, the improved network is first pre-trained on the ADE20K dataset, giving the model strong robustness and generalization after this large-scale training. Afterward, prior segmentations of the object are obtained for one or two images with an interactive segmentation approach, and this small amount of prior segmentation is fused into the new model.
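The abstract does not specify the exact layers or channel widths used for the shallow/deep fusion, so the following PyTorch sketch only illustrates the general idea: compress the deep pyramid-pooled features, upsample them to the shallow feature resolution, concatenate, and classify per pixel. All module names and sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowDeepFusionHead(nn.Module):
    """Illustrative head fusing high-resolution shallow features with
    upsampled deep features before the pixel-wise classifier."""

    def __init__(self, shallow_ch: int = 64, deep_ch: int = 512, n_classes: int = 2):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, 128, kernel_size=1)  # compress deep features
        self.fuse = nn.Sequential(
            nn.Conv2d(shallow_ch + 128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(128, n_classes, kernel_size=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample low-resolution deep features to the shallow feature size,
        # concatenate along channels, and predict per-pixel class logits.
        deep = F.interpolate(self.reduce(deep), size=shallow.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.classifier(self.fuse(torch.cat([shallow, deep], dim=1)))
```

In the paper's terms, `shallow` would be the output of the first part of the ResNet-50 backbone and `deep` the output of the fifth part after pyramid pooling.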
The network is then re-trained to resolve the ambiguous segmentation between foreground and background and the inconsistent segmentation across multiple images. We analyze the relationship between the number of re-training iterations and segmentation accuracy over a large number of experiments to determine the optimal iteration count. Finally, we construct a fully connected conditional random field, coupling the recognition ability of the deep convolutional neural network with the accurate localization ability of fully connected CRF optimization; the object region is located effectively and the object edges are detected clearly (a usage sketch of this refinement follows at the end of this abstract).

Result We evaluate our method on multi-image sets from various public datasets showing outdoor buildings and indoor objects, and compare our results with those of other deep learning methods, such as fully convolutional networks (FCN) and the pyramid scene parsing network (PSPNet). Experiments on the multi-view "Valbonne" and "Box" sets show that our algorithm exactly segments the object region for re-trained classes while effectively avoiding ambiguous region segmentation for object classes that were not re-trained. For quantitative evaluation, we compute the commonly used average pixel accuracy (PA) and intersection over union (IOU) to assess segmentation accuracy. Our algorithm attains satisfactory scores not only on complex scene image sets whose foreground and background are similar but also on simple image sets whose foreground and background differ clearly. For example, on the "Valbonne" set, the PA and IOU of our result are 0.9683 and 0.9469, respectively, whereas FCN obtains 0.7027 and 0.6942 and PSPNet obtains 0.8509 and 0.8240; our method thus scores roughly 25 percentage points higher than FCN and more than 10 points higher than PSPNet. On the "Box" set, our method achieves a PA of 0.9946 and an IOU of 0.9577, whereas FCN and PSPNet cannot find the real region of the object because the "Box" class is not among their training classes. Similar improvements appear on the other datasets, and the average PA and IOU of our method exceed 0.95.

Conclusion Experimental results demonstrate that our algorithm is robust across various scenes and achieves consistent segmentation in multi-view images. Integrating a small amount of prior information helps predict the pixel-level object region accurately and makes the model distinguish object regions from the background effectively. The proposed approach consistently outperforms competing methods for object classes both included in and absent from the training classes.
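For the fully connected CRF refinement described in the Method, a widely used implementation of Krähenbühl and Koltun's efficient inference is the pydensecrf package. The sketch below shows its typical usage on a CNN softmax output; the pairwise kernel parameters (sxy, srgb, compat) and the helper name refine_with_dense_crf are illustrative choices, not values from the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_dense_crf(image: np.ndarray, probs: np.ndarray, iters: int = 5):
    """image: (H, W, 3) uint8 RGB; probs: (n_classes, H, W) CNN softmax output."""
    n_classes, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(probs))  # unaries = -log(softmax)
    d.addPairwiseGaussian(sxy=3, compat=3)       # smoothness (location) kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,      # appearance (location + color) kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)                       # mean-field inference
    return np.argmax(q, axis=0).reshape(h, w)    # refined per-pixel labels
```

Coupling the CNN's class scores (as unaries) with these image-dependent pairwise potentials is what sharpens the object boundaries relative to the raw network output.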
