Wide-baseline 3D scene reconstruction with semantic prior fusion and progressive depth optimization
  • English title: Wide-baseline 3D reconstruction with semantic prior fusion and progressive depth optimization
  • Authors: Yao Tuozhong; Zuo Wenhui; An Peng; Song Jiatao
  • Affiliations: School of Electronic and Information Engineering, Ningbo University of Technology; College of Information Science and Electronic Engineering, Zhejiang University
  • Keywords: wide-baseline matching; dense 3D scene reconstruction; high-level semantic prior; superpixel merging; progressive optimization
  • Journal: Journal of Image and Graphics (中国图象图形学报); CNKI code ZGTB
  • Publication date: 2019-04-16
  • Year/Issue: 2019, Vol. 24, No. 276, Issue 04
  • Funding: National Natural Science Foundation of China Young Scientists Fund (61502256); Zhejiang Provincial Key R&D Program (2018C01086); Ningbo Natural Science Foundation (2018A610160)
  • Language: Chinese
  • Article ID: ZGTB201904011
  • Pages: 115-126 (12 pages)
  • CN: 11-3758/TB
Abstract
Objective: Vision-based 3D scene reconstruction has been widely applied in fields such as robot navigation, aerial map building, and augmented reality. However, large camera motion breaks traditional reconstruction methods that rely on narrow-baseline constraints. Method: For wide-baseline environments, this paper proposes a 3D scene reconstruction algorithm that fuses high-level semantic priors. Built on a Markov random field (MRF) model, the method combines multiple superpixel features, including appearance, co-linearity, co-planarity, and depth, to jointly infer the 3D position and orientation of each superpixel across views, yielding an initial wide-baseline 3D reconstruction. At the same time, high-level semantic priors are applied recursively to merge superpixels with similar depths, progressively refining the scene depth and the 3D model. Result: Experiments show that in a variety of wide-baseline environments, and especially under severe camera motion, the proposed method achieves more stable and accurate depth estimation and 3D scene reconstruction than traditional methods. Conclusion: This paper shows how, under wide-baseline conditions, multiple image features can be combined with triangulation-based geometric features to build an accurate 3D scene model. An MRF model jointly infers the 3D position and orientation of superpixels across views, and high-level semantic priors guide the reconstruction process. In addition, a recursive framework progressively optimizes the scene depth. Experimental results show that the method produces 3D scene models closer to the true scene than traditional methods across different wide-baseline settings.
        Objective: As a research hotspot in computer vision, 3D scene reconstruction has been widely used in fields such as unmanned driving, digital entertainment, aeronautics, and astronautics. Traditional scene reconstruction methods iteratively estimate the camera pose and sparse or dense 3D scene models from multi-view image sequences via structure from motion. However, large motion between cameras, which often occurs in practice, causes occlusion and geometric deformation and significantly increases the difficulty of image matching. Most previous work, sparse and dense alike, is effective only in narrow-baseline environments; wide-baseline 3D reconstruction is a considerably harder problem. It arises in many applications, such as robot navigation, aerial map building, and augmented reality, and is therefore worth studying. In recent years, several semantic-fusion-based solutions have been proposed and have become a developing trend because they are more consistent with human cognition of the scene. Method: A novel wide-baseline dense 3D scene reconstruction algorithm is proposed that integrates the structural attributes of outdoor scenes with high-level semantic priors. The algorithm has the following characteristics. 1) Superpixels, which cover larger areas than pixels, serve as the geometric primitives for image representation, with three advantages. First, they make region correlation more robust in weakly textured environments. Second, they follow the actual object boundaries and depth discontinuities in the scene. Third, they reduce the number of graph nodes in the Markov random field (MRF) model, markedly lowering the computational cost of solving the energy minimization problem.
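To make the superpixel-as-plane idea concrete, here is a minimal sketch of how a planar parameter can map each pixel's viewing ray to a depth. The Make3D-style parameterization used here (a plane encoded by a vector alpha with alpha·x = 1, so a unit ray r meets it at depth 1/(alpha·r)) and the relative-error unary cost are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def pixel_depth(alpha, ray):
    """Depth at which a unit viewing ray meets the plane {x : alpha . x = 1}.

    alpha encodes a superpixel's plane (normal direction scaled by inverse
    distance); the ray-plane intersection is at d = 1 / (alpha . ray)."""
    denom = float(np.dot(alpha, ray))
    if abs(denom) < 1e-9:
        return float("inf")  # ray (nearly) parallel to the plane
    return 1.0 / denom

def unary_cost(alpha, rays, ref_depths):
    """Relative depth error of a candidate plane against reference depths
    (e.g. triangulated depths), averaged over sampled pixels."""
    est = np.array([pixel_depth(alpha, r) for r in rays])
    return float(np.mean(np.abs(est - ref_depths) / ref_depths))

# A fronto-parallel plane at depth 2 is alpha = (0, 0, 0.5):
alpha = np.array([0.0, 0.0, 0.5])
print(pixel_depth(alpha, np.array([0.0, 0.0, 1.0])))  # 2.0
```

Minimizing such a cost over alpha for every superpixel, coupled through pairwise terms, is the kind of energy the MRF formulation above assembles.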
2) An MRF model estimates the 3D position and orientation of each superpixel across views on the basis of multiple low-level features. In the MRF energy function, the unary potential models the planar parameter of each superpixel and penalizes the relative error between estimated and ground-truth depths. The pairwise potential models three geometric relations between adjacent superpixels: co-linearity, connectivity, and co-planarity. In addition, a new potential models the relative error between triangulated and estimated depths. 3) Within an iterative framework, the depth and 3D model of the scene are progressively optimized by merging superpixels with similar depths according to high-level semantic priors. When adjacent superpixels have similar depths, they are merged into a larger superpixel, further reducing the possibility of spurious depth discontinuities. The segmentation obtained after merging is used in the next iteration of MRF-based depth estimation. The MAP inference of the MRF model can be solved efficiently by classical linear programming. Result: Several classic wide-baseline image sequences, including "Stanford Ⅰ, Ⅱ, Ⅲ, and Ⅳ", "Merton College Ⅲ", "University Library", and "Wadham College", are used to evaluate the proposed wide-baseline 3D scene reconstruction algorithm. Experimental results demonstrate that it estimates large camera motion more accurately than the classic method and recovers more robust and accurate depth estimates and 3D scene models. The algorithm works effectively in both narrow- and wide-baseline environments and is especially suitable for large-scale scene reconstruction. Conclusion: This study shows how to recover an accurate 3D scene model from multiple image features and triangulated geometric features in wide-baseline environments.
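One pass of the recursive merging step described above can be sketched as follows. The relative-depth threshold `tau` and the use of a single semantic label per superpixel (standing in for the high-level semantic prior) are illustrative assumptions, not the paper's exact merging criterion:

```python
def merge_similar_depth(depths, labels, adjacency, tau=0.1):
    """One merging pass: union adjacent superpixels whose mean depths
    differ by less than a relative threshold tau, and only when they
    share a semantic label (a stand-in for the high-level prior)."""
    parent = list(range(len(depths)))

    def find(i):  # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in adjacency:
        if labels[i] != labels[j]:
            continue  # semantic prior vetoes the merge
        if abs(depths[i] - depths[j]) / max(depths[i], depths[j]) < tau:
            parent[find(i)] = find(j)  # fuse into one larger superpixel

    return [find(i) for i in range(len(depths))]

# Two facade superpixels at similar depth merge; the distant sky does not:
groups = merge_similar_depth([2.0, 2.05, 5.0],
                             ["building", "building", "sky"],
                             [(0, 1), (1, 2)])
print(groups[0] == groups[1], groups[0] == groups[2])  # True False
```

Feeding the coarsened segmentation back into the next round of MRF depth estimation is what makes the optimization progressive.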
An MRF model estimates the planar parameters of superpixels in different views, and a high-level semantic prior guides the merging of superpixels with similar depths. Furthermore, an iterative framework progressively optimizes the scene depth and the 3D scene model. Experimental results show that the proposed algorithm achieves more accurate 3D scene models than the classic algorithm on different wide-baseline image datasets.