Monocular Visual Odometry Based on Recurrent Convolutional Neural Networks
  • Title: Monocular Visual Odometry Based on Recurrent Convolutional Neural Networks
  • Authors: CHEN Zonghai; HONG Yang; WANG Jikai; GE Zhenhua
  • Affiliation: Department of Automation, University of Science and Technology of China
  • Keywords: convolutional LSTM (long short-term memory); RCNN (recurrent convolutional neural network); monocular visual odometry; unsupervised learning
  • Journal: Robot (机器人; CNKI journal code JQRR)
  • Online publication date: 2018-09-10
  • Year: 2019; Volume: 41; Issue: 02
  • Pages: 13-21 (9 pages)
  • Funding: National Natural Science Foundation of China (61375079)
  • Language: Chinese
  • CN: 21-1137/TP
  • Article ID: JQRR201902002
Abstract
        A monocular visual odometry method based on a convolutional long short-term memory (LSTM) network and a convolutional neural network (CNN) is proposed, named LSTMVO (LSTM visual odometry). LSTMVO uses an unsupervised end-to-end deep learning framework to simultaneously estimate the 6-DoF (degree-of-freedom) pose of a monocular camera and the scene depth. The overall framework consists of a pose estimation network and a depth estimation network. The pose estimation network is a deep recurrent convolutional neural network (RCNN) that performs monocular pose estimation end to end, composed of CNN-based feature extraction and recurrent neural network (RNN) based temporal modeling. The depth estimation network generates dense depth maps, mainly based on an encoder-decoder architecture. A new loss function is also proposed for network training; it consists of a temporal loss over the image sequence, a depth smoothness loss, and a forward-backward consistency loss. Experimental results on the KITTI dataset show that, trained on raw monocular RGB images, LSTMVO outperforms existing mainstream monocular visual odometry methods in both pose estimation accuracy and depth estimation accuracy, verifying the effectiveness of the proposed deep learning framework.
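To make the pose-network description concrete, below is a minimal PyTorch sketch of the idea: CNN-based feature extraction feeding a convolutional LSTM for temporal modeling, followed by 6-DoF pose regression. The layer sizes, module names, and pooling head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: gate activations are computed with
    convolutions instead of fully connected layers, so the recurrent
    state keeps its spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class PoseRCNN(nn.Module):
    """Toy CNN encoder -> ConvLSTM -> 6-DoF pose (3 translation + 3 rotation).
    Hypothetical architecture, not the paper's exact layer configuration."""
    def __init__(self, hid_ch=128):
        super().__init__()
        # Encoder over a pair of consecutive RGB frames (6 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, hid_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.cell = ConvLSTMCell(hid_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 6, 1)  # per-location pose logits

    def forward(self, frames):
        # frames: (batch, time, 6, H, W); each step stacks two RGB frames.
        feats = [self.encoder(frames[:, s]) for s in range(frames.shape[1])]
        h = torch.zeros_like(feats[0])
        c = torch.zeros_like(feats[0])
        poses = []
        for ft in feats:
            h, c = self.cell(ft, (h, c))
            poses.append(self.head(h).mean(dim=(2, 3)))  # global average pool
        return torch.stack(poses, dim=1)  # (batch, time, 6)

# Usage: an 8-step sequence of 128x416 frame pairs.
poses = PoseRCNN()(torch.randn(2, 8, 6, 128, 416))
print(poses.shape)  # torch.Size([2, 8, 6])
```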
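The three loss terms named in the abstract can be sketched in the same spirit. The exact formulations and weights in the paper may differ; the helpers below are hypothetical and follow common practice in unsupervised depth and ego-motion learning (L1 photometric error, edge-aware smoothness).

```python
import torch

def temporal_loss(warped, target):
    """Photometric (temporal) loss: L1 error between the target frame and a
    neighboring frame warped into it using the predicted depth and pose
    (the warping step itself is omitted in this sketch)."""
    return (warped - target).abs().mean()

def smoothness_loss(depth, image):
    """Edge-aware depth smoothness: penalize depth gradients, down-weighted
    where the image itself has strong gradients (likely object edges)."""
    dx_d = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dy_d = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def consistency_loss(depth_fwd, depth_bwd_warped):
    """Forward-backward consistency: depth predicted for a frame should agree
    with the depth predicted for its neighbor, warped back into this frame."""
    return (depth_fwd - depth_bwd_warped).abs().mean()

# Example: combine the terms with assumed (not the paper's) weights.
depth = torch.rand(2, 1, 128, 416)
image = torch.rand(2, 3, 128, 416)
total = (temporal_loss(image, image.roll(1, dims=0))
         + 0.5 * smoothness_loss(depth, image)
         + 0.2 * consistency_loss(depth, depth.roll(1, dims=0)))
print(total.item())
```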
