Research on Reinforcement Learning Methods for Navigation and Control of Autonomous Mobile Robots
Abstract
Improving the control performance and environmental adaptability of mobile robots in unknown environments with machine learning, especially reinforcement learning (RL), is an important research topic in the navigation and control of autonomous mobile robots. Supported by the National Natural Science Foundation of China project "Research on Kernel-Based Reinforcement Learning and Approximate Dynamic Programming Methods", this paper studies the performance evaluation of approximate policy iteration (API) algorithms, the automatic optimization of the kernel parameters in kernel-based least-squares policy iteration (KLSPI), and the application of API to autonomous obstacle avoidance of mobile robots and to learning control of the longitudinal velocity of autonomous vehicles.
     The main contributions and innovations of this paper can be summarized as follows:
     1. The performance of API algorithms was evaluated. Comparative experiments showed that API algorithms, and KLSPI in particular, perform better on sequential decision-making problems whose value functions are smooth, demonstrating that the smoothness of the value function is an important factor in the performance of approximate policy iteration. To overcome the manual selection of kernel parameters in KLSPI, a sparse kernel dictionary is first obtained by ε-ball nearest-neighbor analysis of the initial samples, and a kernel width optimization method based on gradient descent of the Bellman residual is then proposed. Simulation results verify the effectiveness of this kernel parameter optimization method.
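The thesis reports no code; the following is a minimal Python/NumPy sketch, under assumed details, of the two ingredients named above: building a sparse kernel dictionary by keeping only samples that fall outside an ε-ball around every element already in the dictionary, and taking gradient-descent steps on the squared Bellman residual with respect to the Gaussian kernel width. All names (`epsilon_ball_dictionary`, `bellman_residual_width_step`, and so on) are illustrative assumptions, not identifiers from the thesis.

```python
import numpy as np

def gaussian_kernel(x, c, sigma):
    """Gaussian (RBF) kernel between a state-action feature x and a center c."""
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def epsilon_ball_dictionary(samples, eps):
    """Sparsify: keep a sample as a new dictionary element only if it lies
    outside the eps-ball of every element already in the dictionary."""
    dictionary = []
    for x in samples:
        if all(np.linalg.norm(x - c) > eps for c in dictionary):
            dictionary.append(x)
    return dictionary

def bellman_residual_width_step(sigma, w, transitions, centers, gamma, lr=1e-3):
    """One gradient-descent step on the mean squared Bellman residual with
    respect to the kernel width sigma, holding the weight vector w fixed.
    transitions: iterable of (x, r, x_next), where x is the feature-space
    point for the current state-action pair and x_next for the next one."""
    grad = 0.0
    for x, r, x_next in transitions:
        phi = np.array([gaussian_kernel(x, c, sigma) for c in centers])
        phi_next = np.array([gaussian_kernel(x_next, c, sigma) for c in centers])
        delta = r + gamma * w @ phi_next - w @ phi          # Bellman residual
        # derivative of the Gaussian kernel wrt sigma: k * ||x - c||^2 / sigma^3
        dphi = phi * np.array([np.sum((x - c) ** 2) for c in centers]) / sigma ** 3
        dphi_next = phi_next * np.array([np.sum((x_next - c) ** 2) for c in centers]) / sigma ** 3
        grad += 2.0 * delta * (gamma * w @ dphi_next - w @ dphi)
    return sigma - lr * grad / len(transitions)
```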
     2. The autonomous obstacle avoidance behavior of mobile robots was modeled as a Markov decision process (MDP). On this basis, a new learning control method for autonomous obstacle avoidance in unknown environments was proposed that combines rolling-window path planning with the API algorithm. Simulations verify the generalization performance of the proposed method and its adaptability to unknown environments. In addition, the learning efficiency of two different API algorithms for autonomous obstacle avoidance was compared, and the results show that the KLSPI-based obstacle avoidance method converges to a near-optimal policy more quickly.
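No implementation is listed in the thesis; purely as a hedged sketch of how a batch API learner of this kind is commonly organized, the code below follows the standard LSPI formulation (alternating LSTD-Q policy evaluation with greedy policy improvement over a fixed sample set) rather than the thesis's exact KLSPI variant. The feature map `phi`, the discrete action set, and the function names are assumptions for illustration; in the obstacle avoidance setting the rolling-window planner would only supply the (s, a, r, s') samples and the candidate actions.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, n_features, reg=1e-3):
    """LSTD-Q: least-squares estimate of the Q-function weights of `policy`.
    samples: iterable of (s, a, r, s_next); phi(s, a) returns a feature vector."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))      # next action chosen by the policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, gamma, n_features, n_iters=20, tol=1e-4):
    """Approximate policy iteration: repeat evaluation (LSTD-Q) and improvement."""
    w = np.zeros(n_features)

    def greedy(s):
        # greedy policy with respect to the current weight vector w
        return max(actions, key=lambda a: phi(s, a) @ w)

    for _ in range(n_iters):
        w_new = lstdq(samples, phi, greedy, gamma, n_features)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, greedy
```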
     3. After analyzing the state of the art and key difficulties of autonomous driving on highways and the significance of self-learning control systems, the motion control of vehicles in a highway environment was modeled as an MDP, and an API learning control method was proposed for the longitudinal velocity control of autonomous vehicles on highways. Simulation results show that the API-based learning control method can track the desired velocity reasonably accurately, laying a foundation for further research on learning control of autonomous vehicles.
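The exact MDP components are not spelled out in this abstract, so the toy sketch below is only one assumed way to cast longitudinal velocity tracking as an MDP whose (s, a, r, s') samples could be fed to an API learner such as the LSPI loop sketched above: state = (velocity error, acceleration), actions = a few discrete acceleration commands, reward = negative squared tracking error, with a crude first-order longitudinal model. None of the constants or names come from the thesis.

```python
import numpy as np

class LongitudinalVelocityMDP:
    """Toy MDP for tracking a desired highway speed; the dynamics constants
    are illustrative placeholders, not the vehicle model used in the thesis."""

    ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)     # commanded acceleration [m/s^2]

    def __init__(self, v_desired=25.0, dt=0.1, tau=0.5):
        self.v_desired = v_desired            # desired longitudinal speed [m/s]
        self.dt, self.tau = dt, tau           # step length [s] and actuator lag [s]
        self.v, self.a = 0.0, 0.0             # current speed and acceleration

    def state(self):
        # state = (velocity tracking error, current acceleration)
        return np.array([self.v - self.v_desired, self.a])

    def step(self, u):
        # first-order lag toward the commanded acceleration, then integrate speed
        self.a += self.dt / self.tau * (u - self.a)
        self.v = max(0.0, self.v + self.dt * self.a)
        reward = -(self.v - self.v_desired) ** 2      # penalize tracking error
        return self.state(), reward

# Collect (s, a, r, s') samples with an exploratory policy for an API learner.
env = LongitudinalVelocityMDP()
samples, s = [], env.state()
for _ in range(1000):
    a = float(np.random.choice(LongitudinalVelocityMDP.ACTIONS))
    s_next, r = env.step(a)
    samples.append((s, a, r, s_next))
    s = s_next
```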
