A Regularized Natural AC Algorithm with the Acceleration of Model Learning and Experience Replay
  • Title (Chinese original): 一种采用模型学习和经验回放加速的正则化自然行动器评判器算法
  • Authors: ZHONG Shan (钟珊); LIU Quan (刘全); FU Qi-Ming (傅启明); GONG Sheng-Rong (龚声蓉); DONG Hu-Sheng (董虎胜)
  • Affiliations: School of Computer Science and Technology, Soochow University; School of Computer Science and Engineering, Changshu Institute of Technology; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology; Collaborative Innovation Center of Novel Software Technology and Industrialization; College of Electronic & Information Engineering, Suzhou University of Science and Technology
  • Keywords: actor critic algorithm; model learning; experience replay; optimal policy; regularization; natural gradient
  • Journal: 计算机学报 (Chinese Journal of Computers); journal code: JSJX
  • Online publication date: 2017-12-29 09:08
  • Year: 2019
  • Volume/Issue: v.42; No.435 (2019, Issue 3)
  • Funding: National Natural Science Foundation of China (61772355, 61702055, 61303108, 61373094, 61472262, 61502323, 61502329); Natural Science Foundation of Jiangsu Province (BK2012616); Natural Science Research Project of Jiangsu Higher Education Institutions (13KJB520020); General Natural Science Research Project of Jiangsu Higher Education Institutions (16KJD520001); Jiangsu Province Science and Technology Program (BK2015260); Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422, SYG201308)
  • Language: Chinese
  • Record ID: JSJX201903005
  • Pages: 82-103 (22 pages)
  • CN: 11-1826/TP
Abstract
Chinese abstract (translated): The Actor-Critic (AC) algorithm is an important class of reinforcement learning methods for continuous action spaces. It represents the policy with a separate structure, but updating the policy requires a large number of samples, which leads to low sample efficiency. To address this problem, a regularized natural AC algorithm accelerated by model learning and experience replay (Regularized Natural AC with Model Learning and Experience Replay, RNAC-ML-ER) is proposed. RNAC-ML-ER uses the samples generated by the online interaction between the agent and the environment to learn a linear model of the system dynamics and to fill the experience replay memory. The simulated samples produced by the linear model and the samples stored in the replay memory supplement the online samples to update the value function, the advantage function, and the policy. To improve the efficiency of the updates, at each time step the model is used for planning only when its prediction error does not exceed a threshold, and the samples in the replay memory are replayed in descending order of their TD-error. To reduce the variance of the policy gradient estimate, an advantage-function parameter vector is introduced to approximate the advantage function linearly, an ℓ2-norm regularization term is added to the objective function of the advantage function, and the policy gradient is updated through this parameter vector, which promotes the convergence of both the advantage function and the policy. Under two specified assumptions, the convergence of the proposed RNAC-ML-ER algorithm is proved through theoretical analysis. RNAC-ML-ER is evaluated on four classic reinforcement learning benchmarks, namely pole balancing, mountain car, inverted pendulum, and acrobot; the results show that the proposed algorithm greatly improves sample efficiency and learning speed while maintaining high stability.
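The paragraph above describes a Dyna-style loop: a linear model of the dynamics is fitted from online samples, planning with that model is gated by its prediction error, and stored transitions are replayed in descending order of |TD-error|. The Python sketch below is a minimal illustration of that loop under stated assumptions; LinearModel, replay_by_td_error, maybe_plan and the critic_update callback are hypothetical names, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (assumptions, not the authors' code) of the loop described
# above: fit a linear model of the dynamics online, plan with it only while
# its prediction error stays below a threshold, and replay stored transitions
# in descending order of |TD-error|.

class LinearModel:
    """Linear model of the dynamics: [s', r] ~ F @ phi(s, a)."""

    def __init__(self, feat_dim, state_dim, lr=0.1):
        self.F = np.zeros((state_dim + 1, feat_dim))  # last row predicts the reward
        self.lr = lr

    def predict(self, phi_sa):
        out = self.F @ phi_sa
        return out[:-1], out[-1]                      # (next state, reward)

    def update(self, phi_sa, next_state, reward):
        target = np.append(next_state, reward)
        error = target - self.F @ phi_sa
        self.F += self.lr * np.outer(error, phi_sa)   # gradient step on the squared error
        return np.linalg.norm(error)                  # prediction error of this step


def replay_by_td_error(memory, critic_update, num_updates):
    """Replay stored transitions, largest |TD-error| first."""
    ranked = sorted(memory, key=lambda tr: abs(tr["td_error"]), reverse=True)
    for tr in ranked[:num_updates]:
        tr["td_error"] = critic_update(tr)            # refresh the priority after the update


def maybe_plan(model, policy, features, critic_update,
               start_state, last_pred_error, error_threshold, num_steps):
    """Plan with the learned model only while it is trusted."""
    if last_pred_error > error_threshold:
        return                                        # model not accurate enough: skip planning
    state = start_state
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward = model.predict(features(state, action))
        critic_update({"s": state, "a": action, "r": reward,
                       "s_next": next_state, "td_error": 0.0})
        state = next_state
```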
English abstract: The Actor-Critic (AC) algorithm is an important method for solving problems with continuous action spaces in reinforcement learning (RL), where the actor corresponds to the policy and the critic refers to the value function. However, this separate representation of the policy means that an enormous number of samples are required for the policy to converge. To address this problem, a regularized natural AC algorithm with model learning and experience replay, called RNAC-ML-ER, is proposed, in which the value function, the advantage function, and the policy are updated through online learning, planning, and experience replay so that the optimal policy can be found as quickly as possible. During online learning, a linear model of the system dynamics is learned and the experience replay memory is filled with the samples collected from the interaction between the agent and the environment. Once learned, the linear model can generate a large number of simulated samples. The actual samples produced during learning, the samples stored in the memory, and the simulated samples are combined to further update the value function, the advantage function, and the policy. To improve the efficiency of the updates, the prediction error of the model is computed at each time step, and the model is used for planning only when this error does not exceed a threshold; furthermore, the samples in the memory are replayed in descending order of their TD-errors. To reduce the variance of the estimated gradient and accelerate the convergence of the policy, two techniques are employed: the advantage function is approximated linearly, with ℓ2-regularization introduced into its objective function as a smoothing method, and the policy gradient is updated using the learned parameters of the advantage function. Theoretically, RNAC-ML-ER is analyzed in two respects: its time and space complexity and its convergence. The time and space complexities are O(SW(T_E + T_M + 1)d) and O(M), where S, W, T_E, T_M, d, and M denote the number of episodes, the maximal number of steps per episode, the number of updates of the samples in the memory, the number of planning steps, the dimension of the parameters, and the capacity of the memory, respectively. The convergence of RNAC-ML-ER is established by proving three theorems under two predefined assumptions. RNAC-ML-ER is implemented on four typical benchmarks: pole balancing, mountain car, inverted pendulum, and acrobot. It is compared not only with discrete-action and continuous-action methods but also with nonlinear deep-network models, with the comparison focusing on convergence rate, sample efficiency, and stability. The experimental results show that RNAC-ML-ER achieves the best performance among the compared methods in nearly all experiments, and that the model-based variant outperforms the corresponding model-free method in sample efficiency. A direction for future work is therefore to introduce the linear model into deep-network-based approximation to speed up the learning of the value function and the policy.
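As a worked illustration of the regularized natural update described in both abstracts: the advantage function is approximated linearly on compatible features with an ℓ2 (ridge) penalty, and the fitted weight vector is reused as the policy update direction. With compatible features ψ(s,a) = ∇_θ log π_θ(a|s), this weight vector coincides with the natural gradient direction, which is why it can be used to update θ directly. The numpy sketch below is a hypothetical batch least-squares rendering of that idea, not the paper's exact update rules; all variable names are assumptions.

```python
import numpy as np

# Hypothetical sketch of an l2-regularized natural actor-critic step:
# fit the advantage function by ridge regression of TD-errors on the
# compatible features psi = grad_theta log pi, then move the policy
# parameters along the fitted weight vector.

def advantage_weights(psi, td_errors, reg):
    """Solve  min_w ||psi @ w - td_errors||^2 + reg * ||w||^2  (ridge regression)."""
    d = psi.shape[1]
    A = psi.T @ psi + reg * np.eye(d)
    b = psi.T @ td_errors
    return np.linalg.solve(A, b)

def natural_policy_step(theta, psi, td_errors, step_size=0.05, reg=1e-2):
    """One policy update: theta <- theta + step_size * w."""
    w = advantage_weights(psi, td_errors, reg)
    return theta + step_size * w

# Usage with stand-in random data: a batch of 64 transitions, 8 policy parameters.
rng = np.random.default_rng(0)
psi = rng.normal(size=(64, 8))     # compatible features, one row per transition
td_errors = rng.normal(size=64)    # TD-errors supplied by the critic
theta = natural_policy_step(np.zeros(8), psi, td_errors)
```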
