基于值函数和策略梯度的深度强化学习综述
  • English Title: Survey of Deep Reinforcement Learning Based on Value Function and Policy Gradient
  • Authors: 刘建伟; 高峰; 罗雄麟
  • Authors (English): LIU Jian-Wei; GAO Feng; LUO Xiong-Lin (Department of Automation, China University of Petroleum)
  • Keywords (Chinese): 深度学习; 强化学习; 深度强化学习; 值函数; 策略梯度; 机器学习
  • Keywords (English): deep learning; reinforcement learning; deep reinforcement learning; value function; policy gradient; machine learning
  • Journal: 计算机学报 (Chinese Journal of Computers), CNKI journal code JSJX
  • Institution: Department of Automation, China University of Petroleum (Beijing)
  • Online Publication Date: 2018-10-22
  • Year / Issue: 2019, Vol. 42, Issue 6 (cumulative No. 438)
  • Funding: National Natural Science Foundation of China (21676295); 2018 Forward-Looking and Cultivation Project of China University of Petroleum, Beijing (2462018QZDX02)
  • Language: Chinese
  • Pages: 248-280 (33 pages)
  • CN: 11-1826/TP
  • CNKI Article ID: JSJX201906015
Abstract
As a popular research topic in artificial intelligence, deep reinforcement learning has attracted growing attention since it was first proposed. It can now solve many problems that were previously intractable, such as learning to play video games directly from raw pixels and learning control policies for robotics, and by continually optimizing the control policy it builds autonomous systems with a higher-level understanding of the visual world. Within this field, deep reinforcement learning based on value functions and on policy gradients constitutes the core foundational methods and the focus of research. This paper systematically describes and summarizes these two classes of methods, including the solution algorithms and network architectures they use. First, value-function-based deep reinforcement learning is surveyed, covering the pioneering Deep Q-Network (DQN) and the various improvements built on it. Next, the concept of the policy gradient and its common algorithms are introduced, and three policy-gradient-based methods, deep deterministic policy gradient, trust region policy optimization, and the asynchronous advantage actor-critic, are reviewed together with their improved variants. The paper then outlines the recent milestones AlphaGo and AlphaGo Zero and analyzes how the latter relates to the two classes of methods reviewed here. Finally, future research directions for deep reinforcement learning are discussed.
        As a hot research problem in the field of artificial intelligence, Deep Reinforcement Learning (DRL) has attracted more and more attention since it was proposed. At present, DRL can solve many problems that were previously difficult to solve, such as learning how to play video games directly from raw pixels and learning control strategies for robot problems. DRL builds an autonomous system with a higher-level understanding of the visual world through continuous optimization of the control strategy. Among these methods, DRL based on value functions and policy gradients is the core basic method and research focus. This paper systematically elaborates and summarizes these two types of DRL methods, including their solution algorithms and network structures. Firstly, DRL methods based on value functions are summarized, including the Deep Q-Network (DQN) and improved methods based on DQN. DQN is a pioneering work in the field of DRL; this model trains a Convolutional Neural Network (CNN) with a variant of Q-learning. Before the emergence of DQN, instability or even non-convergence would appear when the action value function in Reinforcement Learning (RL) was approximated by a neural network. To solve this problem, DQN uses two technologies: the experience replay mechanism and the target network. According to their different emphases, the improved versions of DQN can be divided into four categories: improvements to the training algorithm, improvements to the neural network structure, improvements that introduce new learning mechanisms, and improvements based on newly proposed RL algorithms. The research motivation, overall idea, advantages and disadvantages, application scope and performance of each DQN improvement are elaborated in detail. Then the concept of the policy gradient and its common algorithms are introduced. Policy gradient algorithms are widely used for RL problems in continuous spaces. Their main idea is to parameterize the policy, compute the policy gradient with respect to the action, adjust the action continuously along the direction of the gradient, and thus gradually obtain the optimal policy. Common policy gradient algorithms include the REINFORCE algorithm and the Actor-Critic algorithm. DRL methods based on policy gradients are also summarized, including Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), Asynchronous Advantage Actor-Critic (A3C), and corresponding improved methods. Drawing on DQN technology, DDPG adopts the experience replay mechanism and a separate target network to reduce the correlation between data and increase the stability and robustness of the algorithm. TRPO addresses the problem of selecting an appropriate step size by introducing a trust-region constraint defined by the Kullback-Leibler divergence, so as to ensure that the policy is always optimized in a good direction. A3C uses a conceptually simple and lightweight DRL framework and optimizes the deep neural network controller using asynchronous gradient descent. Then, AlphaGo and Alpha Zero, which represent advanced research achievements of DRL, are summarized, and the relationship between the latter and the two DRL methods summarized in this paper is analyzed. Some common experimental platforms for DRL algorithms are also introduced, including ALE, OpenAI Gym, RLLab, MuJoCo and TORCS. Finally, future research directions of DRL are discussed.
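To make the two stabilising mechanisms attributed to DQN above concrete, the following minimal PyTorch-style sketch shows one update step that uses an experience-replay buffer and a separate target network. It is only an illustrative sketch, not the code of the surveyed papers; the environment, network sizes, learning rate, buffer size and batch size are assumptions chosen for readability.

```python
import random
from collections import deque

import torch
import torch.nn as nn

def make_q_net(obs_dim, n_actions):
    # Small fully connected Q-network: maps a state to one Q-value per action.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

obs_dim, n_actions = 4, 2                      # assumed sizes, e.g. a CartPole-like task
q_net = make_q_net(obs_dim, n_actions)         # online network, updated every step
target_net = make_q_net(obs_dim, n_actions)    # frozen copy used for bootstrap targets
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                  # buffer of (s, a, r, s_next, done) tuples
gamma, batch_size = 0.99, 32

def dqn_update():
    """One gradient step on a minibatch sampled uniformly from the replay buffer."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)  # random sampling breaks temporal correlation
    s      = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a      = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    r      = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    done   = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions taken
    with torch.no_grad():
        # Bootstrapped target computed with the *target* network, not the online one.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C updates the target network is refreshed with the online weights:
#     target_net.load_state_dict(q_net.state_dict())
```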
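Likewise, the policy-gradient idea summarised above (parameterize the policy and move its parameters along the gradient of expected return) can be illustrated with a minimal REINFORCE sketch. This is again an illustrative sketch under assumptions: the sizes are arbitrary and `env` is a hypothetical classic Gym-style handle whose reset() returns an observation and whose step() returns (observation, reward, done, info).

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99            # assumed sizes for illustration
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, n_actions))  # outputs action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_episode(env):
    """Sample one episode with the current policy, then take one gradient step."""
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Monte-Carlo return G_t for every time step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple variance reduction

    # Ascend the estimated policy gradient by descending the negated objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```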
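For reference, the Kullback-Leibler trust-region constraint mentioned for TRPO is usually written as the following constrained surrogate problem (following reference [77]), where θ_old denotes the current policy parameters, A the advantage function, and δ the trust-region radius:

```latex
\max_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{\mathrm{old}}}}
  \left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,
        A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]
\quad\text{s.t.}\quad
\mathbb{E}_{s}\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\|\,
        \pi_{\theta}(\cdot\mid s)\big)\right]\le\delta
```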
References
[1]Munos R.From bandits to Monte-Carlo tree search:The optimistic principle applied to optimization and planning.Foundations and Trends in Machine Learning,2014,7(1):1-129
    [2]Sutton R S,Barto A G.Reinforcement Learning:An Introduction.Cambridge,USA:MIT Press,1998
    [3]Bertsekas D P.Dynamic Programming and Optimal Control.Belmont,USA:Athena Scientific,1995
    [4]Szepesvari C.Algorithms for reinforcement learning.Synthesis Lectures on Artificial Intelligence and Machine Learning,2010,4(1):1-103
    [5]Krizhevsky A,Sutskever I,Hinton G E.ImageNet classification with deep convolutional neural networks//Proceedings of the International Conference on Neural Information Processing Systems.Nevada,USA,2012:1097-1105
    [6]Sermanet P,Kavukcuoglu K,Chintala S,et al.Pedestrian detection with unsupervised multi-stage feature learning//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Portland,USA,2013:3626-3633
    [7]Dahl G E,Acero A.Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition.IEEE Transactions on Audio Speech&Language Processing,2011,20(1):30-42
    [8]Graves A,Mohamed A R,Hinton G.Speech recognition with deep recurrent neural networks//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing.Vancouver,Canada,2013:6645-6649
    [9]Huval B,Coates A,Ng A.Deep learning for class-generic object detection.arXiv preprint arXiv:1312.6885,2013
    [10]Makantasis K,Karantzalos K,Doulamis A,et al.Deep learning-based man-made object detection from hyperspectral data//Proceedings of the Advances in Visual Computing-11th International Symposium.Las Vegas,USA,2015:717-727
    [11]Jain V,Seung H S.Natural image denoising with convolutional networks//Proceedings of the 22nd Annual Conference on Neural Information Processing Systems.Vancouver,Canada,2008:769-776
    [12]Mikolov T,Karafit M,Burget L,et al.Recurrent neural network based language model//Proceedings of the Conference of International Speech Communication Association.Chiba,Japan,2010:1045-1048
    [13]He K M,Zhang X Y,Ren S,Sun J.Deep residual learning for image recognition//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas,USA,2016:770-778
    [14]Goodfellow I J,Pouget-Abadie J,Mirza M,et al.Generative adversarial nets//Proceedings of the Advances in Neural Information Processing Systems.Montreal,Canada,2014:2672-2680
    [15]Mnih V,Kavukcuoglu K,Silver D,et al.Playing Atari with deep reinforcement learning//Proceedings of the Workshops at the 26th Neural Information Processing Systems 2013.Lake Tahoe,USA,2013:201-220
    [16]Mnih V,Kavukcuoglu K,Silver D,et al.Human-level control through deep reinforcement learning.Nature,2015,518(7540):529-533
    [17]Zhang M,Geng X,Bruce J,et al.Deep reinforcement learning for tensegrity robot locomotion//Proceedings of the International Conference on Robotics and Automation.Singapore,2017:634-641
    [18]Gu S,Holly E,Lillicrap T,et al.Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates//Proceedings of the IEEE International Conference on Robotics and Automation.Singapore,2017:3389-3396
    [19]Cuayhuitl H,Yu S,Williamson A,et al.Scaling up deep reinforcement learning for multi-domain dialogue systems//Proceedings of the International Joint Conference on Neural Networks.Anchorage,USA,2017:3339-3346
    [20]Zhao T,Eskenazi M.Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning//Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue.Los Angeles,USA,2016:1-10
    [21]Gao J,Shen Y,Liu J,et al.Adaptive traffic signal control:Deep reinforcement learning algorithm with experience replay and target network.arXiv preprint arXiv:1705.02755,2017
    [22]Genders W,Razavi S.Using a deep reinforcement learning agent for traffic signal control.arXiv preprint arXiv:1611.01142,2016
    [23]Xiong X,Wang J,Zhang F,et al.Combining deep reinforcement learning and safety based control for autonomous driving.arXiv preprint arXiv:1612.00147,2016
    [24]Sallab A E L,Abdou M,Perot E,et al.Deep reinforcement learning framework for autonomous driving.Electronic Imaging,2017,2017(19):70-76
    [25]Thananjeyan B,Garg A,Krishnan S,et al.Multilateral surgical pattern cutting in 2D orthotropic gauze with deep reinforcement learning policies for tensioning//Proceedings of the IEEE International Conference on Robotics and Automation.Singapore,2017:2371-2378
    [26]O′Shea T J,Clancy T C.Deep reinforcement learning radio control and signal detection with KeRLym,a Gym RL agent.arXiv preprint arXiv:1605.09221,2016
    [27]Liu Quan,Zhai Jian-Wei,Zhang Zong-Zhang,et al.A survey on deep reinforcement learning.Chinese Journal of Computers,2018,41(1):1-27(in Chinese)(刘全,翟建伟,章宗长等.深度强化学习综述.计算机学报,2018,41(1):1-27)
    [28]Zhao Dong-Bin,Shao Kun,Zhu Yuan-Heng,et al.Review of deep reinforcement learning and discussions on the development of computer Go.Control Theory&Applications,2016,33(6):701-717(in Chinese)(赵冬斌,邵坤,朱圆恒等.深度强化学习综述:兼论计算机围棋的发展.控制理论与应用,2016,33(6):701-717)
    [29]Sutton R S.Learning to predict by the methods of temporal differences.Machine Learning,1988,3:9-44
    [30]Watkins C J C H,Dayan P.Q-learning.Machine Learning,1992,8(3-4):279-292
    [31]Lin L J.Self-improving reactive agents based on reinforcement learning,planning and teaching.Machine Learning,1992,8(3-4):293-321
    [32]Hasselt H V,Guez A,Silver D.Deep reinforcement learning with double Q-learning//Proceedings of the 30th AAAI Conference on Artificial Intelligence.Phoenix,USA,2016:2094-2100
    [33]Hasselt H V.Double Q-learning//Proceedings of the 24th Annual Conference on Neural Information Processing Systems.Vancouver,Canada,2010:2613-2621
    [34]Hester T,Vecerik M,Pietquin O,et al.Learning from demonstrations for real world reinforcement learning.arXiv preprint arXiv:1704.03732,2017
    [35]Schaul T,Quan J,Antonoglou I,Silver D.Prioritized experience replay//Proceedings of the 4th International Conference on Learning Representations.San Juan,Puerto Rico,2016:322-355
    [36]Wang Z,Schaul T,Hessel M,et al.Dueling network architectures for deep reinforcement learning//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:1995-2003
    [37]Mahajan A,Tulabandhula T.Symmetry learning for function approximation in reinforcement learning.arXiv preprint arXiv:1706.02999,2017
    [38]Narayanamurthy S M,Ravindran B.Efficiently exploiting symmetries in real time dynamic programming//Proceedings of the 20th International Joint Conference on Artificial Intelligence.Hyderabad,India,2007:2556-2561
    [39]Taitler A,Shimkin N.Learning control for air hockey striking using deep reinforcement learning.arXiv preprint arXiv:1702.08074,2017
    [40]Levine N,Zahavy T,Mankowitz D J,et al.Shallow updates for deep reinforcement learning//Proceedings of the Annual Conference on Neural Information Processing Systems.Long Beach,USA,2017:3138-3148
    [41]Hill B.Bayesian inference in statistical analysis.Technometrics,1973,16(3):478-479
    [42]Lipton Z C,Li X,Gao J,et al.Efficient dialogue policy learning with BBQ-networks.arXiv preprint arXiv:1608.05081,2016
    [43]Blundell C,Cornebise J,Kavukcuoglu K,et al.Weight uncertainty in neural networks.arXiv preprint arXiv:1505.05424,2015
    [44]Leibfried F,Graumoya J,Bouammar H.An informationtheoretic optimality principle for deep reinforcement learning.arXiv preprint arXiv:1708.01867,2017
    [45]Ortega P A,Braun D A.Thermodynamics as a theory of decision-making with information processing costs.Royal Society A Mathematical Physical&Engineering Sciences,2013,469(2153):926-930
    [46]Mossalam H,Assael Y M,Roijers D M,et al.Multi-objective deep reinforcement learning.arXiv preprint arXiv:1610.02707,2016
    [47]Roijers D M,Whiteson S,Oliehoek F A.Algorithmic Decision Theory:Computing Convex Coverage Sets for Faster Multi-Objective Coordination.Berlin and Heidelberg:Springer-Verlag,2015
    [48]Roijers D M,Vamplew P,Whiteson S,et al.A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research,2013,48:67-113
    [49]Anschel O,Baram N,Shimkin N.Averaged-DQN:Variance reduction and stabilization for deep reinforcement learning//Proceedings of the 34th International Conference on Machine Learning.Sydney,Australia,2017:176-185
    [50]Raghu A,Komorowski M,Celi L A,et al.Continuous state-space models for optimal sepsis treatment-A deep reinforcement learning approach//Proceedings of the Machine Learning for Health Care.Boston,USA,2017:147-163
    [51]Hausknecht M,Stone P.Deep recurrent Q-learning for partially observable MDPs.arXiv preprint arXiv:1507.06527,2015
    [52]Hochreiter S,Schmidhuber J.Long short-term memory.Neural Computation,1997,9(8):1735-1780
    [53]Narasimhan K,Kulkarni T,Barzilay R.Language understanding for text-based games using deep reinforcement learning//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Lisbon,Portugal,2015:1-11
    [54]Zhu P,Li X,Poupart P.On improving deep reinforcement learning for POMDPs.arXiv preprint arXiv:1704.07978,2017
    [55]Osband I,Roy B V,Wen Z.Generalization and exploration via randomized value functions//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:2377-2386
    [56]Osband I,Blundell C,Pritzel A,et al.Deep exploration via bootstrapped DQN//Proceedings of the Annual Conference on Neural Information Processing Systems.Barcelona,Spain,2016:4026-4034
    [57]Bickel P J,Freedman D A.Some asymptotic theory for the bootstrap.The Annals of Statistics,1981,9(6):1196-1217
    [58]Jaques N,Gu S,Bahdanau D,et al.Sequence tutor:Conservative fine-tuning of sequence generation models with KL-control//Proceedings of the 34th International Conference on Machine Learning.Sydney,Australia,2017:1645-1654
    [59]Nair A,Srinivasan P,Blackwell S,et al.Massively parallel methods for deep reinforcement learning.arXiv preprint arXiv:1507.04296,2015
    [60]Sorokin I,Seleznev A,Pavlov M,et al.Deep attention recurrent Q-network.arXiv preprint arXiv:1512.01693,2015
    [61]Lipton Z C,Kumar A,Gao J,et al.Combating deep reinforcement learning’s sisyphean curse with reinforcement learning.arXiv preprint arXiv:1611.01211,2017
    [62]Gu Shixiang,Lillicrap T,Sutskever I,et al.Continuous deep Q-learning with model-based acceleration//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:2829-2838
    [63]Duryea E,Ganger M,Hu W.Exploring deep reinforcement learning with multi Q-Learning.Intelligent Control&Automation,2016,07(4):129-144
    [64]Mnih V,Badia A P,Mirza M,et al.Asynchronous methods for deep reinforcement learning//Proceedings of the International Conference on Machine Learning.New York,USA,2016:1928-1937
    [65]Bellemare M G,Dabney W,Munos R.A distributional perspective on reinforcement learning//Proceedings of the 34th International Conference on Machine Learning.Sydney,Australia,2017:449-458
    [66]Fortunato M,Azar M G,Piot B,et al.Noisy networks for exploration.arXiv preprint arXiv:1706.10295,2017
    [67]Hessel M,Modayil J,Van Hasselt H,et al.Rainbow:Combining improvements in deep reinforcement learning.arXiv preprint arXiv:1710.02298,2017
    [68]Lee K,Choi S,Oh S.Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning.arXiv preprint arXiv:1709.06293,2017
    [69]He H,Boydgraber J,Kwok K,et al.Opponent modeling in deep reinforcement learning//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:1804-1813
    [70]Palmer G,Tuyls K,Bloembergen D,et al.Lenient multiagent deep reinforcement learning.arXiv preprint arXiv:1707.04402,2017
    [71]Potter M A,De Jong K A.A cooperative coevolutionary approach to function optimization.Computer Science,1994,866:249-257
    [72]Omidshafiei S,Pazis J,Amato C,et al.Deep decentralized multi-task multi-agent reinforcement learning under partial observability//Proceedings of the 34th International Conference on Machine Learning.Sydney,Australia,2017:2681-2690
    [73]Qureshi A H,Nakamura Y,Yoshikawa Y,et al.Robot gains social intelligence through multimodal deep reinforcement learning//Proceedings of the 16th IEEE International Conference on Humanoid Robots.Cancun,Mexico,2016:745-751
    [74]Bellemare M G,Ostrovski G,Guez A,et al.Increasing the action gap:New operators for reinforcement learning//Proceedings of the 30th AAAI Conference on Artificial Intelligence.Phoenix,USA,2016:1476-1483
    [75]He F S,Liu Y,Schwing A G,et al.Learning to play in a day:Faster deep reinforcement learning by optimality tightening.arXiv preprint arXiv:1611.01606,2016
    [76]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with deep reinforcement learning.Computer Science,2015,8(6):A187
    [77]Schulman J,Levine S,Moritz P,et al.Trust region policy optimization//Proceedings of the International Conference on Machine Learning.Lugano,Switzerland,2015:1889-1897
    [78]Williams R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning,1992,8(3-4):229-256
    [79]Peters J,Vijayakumar S,Schaal S.Natural actor-critic//Proceedings of the 16th European Conference on Machine Learning.Porto,Portugal,2005:280-291
    [80]Bhatnagar S,Sutton R S,Ghavamzadeh M,et al.Incremental natural actor-critic algorithms//Proceedings of the 21st Annual Conference on Neural Information Processing Systems.Vancouver,Canada,2007:105-112
    [81]Degris T,Pilarski P M,Sutton R S.Model-free reinforcement learning with continuous action in practice//Proceedings of the American Control Conference.Montreal,Canada,2012:2177-2182
    [82]Sutton R S.Policy gradient methods for reinforcement learning with function approximation//Proceedings of the Advances in Neural Information Processing Systems.Denver,USA,1999:1057-1063
    [83]Silver D,Lever G,Heess N,et al.Deterministic policy gradient algorithms//Proceedings of the 31st International Conference on Machine Learning.Beijing,China,2014:387-395
    [84]Degris T,White M,Sutton R S.Linear off-policy actor-critic//Proceedings of the 29th International Conference on Machine Learning.Edinburgh,UK,2012
    [85]Sutton R S,Maei H R,Precup D,et al.Fast gradient-descent methods for temporal-difference learning with linear function approximation//Proceedings of the 26th Annual International Conference on Machine Learning.Montreal,Canada,2009:993-1000
    [86]Popov I,Heess N,Lillicrap T,et al.Data-efficient deep reinforcement learning for dexterous manipulation.arXiv preprint arXiv:1704.03073,2017
    [87]Wang S,Jing Y.Deep reinforcement learning with surrogate agent-environment interface.arXiv preprint arXiv:1709.03942,2017
    [88]Večerík M,Hester T,Scholz J,et al.Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards.arXiv preprint arXiv:1707.08817,2017
    [89]Hausknecht M,Stone P.Deep reinforcement learning in parameterized action space.arXiv preprint arXiv:1511.04143,2015
    [90]Bruin T D,Kober J,Tuyls K,et al.Improved deep reinforcement learning for robotics through distribution-based experience retention//Proceedings of the International Conference on Intelligent Robots and Systems.Daejeon,South Korea,2016:3947-3952
    [91]Kakade S,Langford J.Approximately optimal approximate reinforcement learning//Proceedings of the 19th International Conference on Machine Learning.Sydney,Australia,2002:267-274
    [92]Baxter J,Bartlett P L.Infinite-horizon policy-gradient estimation.Journal of Artificial Intelligence Research,2011,15(1):319-350
    [93]Lagoudakis M G,Parr R.Reinforcement learning as classification:Leveraging modern classifiers//Proceedings of the 20th International Conference on Machine Learning.Washington,USA,2003:424-431
    [94]Gabillon V,Ghavamzadeh M,Scherrer B.Approximate dynamic programming finally performs well in the game of Tetris//Proceedings of the Advances in Neural Information Processing Systems.Lake Tahoe,USA,2013:1754-1762
    [95]Kakade S.A natural policy gradient//Proceedings of the Advances in Neural Information Processing Systems.Vancouver,Canada,2001:1531-1538
    [96]Wu Y,Mansimov E,Liao S,et al.Scalable trust-region method for deep reinforcement learning using Kroneckerfactored approximation.arXiv preprint arXiv:1708.05144,2017
    [97]Grosse R,Martens J.A Kronecker-factored approximate fisher matrix for convolution layers//Proceedings of the International Conference on Machine Learning.New York,USA,2016:573-582
    [98]Martens J,Grosse R.Optimizing neural networks with Kronecker-factored approximate curvature//Proceedings of the 32nd International Conference on Machine Learning.Lille,France,2015:2408-2417
    [99]Schulman J,Moritz P,Levine S,et al.High-dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438,2015
    [100]Niu F,Recht B,Re C,et al.HOGWILD!:A lock-free approach to parallelizing stochastic gradient descent//Proceedings of the Advances in Neural Information Processing Systems.Granada,Spain,2011:693-701
    [101]Babaeizadeh M,Frosio I,Tyree S,et al.Reinforcement learning through asynchronous advantage actor-critic on a GPU.arXiv preprint arXiv:1611.06256,2017
    [102]Clemente A V,Castejón H N,Chandra A.Efficient parallel methods for deep reinforcement learning.arXiv preprint arXiv:1705.04862,2017
    [103]Jaderberg M,Mnih V,Czarnecki W M,et al.Reinforcement learning with unsupervised auxiliary tasks.arXiv preprint arXiv:1611.05397,2016
    [104]Zahavy T,Zrihem N B,Mannor S.Graying the black box:Understanding DQNs//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:1899-1908
    [105]Silver D,Huang A,Maddison C J,et al.Mastering the game of Go with deep neural networks and tree search.Nature,2016,529(7587):484-489
    [106]Li X,Li L,Gao J,et al.Recurrent reinforcement learning:A hybrid approach.arXiv preprint arXiv:1509.03044,2015
    [107]Mirowski P,Pascanu R,Viola F,et al.Learning to navigate in complex environments.arXiv preprint arXiv:1611.03673,2016
    [108]Sharma S,Suresh A,Ramesh R,et al.Learning to factor policies and action-value functions:Factored action space representations for deep reinforcement learning.arXiv preprint arXiv:1705.07269,2017
    [109]Sharma S,Raguvir J G,Ramesh S,et al.Learning to mix n-Step returns:Generalizing lambda-returns for deep reinforcement learning.arXiv preprint arXiv:1705.07445,2017
    [110]Wang Z,Bapst V,Heess N,et al.Sample efficient actorcritic with experience replay.arXiv preprint arXiv:1611.01224,2016
    [111]Munos R,Stepleton T,Harutyunyan A,et al.Safe and efficient off-policy reinforcement learning//Proceedings of the Advances in Neural Information Processing Systems.Barcelona,Spain,2016:1046-1054
    [112]Wawrzyński P.Real-time reinforcement learning by sequential Actor-Critics and experience replay.Neural Networks,2009,22(10):1484-1497
    [113]Nachum O,Norouzi M,Xu K,et al.Bridging the gap between value and policy based reinforcement learning//Proceedings of the Advances in Neural Information Processing Systems.Long Beach,USA,2017:2772-2782
    [114]O’Donoghue B,Munos R,Kavukcuoglu K,et al.Combining policy gradient and Q-learning.arXiv preprint arXiv:1611.01626,2016
    [115]Schulman J,Chen X,Abbeel P.Equivalence between policy gradients and soft Q-Learning.arXiv preprint arXiv:1704.06440,2017
    [116]Lin K,Wang S,Zhou J.Collaborative deep reinforcement learning.arXiv preprint arXiv:1702.05796,2017
    [117]Hinton G,Vinyals O,Dean J.Distilling the knowledge in a neural network.Computer Science,2015,14(7):38-39
    [118]Tangkaratt V,Abdolmaleki A,Sugiyama M.Deep reinforcement learning with relative entropy stochastic search.arXiv preprint arXiv:1705.07606,2017
    [119]Heess N,Hunt J J,Lillicrap T P,et al.Memory-based control with recurrent neural networks.arXiv preprint arXiv:1512.04455,2015
    [120]Heess N,Wayne G,Silver D,et al.Learning continuous control policies by stochastic value gradients//Proceedings of the Advances in Neural Information Processing Systems.Montreal,Canada,2015:2944-2952
    [121]Balduzzi D,Ghifary M.Compatible value gradients for reinforcement learning of continuous deep policies.Computer Science,2015,8(6):A187
    [122]Engel Y,Szabo P,Volkinshtein D.Learning to control an octopus arm with Gaussian process temporal difference methods//Proceedings of the Advances in Neural Information Processing Systems.Vancouver,Canada,2005:347-354
    [123]Coulom R.Efficient selectivity and backup operators in Monte-Carlo tree search.Lecture Notes in Computer Science,2006,4630:72-83
    [124]Tesauro G,Galperin G R.On-line policy improvement using Monte-Carlo search//Proceedings of the Advances in Neural Information Processing Systems.Denver,USA,1996:1068-1074
    [125]Silver D,Schrittwieser J,Simonyan K,et al.Mastering the game of Go without human knowledge.Nature,2017,550(7676):354
    [126]Bellemare M G,Naddaf Y,Veness J,et al.The arcade learning environment:An evaluation platform for general agents(extended abstract)//Proceedings of the 24th International Joint Conference on Artificial Intelligence.Buenos Aires,Argentina,2015:4148-4152
    [127]Brockman G,Cheung V,Pettersson L,et al.Openai gym.arXiv preprint arXiv:1606.01540,2016
    [128]Duan Y,Chen X,Houthooft R,et al.Benchmarking deep reinforcement learning for continuous control//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:1329-1338
    [129]Todorov E,Erez T,Tassa Y.MuJoCo:A physics engine for model-based control//Proceedings of the International Conference on Intelligent Robots and Systems.Vilamoura,Portugal,2012:5026-5033
    [130]Boyan J.Generalization in reinforcement learning:Safely approximating the value function//Proceedings of the Advances in Neural Information Processing Systems.Denver,USA,1994:369-376
    [131]Kingma D P,Ba J.Adam:A method for stochastic optimization.arXiv preprint arXiv:1412.6980,2014
    [132]Christiano P,Leike J,Brown T B,et al.Deep reinforcement learning from human preferences//Proceedings of the Advances in Neural Information Processing Systems.Long Beach,USA,2017:4302-4310
    [133]Finn C,Levine S,Abbeel P.Guided cost learning:Deep inverse optimal control via policy optimization//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:49-58
    [134]Chen T,Givony S,Zahavy T,et al.A deep hierarchical approach to lifelong learning in Minecraft//Proceedings of the 31st AAAI Conference on Artificial Intelligence.San Francisco,USA,2017:1553-1561
    [135]Teh Y W,Bapst V,Czarnecki W M,et al.Distral:Robust multitask reinforcement learning//Proceedings of the Advances in Neural Information Processing Systems.Long Beach,USA,2017:4499-4509
    (1)RMSProp:Divide the gradient by a running average of its recent magnitude.https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-runningaverage-of-its-recent-magnitude,accessed 2017-04-21
