Asynchronous Deep Reinforcement Learning with Multiple Gating Mechanisms (基于多重门限机制的异步深度强化学习)
  • English title: Asynchronous Deep Reinforcement Learning with Multiple Gating Mechanisms
  • Authors: 徐进 (XU Jin); 刘全 (LIU Quan); 章宗长 (ZHANG Zong-Zhang); 梁斌 (LIANG Bin); 周倩 (ZHOU Qian)
  • Affiliations: School of Computer Science and Technology, Soochow University; Collaborative Innovation Center of Novel Software Technology and Industrialization; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Keywords: deep learning; reinforcement learning; asynchronous deep reinforcement learning; recurrent neural network; multiple gating mechanisms; skip connection
  • Journal: 计算机学报 (Chinese Journal of Computers); CNKI journal code: JSJX
  • Online publication date: 2017-12-29 09:09
  • Year: 2019
  • Volume/Issue: Vol. 42, No. 435 (Issue 03)
  • Pages: 186-203 (18 pages)
  • CN: 11-1826/TP
  • Funding: the National Natural Science Foundation of China (61272055, 61303108, 61373094, 61472262, 61502323, 61502329, 61772355); the Natural Science Foundation of Jiangsu Province (BK2012616); the Natural Science Research Project of Jiangsu Higher Education Institutions (13KJB520020, 16KJB520041); the Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); the Suzhou Applied Basic Research Program, Industrial Part (SYG201422, SYG201308)
  • Language: Chinese
  • CNKI record number: JSJX201903012
Abstract
In recent years, deep reinforcement learning (RL), the combination of reinforcement learning and deep learning, has become a new research hotspot in the artificial intelligence community. Deep RL algorithms have achieved remarkable success in high-dimensional, large state-space tasks such as the Atari 2600 games, but issues remain, notably the long time required to train model parameters. Even with graphics processing unit (GPU) acceleration, experience-replay-based deep RL algorithms such as deep Q-network, double deep Q-network, and deep recurrent Q-network still need ten days or more of training. DeepMind's researchers replaced experience replay with asynchronous methods, greatly accelerating the training of deep RL models: by using multi-threading techniques, asynchronous deep reinforcement learning (ADRL) algorithms can reach good performance within a few hours or days on a single machine with a standard multi-core central processing unit, without requiring a GPU. However, ADRL algorithms based on recurrent neural networks (RNNs) still require a large amount of training time, because an RNN with memory cannot exploit parallel computation to accelerate training: the computation at the current time step depends on the output of the previous time step, so a sequentially generated input sequence must be processed one element at a time. To accelerate the training of ADRL models while preserving the network's ability to memorize, this paper proposes an asynchronous advantage actor-critic algorithm with multiple gating mechanisms (A3C-MGM). The model has three main characteristics. First, similar to long short-term memory, the gates of the multiple gating mechanisms control what information is passed up the hierarchy, giving a feedforward network the ability to remember, so that the agent can make better decisions by memorizing state information from different time steps. Second, because the computation at the current time step of A3C-MGM does not depend on the output of the previous time step, the agent's training can be further accelerated by parallel computation, reducing the training time required by the model. Third, a new kind of skip connection passes data to deeper network layers, strengthening the model's ability to recognize state features and thereby improving the stability and performance of the deep RL algorithm. We evaluate the new algorithm on five challenging strategic and sparse-reward games from the classic Atari 2600 suite, namely Battle Zone, Tutankham, Time Pilot, Space Invaders, and Pong, and, to further verify its performance, compare the algorithms on four additional Atari 2600 games. Experimental results show that the new algorithm learns faster than traditional asynchronous deep reinforcement learning algorithms and achieves better average reward per episode, especially on Battle Zone, Tutankham, Asterix, Chopper Command, Solaris, and Time Pilot.
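To make the idea of "gated feedforward memory plus a skip connection" more concrete, here is a minimal sketch, assuming a GLU-style gate (element-wise sigmoid gating of a linear transform) and an additive skip projection over a stack of recent observations. The class name GatedSkipBlock, the layer sizes, and the frame-stacking input are illustrative assumptions; the abstract gives no equations, so this is not the paper's exact A3C-MGM architecture.

```python
# Illustrative PyTorch sketch (not the authors' exact model): a feedforward
# block whose gates decide how much of each input to pass upward, plus an
# additive skip connection that carries input features to a deeper layer.
import torch
import torch.nn as nn


class GatedSkipBlock(nn.Module):
    """Gated feedforward block with a skip connection (hypothetical example)."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.value = nn.Linear(in_dim, hidden_dim)  # candidate features
        self.gate = nn.Linear(in_dim, hidden_dim)   # how much to let through
        self.skip = nn.Linear(in_dim, hidden_dim)   # projects input to the deeper layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU-style gating: h = value(x) * sigmoid(gate(x)).
        h = self.value(x) * torch.sigmoid(self.gate(x))
        # Skip connection: add the projected raw input so state features
        # reach the deeper layer directly.
        return h + self.skip(x)


if __name__ == "__main__":
    # Nothing here depends on a previous time step's output, so a batch of
    # stacked recent observations can be processed in one parallel pass,
    # unlike an RNN that must unroll step by step.
    stacked_states = torch.randn(32, 4 * 84 * 84)  # e.g., 4 flattened Atari frames
    block = GatedSkipBlock(in_dim=4 * 84 * 84, hidden_dim=256)
    features = block(stacked_states)
    print(features.shape)  # torch.Size([32, 256])
```

In this sketch the "memory" comes only from stacking observations from several time steps into one input vector, which is an assumption on our part; the point it illustrates is that such a gated feedforward block, unlike an RNN, imposes no sequential dependency and therefore parallelizes across time steps.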
