深度强化学习综述:兼论计算机围棋的发展 (Review of deep reinforcement learning and discussions on the development of computer Go)
  • English title: Review of deep reinforcement learning and discussions on the development of computer Go
  • Authors: 赵冬斌; 邵坤; 朱圆恒; 李栋; 陈亚冉; 王海涛; 刘德荣; 周彤; 王成红
  • English authors: ZHAO Dong-bin; SHAO Kun; ZHU Yuan-heng; LI Dong; CHEN Ya-ran; WANG Hai-tao; LIU De-rong; ZHOU Tong; WANG Cheng-hong; The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences; College of Automation, University of Science and Technology Beijing; Department of Automation, Tsinghua University; Department of Information Sciences, National Natural Science Foundation of China
  • Chinese keywords: 深度强化学习; 初弈号; 深度学习; 强化学习; 人工智能
  • English keywords: deep reinforcement learning; AlphaGo; deep learning; reinforcement learning; artificial intelligence
  • Chinese journal abbreviation: KZLY
  • English journal title: Control Theory & Applications
  • Institutions: 中国科学院自动化研究所复杂系统管理与控制国家重点实验室; 北京科技大学自动化学院; 清华大学自动化系; 国家自然科学基金委信息科学部
  • Publication date: 2016-06-15
  • Published in: 控制理论与应用 (Control Theory & Applications)
  • Year: 2016
  • Volume: v.33
  • Funding: National Natural Science Foundation of China projects (61273136, 61573353, 61533017)
  • Language: Chinese
  • Record ID: KZLY201606001
  • Number of pages: 17
  • Issue: 06
  • CN: 44-1240/TP
  • Pages: 4-20
Abstract
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, so that control signals can be produced directly from raw image input, making it an artificial-intelligence approach closer to the human way of thinking. Since it was proposed, deep reinforcement learning has achieved remarkable results in both theory and applications. In particular, AlphaGo (rendered in this paper as "初弈号"), the computer Go program developed by the Google DeepMind team on the basis of deep reinforcement learning, defeated the world's top Go player Lee Sedol 4:1 in March 2016, setting a new milestone in the history of artificial intelligence. This paper surveys the development of deep reinforcement learning, reviews the history of computer Go, analyzes the characteristics of the algorithms, and discusses future research directions and application prospects, in the hope of providing a valuable reference for the development of this new direction in control theory and applications.
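In the spirit of reference [1] (the deep Q-network), the "control directly from image input" mechanism summarized in the abstract can be sketched as a convolutional Q-network: convolutional layers provide the perception, and a fully connected head outputs one action value per discrete action. The Python/PyTorch code below is only a minimal illustrative sketch; the ConvQNetwork name, the 84x84 four-frame input, and the layer sizes are assumptions borrowed from common Atari-style setups, not details taken from the surveyed work.

# Illustrative sketch only: a DQN-style convolutional Q-network mapping raw
# image frames to action values (layer sizes and 84x84 input are assumptions).
import torch
import torch.nn as nn

class ConvQNetwork(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        # Perception: convolutional layers extract features from stacked frames.
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Decision: fully connected layers output one Q-value per discrete action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84) grayscale screens scaled to [0, 1].
        return self.head(self.features(frames))

if __name__ == "__main__":
    q_net = ConvQNetwork(n_actions=6)          # e.g. six discrete game actions
    obs = torch.rand(1, 4, 84, 84)             # a dummy stack of four frames
    greedy_action = q_net(obs).argmax(dim=1)   # act directly from pixels
    print(greedy_action.item())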
References
[1]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529–533.
    [2]SILVER D,HUANG A,MADDISON C,et al.Mastering the game of Go with deep neural networks and tree search[J].Nature,2016,529(7587):484–489.
    [3]AREL I.Deep reinforcement learning as foundation for artificial general intelligence[M]//Theoretical Foundations of Artificial General Intelligence.Amsterdam:Atlantis Press,2012:89–102.
    [4]TESAURO G.TD-Gammon,a self-teaching backgammon program,achieves master-level play[J].Neural Computation,1994,6(2):215–219.
    [5]SUTTON R S,BARTO A G.Reinforcement Learning:An Introduction[M].Cambridge MA:MIT Press,1998.
    [6]KEARNS M,SINGH S.Near-optimal reinforcement learning in polynomial time[J].Machine Learning,2002,49(2/3):209–232.
    [7]KOCSIS L,SZEPESVARI C.Bandit based Monte-Carlo planning[C]//Proceedings of the European Conference on Machine Learning.Berlin:Springer,2006:282–293.
    [8]LITTMAN M L.Reinforcement learning improves behaviour from evaluative feedback[J].Nature,2015,521(7553):445–451.
    [9]BELLMAN R.Dynamic programming and Lagrange multipliers[J].Proceedings of the National Academy of Sciences,1956,42(10):767–769.
    [10]WERBOS P J.Advanced forecasting methods for global crisis warning and models of intelligence[J].General Systems Yearbook,1977,22(12):25–38.
    [11]WATKINS C J C H.Learning from delayed rewards[D].Cambridge:University of Cambridge,1989.
    [12]RUMMERY G A,NIRANJAN M.On-Line Q-Learning Using Connectionist Systems[M].Cambridge:University of Cambridge,Department of Engineering,1994.
    [13]BERTSEKAS D P,TSITSIKLIS J N.Neuro-dynamic programming:an overview[C]//Proceedings of the 34th IEEE Conference on Decision and Control.New Orleans:IEEE,1995,1:560–564.
    [14]THRUN S.Monte Carlo POMDPs[C]//Advances in Neural Information Processing Systems.Denver:MIT Press,1999,12:1064–1070.
    [15]LEWIS F L,VRABIE D.Reinforcement learning and adaptive dynamic programming for feedback control[J].IEEE Circuits and Systems Magazine,2009,9(3):32–50.
    [16]SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//Proceedings of the International Conference on Machine Learning.Beijing:ACM,2014:387–395.
    [17]CAFLISCH R E.Monte Carlo and quasi-Monte Carlo methods[J].Acta Numerica,1998,7:1–49.
    [18]MICHIE D,CHAMBERS R A.BOXES:An experiment in adaptive control[J].Machine Intelligence,1968,2(2):137–152.
    [19]BARTO A G,DUFF M.Monte Carlo matrix inversion and reinforcement learning[C]//Advances in Neural Information Processing Systems.Denver:NIPS,1993:687–694.
    [20]BROWNE C B,POWLEY E,WHITEHOUSE D,et al.A survey of Monte Carlo tree search methods[J].IEEE Transactions on Computational Intelligence and AI in Games,2012,4(1):1–43.
    [21]CHASLOT G.Monte-Carlo tree search[D].Maastricht:Maastricht University,2010.
    [22]COULOM R.Efficient selectivity and backup operators in Monte-Carlo tree search[M]//Computers and Games.Berlin Heidelberg:Springer,2006:72–83.
    [23]WEI Q L,LIU D R.A new discrete-time iterative adaptive dynamic programming algorithm based on Q-learning[M]//International Symposium on Neural Networks.New York:Springer,2015:43–52.
    [24]WEI Q L,LIU D R,SHI G.A novel dual iterative-learning method for optimal battery management in smart residential environments[J].IEEE Transactions on Industrial Electronics,2015,62(4):2509–2518.
    [25]JAAKKOLA T,JORDAN M I,SINGH S P.On the convergence of stochastic iterative dynamic programming algorithms[J].Neural Computation,1994,6(6):1185–1201.
    [26]TSITSIKLIS J N.Asynchronous stochastic approximation and Q-learning[J].Machine Learning,1994,16(3):185–202.
    [27]WATKINS C J C H,DAYAN P.Q-learning[J].Machine Learning,1992,8(3/4):279–292.
    [28]SINGH S,JAAKKOLA T,LITTMAN M L,et al.Convergence results for single-step on-policy reinforcement-learning algorithms[J].Machine Learning,2000,38(3):287–308.
    [29]SUTTON R S.Learning to predict by the methods of temporal differences[J].Machine Learning,1988,3(1):9–44.
    [30]DEGRIS T,PILARSKI P M,SUTTON R S.Model-free reinforcement learning with continuous action in practice[C]//Proceedings of the American Control Conference.Montreal:IEEE,2012:2177–2182.
    [31]WILLIAMS R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning[J].Machine Learning,1992,8(3/4):229–256.
    [32]SUTTON R S,MCALLESTER D A,SINGH S P,et al.Policy gradient methods for reinforcement learning with function approximation[C]//Advances in Neural Information Processing Systems.Denver:MIT Press,1999,99:1057–1063.
    [33]LIU D R,WEI Q L.Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems[J].IEEE Transactions on Neural Networks and Learning Systems,2014,25(3):621–634.
    [34]ZHANG H G,LIU D R,LUO Y H,et al.Adaptive Dynamic Programming for Control:Algorithms and Stability[M].New York:Springer,2012.
    [35]ZHAO D B,XIA Z P,WANG D.Model-free optimal control for affine nonlinear systems with convergence analysis[J].IEEE Transactions on Automation Science and Engineering,2015,12(4):1461–1468.
    [36]ZHAO D B,ZHU Y H.MEC―a near-optimal online reinforcement learning algorithm for continuous deterministic systems[J].IEEE Transactions on Neural Networks and Learning Systems,2015,26(2):346–356.
    [37]ZHU Y H,ZHAO D B,LI X J.Using reinforcement learning techniques to solve continuous-time non-linear optimal tracking problem without system dynamics[J].IET Control Theory&Applications,2016,DOI:10.1049/iet-cta.2015.0769.
    [38]JIANG Y,JIANG Z P.Robust adaptive dynamic programming and feedback stabilization of nonlinear systems[J].IEEE Transactions on Neural Networks and Learning Systems,2014,25(5):882–893.
    [39]WU H N,LUO B.Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear control[J].IEEE Transactions on Neural Networks and Learning Systems,2012,23(12):1884–1895.
    [40]ZHAO D B,ZHANG Q C,WANG D,et al.Experience replay for optimal control of nonzero-sum game systems with unknown dynamics[J].IEEE Transactions on Cybernetics,2016,46(3):854–865.
    [41]WU Jun,XU Xin,WANG Jian,et al.Recent advances of reinforcement learning in multi-robot systems:a survey[J].Control and Decision,2011,26(11):1601–1610.(吴军,徐昕,王健,等.面向多机器人系统的增强学习研究进展综述[J].控制与决策,2011,26(11):1601–1610.)
    [42]MATARIC M J.Reinforcement learning in the multi-robot domain[M]//Robot Colonies.New York:Springer,1997:73–83.
    [43]ZHAO D B,ZHANG Z,DAI Y J.Self-teaching adaptive dynamic programming for Gomoku[J].Neurocomputing,2012,78(1):23–29.
    [44]ZHAO D B,WANG B,LIU D R.A supervised actor–critic approach for adaptive cruise control[J].Soft Computing,2013,17(11):2089–2099.
    [45]KAKADE S.A natural policy gradient[C]//Advances in Neural Information Processing Systems.Vancouver:MIT Press,2001,14:1531–1538.
    [46]TSITSIKLIS J N,VAN ROY B.An analysis of temporal-difference learning with function approximation[J].IEEE Transactions on Automatic Control,1997,42(5):674–690.
    [47]TSITSIKLIS J N,VAN ROY B.Average cost temporal-difference learning[J].Automatica,1999,35(11):1799–1808.
    [48]BHATNAGAR S,PRECUP D,SILVER D,et al.Convergent temporal-difference learning with arbitrary smooth function approximation[C]//Advances in Neural Information Processing Systems.Vancouver:MIT Press,2009:1204–1212.
    [49]SUTTON R S,MAEI H R,PRECUP D,et al.Fast gradient-descent methods for temporal-difference learning with linear function approximation[C]//Proceedings of the 26th Annual International Conference on Machine Learning.Montreal:ACM,2009:993–1000.
    [50]MELO F S,LOPES M.Fitted natural actor-critic:a new algorithm for continuous state-action MDPs[M]//Machine Learning and Knowledge Discovery in Databases.Berlin Heidelberg:Springer,2008:66–81.
    [51]BRAFMAN R I,TENNENHOLTZ M.R-MAX--a general polynomial time algorithm for near-optimal reinforcement learning[J].The Journal of Machine Learning Research,2003,3(10):213–231.
    [52]GAO Yang,CHEN Shifu,LU Xin.Research on reinforcement learning technology:a review[J].Acta Automatica Sinica,2004,30(1):86–100.(高阳,陈世福,陆鑫.强化学习研究综述[J].自动化学报,2004,30(1):86–100.)
    [53]BERNSTEIN A,SHIMKIN N.Adaptive-resolution reinforcement learning with efficient exploration in deterministic domains[J].Machine Learning,2010,81(3):359–397.
    [54]LI L H,CHU W,LANGFORD J,et al.Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms[C]//Proceedings of the Fourth ACM International Conference on Web Search and Data Mining.Hong Kong:ACM,2011:297–306.
    [55]NOURI A,LITTMAN M L,LI L H,et al.A novel benchmark methodology and data repository for real-life reinforcement learning[C]//Proceedings of the 26th International Conference on Machine Learning.Montreal:ACM,2009.
    [56]LOFTIN R,MACGLASHAN J,PENG B,et al.A strategy-aware technique for learning behaviors from discrete human feedback[C]//Proceedings of the Association for the Advancement of Artificial Intelligence.Québec City:AAAI,2014:937–943.
    [57]THOMAZ A L,BREZEAL C.Teachable robots:understanding human teaching behavior to build more effective robot learners[J].Artificial Intelligence,2008,172(6):716–737.
    [58]NIV Y.Neuroscience:Dopamine ramps up[J].Nature,2013,500(7464):533–535.
    [59]CUSHMAN F.Action,outcome,and value a dual-system framework for morality[J].Personality and Social Psychology Review,2013,17(3):273–292.
    [60]HINTON G E,OSINDERO S,TEH Y W.A fast learning algorithm for deep belief nets[J].Neural Computation,2006,18(7):1527–1554.
    [61]ABDEL-HAMID O,MOHAMED A,JIANG H,et al.Convolutional neural networks for speech recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2014,22(10):1533–1545.
    [62]CARLSON B A,CLEMENTS M A.A projection-based likelihood measure for speech recognition in noise[J].IEEE Transactions on Speech and Audio Processing,1994,2(1):97–102.
    [63]OUYANG W,ZENG X,WANG X.Learning mutual visibility relationship for pedestrian detection with a deep model[J].International Journal of Computer Vision,2016,DOI:10.1007/s11263-016-0890-9.
    [64]DAHL G E,YU D,DENG L,et al.Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition[J].IEEE Transactions on Audio,Speech,and Language Processing,2012,20(1):30–42.
    [65]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems.Lake Tahoe:MIT Press,2012:1097–1105.
    [66]LE Q V.Building high-level features using large scale unsupervised learning[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing.Vancouver:IEEE,2013:8595–8598.
    [67]GRAVES A,MOHAMED A,HINTON G.Speech recognition with deep recurrent neural networks[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing.Vancouver:IEEE,2013:6645–6649.
    [68]XU K,BA J,KIROS R,et al.Show,attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning.Lille:ACM,2015:2048–2057.
    [69]PINHEIRO P,COLLOBERT R.Recurrent convolutional neural networks for scene labeling[C]//Proceedings of the 31st International Conference on Machine Learning.Beijing:ACM,2014:82–90.
    [70]HE K M,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE,2016.
    [71]LECUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436–444.
    [72]VINCENT P,LAROCHELLE H,BENGIO Y,et al.Extracting and composing robust features with denoising autoencoders[C]//Proceedings of the 25th International Conference on Machine Learning.Helsinki:ACM,2008:1096–1103.
    [73]RIFAI S,VINCENT P,MULLER X,et al.Contractive autoencoders:Explicit invariance during feature extraction[C]//Proceedings of the 28th International Conference on Machine Learning.Bellevue:ACM,2011:833–840.
    [74]LECUN Y,BOTTOU L,BENGIO Y,et al.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE,1998,86(11):2278–2324.
    [75]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL]//arXiv preprint.2015.arXiv:1409.1556[cs.CV].
    [76]TIELEMAN T.Training restricted Boltzmann machines using approximations to the likelihood gradient[C]//Proceedings of the 25th International Conference on Machine Learning.Helsinki:ACM,2008:1064–1071.
    [77]TIELEMAN T,HINTON G.Using fast weights to improve persistent contrastive divergence[C]//Proceedings of the 26th Annual International Conference on Machine Learning.Montreal:ACM,2009:1033–1040.
    [78]MOHAMED A,DAHL G E,HINTON G.Acoustic modeling using deep belief networks[J].IEEE Transactions on Audio,Speech,and Language Processing,2012,20(1):14–22.
    [79]FENG X,ZHANG Y,GLASS J.Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing.Florence:IEEE,2014:1759–1763.
    [80]VINCENT P,LAROCHELLE H,LAJOIE I,et al.Stacked denoising autoencoders:learning useful representations in a deep network with a local denoising criterion[J].The Journal of Machine Learning Research,2010,11(11):3371–3408.
    [81]BOULANGER-LEWANDOWSKI N,BENGIO Y,VINCENT P.Modeling temporal dependencies in high-dimensional sequences:application to polyphonic music generation and transcription[C]//Proceedings of the 29th International Conference on Machine Learning.Edinburgh:ACM,2012:1159–1166.
    [82]SCHUSTER M,PALIWAL K K.Bidirectional recurrent neural networks[J].IEEE Transactions on Signal Processing,1997,45(11):2673–2681.
    [83]CHO K,VAN MERRIENBOER B,GULCEHRE C,et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Doha:ACL,2014:1724–1734.
    [84]KOEHN P,HOANG H,BIRCH A,et al.Moses:open source toolkit for statistical machine translation[C]//Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions.Stroudsburg:ACM,2007:177–180.
    [85]GOODFELLOW I J,SHLENS J,SZEGEDY C.Explaining and harnessing adversarial examples[EB/OL]//arXiv preprint.2015.arXiv:1412.6572v3[stat.ML].
    [86]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with deep reinforcement learning[C]//Proceedings of the NIPS Workshop on Deep Learning.Lake Tahoe:MIT Press,2013.
    [87]SHIBATA K,IIDA M.Acquisition of box pushing by direct-vision-based reinforcement learning[C]//Proceedings of the SICE Annual Conference.Nagoya:IEEE,2003,3:2322–2327.
    [88]SHIBATA K,OKABE Y.Reinforcement learning when visual sensory signals are directly given as inputs[C]//Proceedings of the International Conference on Neural Networks.Houston:IEEE,1997,3:1716–1720.
    [89]LANGE S,RIEDMILLER M.Deep auto-encoder neural networks in reinforcement learning[C]//Proceedings of the International Joint Conference on Neural Networks.Barcelona:IEEE,2010:1–8.
    [90]ABTAHI F,ZHU Z,BURRY A M.A deep reinforcement learning approach to character segmentation of license plate images[C]//Proceedings of the 14th IAPR International Conference on Machine Vision Applications.Tokyo:IEEE,2015:539–542.
    [91]LANGE S,RIEDMILLER M,VOIGTLANDER A.Autonomous reinforcement learning on raw visual input data in a real world application[C]//Proceedings of the International Joint Conference on Neural Networks.Brisbane:IEEE,2012:1–8.
    [92]WYMANN B,ESPI E,GUIONNEAU C,et al.TORCS,The open racing car simulator[EB/OL].2014,http://torcs.sourceforge.net.
    [93]KOUTNIK J,SCHMIDHUBER J,GOMEZ F.Online evolution of deep convolutional network for vision-based reinforcement learning[M]//From Animals to Animats 13.New York:Springer,2014:260–269.
    [94]LIN L J.Reinforcement learning for robots using neural networks[D].Pittsburgh:Carnegie Mellon University,1993.
    [95]SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized experience replay[C]//Proceedings of the International Conference on Learning Representations.San Juan:ACM,IEEE,2016.
    [96]NAIR A,SRINIVASAN P,BLACKWELL S,et al.Massively parallel methods for deep reinforcement learning[C]//Proceedings of the ICML Workshop on Deep Learning.Lille:ACM,2015.
    [97]GUO X,SINGH S,LEE H,et al.Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning[C]//Advances in Neural Information Processing Systems.Montreal:MIT Press,2014:3338–3346.
    [98]VAN HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence.Phoenix:AAAI,2016:1813–1819.
    [99]WANG Z,FREITAS N,LANCTOT M.Dueling network architectures for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning.New York:ACM,2016.
    [100]OSBAND I,BLUNDELL C,PRITZEL A,et al.Deep exploration via bootstrapped DQN[EB/OL]//arXiv preprint.2016.arXiv:1602.04621.
    [101]MNIH V,BADIA A P,MIRZA M,et al.Asynchronous methods for deep reinforcement learning[EB/OL]//arXiv preprint.2016.arXiv:1602.01783[cs.LG].
    [102]CUCCU G,LUCIW M,SCHMIDHUBER J,et al.Intrinsically motivated neuroevolution for vision-based reinforcement learning[C]//Proceedings of the IEEE International Conference on Development and Learning.Trondheim:IEEE,2011,2:1–7.
    [103]NARASIMHAN K,KULKARNI T,BARZILAY R.Language understanding for text-based games using deep reinforcement learning[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Lisbon:ACL,2015.
    [104]HAUSKNECHT M,STONE P.Deep recurrent Q-learning for partially observable MDPs[C]//Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents.Arlington:AAAI,2015.
    [105]SOROKIN I,SELEZNEV A,PAVLOV M,et al.Deep attention recurrent Q-network[C]//Proceedings of the NIPS Workshop on Deep Reinforcement Learning.Montreal:MIT Press,2015.
    [106]CAI X,WUNSCH II D C.Computer Go:a grand challenge to AI[M]//Challenges for Computational Intelligence.Berlin Heidelberg:Springer,2007:443–465.
    [107]TIAN Y D,ZHU Y.Better computer Go player with neural network and long-term prediction[EB/OL]//arXiv preprint.2016.arXiv:1511.06410v3[cs.LG].
    [108]TIAN Yuandong.A simple analysis of Alpha Go[J].Acta Automatica Sinica,2016,42(5):671–675.(田渊栋.阿法狗围棋系统的简要分析[J].自动化学报,2016,42(5):671–675.)
    [109]HUANG Shijie.The strategies for Ko fight of computer Go[D].Taiwan:National Taiwan Normal University,2002:1–57.(黄士杰.电脑围棋打劫的策略[D].台湾:台湾师范大学资讯工程研究所,2002:1–57.)
    [110]GUO Xiaoxiao,LI Cheng,MEI Qiaozhu.Deep learning applied to games[J].Acta Automatica Sinica,2016,42(5):676–684.(郭潇逍,李程,梅俏竹.深度学习在游戏中的应用[J].自动化学报,2016,42(5):676–684.)
    [111]KOLLER D,MILCH B.Multi-agent influence diagrams for representing and solving games[J].Games and Economic Behavior,2003,45(1):181–221.
    [112]WOOLDRIDGE M.An Introduction to Multiagent Systems[M].New York:John Wiley&Sons,2009.
    [113]FOERSTER J N,ASSAEL Y M,FREITAS N,et al.Learning to communicate to solve riddles with deep distributed recurrent Q-networks[EB/OL]//arXiv preprint.2016.arXiv:1602.02672.
    [114]GU S,LILLICRAP T,SUTSKEVER I,et al.Continuous deep Q-learning with model-based acceleration[EB/OL]//arXiv preprint.2016.arXiv:1603.00748.
    [115]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[EB/OL]//arXiv preprint.2016.arXiv:1509.02971v5[cs.LG].
    [116]ZHAO D B,ZHU Y H,LV L,et al.Convolutional fitted Q iteration for vision-based control problems[C]//Proceedings of the International Joint Conference on Neural Networks.Vancouver:IEEE,2016.
    [117]PAN S J,YANG Q.A survey on transfer learning[J].IEEE Transactions on Knowledge and Data Engineering,2010,22(10):1345–1359.
    [118]PARISOTTO E,BA J L,SALAKHUTDINOV R.Actor-mimic:deep multitask and transfer reinforcement learning[C]//Proceedings of the International Joint Conference on Neural Networks.Vancouver:IEEE,2016.
    [119]LEVINE S,WAGENER N,ABBEEL P.Learning contact-rich manipulation skills with guided policy search[C]//Proceedings of the IEEE International Conference on Robotics and Automation.Seattle:IEEE,2015:156–163.
    [120]LEVINE S,PASTOR P,KRIZHEVSKY A,et al.Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection[EB/OL]//arXiv preprint.2016.arXiv:1603.02199.
    (1) 初弈号: the computer Go program developed by the Google DeepMind team. Many Chinese renderings of its name circulate, such as "阿尔法围棋", "阿尔法狗", and the nicknames "狗狗" and "阿发哥". This paper renders it as "初弈号", capturing its three characteristics of being at an early stage (初), playing Go (弈), and being a machine (号); the rendering keeps the plain flavor of the original English name while also conveying confidence and a determination to keep improving.
    (2)https://drive.google.com/file/d/0BxKBnD5y2M8NVHRiVXBnOVpiYUk/view
    (3)http://www.kddchina.org/#/Content/alphago
    (4)http://www.kddchina.org/#/Content/alphago
    (5)http://36kr.com/p/5044469.html
    (6)https://deepmind.com/health.html
    (7)http://www.osaro.com/
