Research on RoboCup Local Strategies Based on Multi-Agent Reinforcement Learning
Abstract
Reinforcement learning is an important method in artificial intelligence for solving learning-control problems. However, when classical reinforcement learning algorithms are applied to RoboCup local-strategy training, they still converge slowly and cannot effectively handle environmental uncertainty, multi-agent cooperation and communication, or the multi-goal nature of the training tasks. Focusing on two of these problems, slow convergence and the multi-goal nature of local-strategy training, this thesis proposes corresponding improvements. The main research content covers the following four aspects:
     (1) To address the slow convergence and the tendency to fall into local optima of the cumulative immediate-reward form, a non-cumulative immediate-reward form is proposed and combined with classical reinforcement learning, yielding a reinforcement learning method based on non-cumulative immediate rewards. The method is applied to one-on-one shooting training in robot soccer; the experimental results show that, on this problem, the non-cumulative form outperforms the cumulative form in both convergence speed and training effect.
     (2) To address the inherently slow convergence of average-reward reinforcement learning, an improved reinforcement learning algorithm is proposed. To handle the large state space that arises during training and to improve generalization, the algorithm uses a BP neural network as its function approximator. The method is applied to Keepaway local training; the results show that the algorithm converges quickly and generalizes well.
     (3) For multi-goal reinforcement learning, an algorithm based on the maximum set expected loss, the LRGM algorithm, is proposed. It estimates the maximum set expected loss of each sub-goal and, while balancing the sub-goals, selects the best joint action to produce an optimal joint policy.
     (4) To address the non-convergence of reinforcement learning combined with nonlinear function approximation, a Sarsa(λ) algorithm based on an improved MSBR error function is proposed; its convergence is proved, and the action-selection probability function and the step-size parameter are optimized. Combined with the multi-goal reinforcement learning algorithm LRGM, it is applied to RoboCup 2 vs. 2 shooting local-strategy training, where it performs well; the experimental results demonstrate the effectiveness of the learning algorithm.
Reinforcement learning has become a central paradigm for solving learning-control problems in artificial intelligence. Traditional reinforcement learning suffers from slow convergence and cannot be used effectively in settings involving uncertain environments, multiple agents, and multiple goals, and RoboCup training exhibits all of these problems. To address the slow convergence and the multi-goal nature of the training, several improved algorithms are proposed in this paper.
     The main research contents are summarized as follows:
     ⅰ. The expected cumulative reward is not suitable for every application: it converges slowly because low rewards accumulate into the value estimates, and the effect of a sub-optimal policy takes time to fade away. To solve these problems, a non-cumulative reward is proposed in this paper, together with a reinforcement learning model based on it. The algorithm is applied to the shooting training of RoboCup. The experimental results show that the proposed algorithm has clear advantages over reinforcement learning methods that use the expected cumulative reward.
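     As a rough illustration only (not the thesis's exact formulation), the sketch below contrasts the classic cumulative-reward Q-learning target with one possible non-cumulative target, assuming a tabular Q-function stored as a NumPy array; all function names here are illustrative:

import numpy as np

def q_update_cumulative(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Classic Q-learning target: immediate reward plus the discounted sum of future rewards.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def q_update_non_cumulative(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Illustrative non-cumulative target: track the best single (discounted) immediate
    # reward reachable from (s, a) instead of the accumulated sum, so low intermediate
    # rewards do not dilute the value of eventually scoring.
    target = max(r, gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])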
     ⅱ. R-learning suffers from slow convergence and sensitivity to its parameters. To speed up convergence, an improved R-learning algorithm is proposed. The algorithm uses a BP neural network as the function approximator to generalize over the state space. The experimental results on Keepaway show that the proposed algorithm converges faster and generalizes well.
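     For reference, this is a minimal tabular sketch of the standard R-learning update (Schwartz's average-reward algorithm) that the thesis improves; the thesis additionally replaces the table with a BP neural network approximator, which is omitted here:

import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    # R(s, a) estimates the average-adjusted value; rho estimates the average reward per step.
    R[s, a] += alpha * (r - rho + np.max(R[s_next]) - R[s, a])
    # The average-reward estimate rho is adjusted only after greedy (non-exploratory) actions.
    if a == np.argmax(R[s]):
        rho += beta * (r - rho + np.max(R[s_next]) - np.max(R[s]))
    return rho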
     ⅲ. To solve the multiple-goal problem of RoboCup, a novel multiple-goal reinforcement learning algorithm, LRGM, is proposed. The algorithm estimates the greatest-mass expected loss of the sub-goals and trades off their long-term rewards to obtain a composite policy.
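     The details of LRGM are specific to the thesis; purely as an illustration of balancing sub-goals, the sketch below keeps one Q-table per sub-goal and picks the action whose worst shortfall across sub-goals (a simple stand-in for the greatest-mass expected loss) is smallest:

import numpy as np

def select_balanced_action(Q_goals, s):
    # Q_goals: list of per-sub-goal Q-tables, each of shape (n_states, n_actions).
    # For each action, measure how far it falls short of each sub-goal's best
    # achievable value in state s, then minimise the worst shortfall over sub-goals.
    losses = np.stack([np.max(Q[s]) - Q[s] for Q in Q_goals])  # (n_goals, n_actions)
    worst_case = np.max(losses, axis=0)                        # (n_actions,)
    return int(np.argmin(worst_case))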
     ⅳ. A B error function for the single learning module, based on the MSBR error function, is proposed in this paper. The B error function guarantees the convergence of value prediction with nonlinear function approximation. The action-selection probability function and the step-size parameter α are also improved with respect to the B error function. The experimental results on 2 vs. 2 shooting show that LRGM-Sarsa(λ) is more stable and converges faster.
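     For context, here is a minimal tabular Sarsa(λ) step with accumulating eligibility traces; the thesis's LRGM-Sarsa(λ) instead uses a nonlinear (neural network) approximator trained against the improved MSBR-style error, which is not reproduced here:

import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.8):
    # Q and E are NumPy arrays of shape (n_states, n_actions); E holds eligibility traces.
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # on-policy (Sarsa) TD error
    E[s, a] += 1.0                                   # accumulate trace for the visited pair
    Q += alpha * delta * E                           # propagate the TD error along the traces
    E *= gamma * lam                                 # decay all traces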
