Exponential moving average based multiagent reinforcement learning algorithms
Details
  • Authors: Mostafa D. Awheda; Howard M. Schwartz
  • Keywords: Multi-agent learning systems; Reinforcement learning; Markov decision processes; Nash equilibrium
  • Journal: Artificial Intelligence Review
  • Publication year: 2016
  • Publication date: March 2016
  • Volume: 45
  • Issue: 3
  • Pages: 299-332
  • Full text size: 2,130 KB
  • Author affiliations: Mostafa D. Awheda (1); Howard M. Schwartz (1)

    1. Department of Systems and Computer Engineering, Carleton University, 1125 Colonel By Drive, Ottawa, ON, K1S 5B6, Canada
  • Journal category: Computer Science
  • Journal subjects: Artificial Intelligence and Robotics; Computer Science, general; Complexity
  • Publisher: Springer Netherlands
  • ISSN: 1573-7462
Abstract
Two multi-agent policy iteration learning algorithms are proposed in this work. Both algorithms use the exponential moving average (EMA) approach together with the Q-learning algorithm as a basis to update the policy of the learning agent so that the agent's policy converges to a Nash equilibrium policy. The first proposed algorithm uses a constant learning rate when updating the policy of the learning agent, while the second uses two different decaying learning rates. These decaying learning rates are updated based on either the Win-or-Learn-Fast (WoLF) mechanism or the Win-or-Learn-Slow (WoLS) mechanism. The WoLS mechanism is introduced in this article to make the algorithm learn fast when it is winning and learn slowly when it is losing. The second proposed algorithm uses the rewards received by the learning agent to decide which mechanism (WoLF or WoLS) to apply to the game being learned. The proposed algorithms have been analyzed theoretically, and a mathematical proof of convergence to a pure Nash equilibrium is provided for each algorithm. For games with a mixed Nash equilibrium, our mathematical analysis shows that the second proposed algorithm converges to an equilibrium; although the analysis does not explicitly show that this equilibrium is a Nash equilibrium, our simulation results indicate that it is. The proposed algorithms are examined on a variety of matrix and stochastic games. Simulation results show that the second proposed algorithm converges in a wider variety of situations than state-of-the-art multi-agent reinforcement learning algorithms.
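To make the update scheme described in the abstract concrete, the sketch below shows a single-state (matrix-game) learner that combines a tabular Q-learning update with an EMA policy update and switches between a slow and a fast EMA rate depending on whether the agent is doing better or worse than its running average reward. The class name, hyper-parameter values, and the reward-based win/lose test are illustrative assumptions for this sketch; they are not the exact formulation from the paper, which defines the WoLF/WoLS tests in terms of the agent's policies and received rewards.

```python
import numpy as np

class EMAQLearner:
    """Single-state (matrix-game) learner: tabular Q-learning plus an
    exponential-moving-average (EMA) policy update. Hyper-parameters and the
    win/lose test are illustrative assumptions, not the paper's values."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, eta_win=0.01, eta_lose=0.04):
        self.n = n_actions
        self.alpha = alpha            # Q-learning step size
        self.gamma = gamma            # discount factor
        self.eta_win = eta_win        # EMA rate while "winning" (slow, WoLF-style)
        self.eta_lose = eta_lose      # EMA rate while "losing" (fast, WoLF-style)
        self.Q = np.zeros(n_actions)
        self.pi = np.full(n_actions, 1.0 / n_actions)   # mixed policy
        self.avg_reward = 0.0         # running reward average for the win/lose test

    def act(self, rng):
        """Sample an action from the current mixed policy."""
        return rng.choice(self.n, p=self.pi)

    def update(self, action, reward):
        # Standard Q-learning update (single-state form).
        self.Q[action] += self.alpha * (reward + self.gamma * self.Q.max() - self.Q[action])

        # EMA target: pull the policy toward the taken action if it is greedy
        # w.r.t. Q, otherwise spread probability mass over the other actions.
        if action == int(np.argmax(self.Q)):
            target = np.eye(self.n)[action]
        else:
            target = (1.0 - np.eye(self.n)[action]) / (self.n - 1)

        # Illustrative win/lose switch: "winning" means the latest reward is at
        # least the running average; learn slowly then, faster otherwise.
        self.avg_reward += 0.05 * (reward - self.avg_reward)
        eta = self.eta_win if reward >= self.avg_reward else self.eta_lose
        self.pi = (1.0 - eta) * self.pi + eta * target
        self.pi /= self.pi.sum()      # renormalize against numerical drift
```

Two such learners can be matched on a matrix game by repeatedly sampling an action from each with act(), computing the joint payoff, and feeding each agent its own reward through update(); the constant-rate variant in the paper corresponds to fixing a single eta instead of switching between two rates.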
