Research on Agent Decision Problems Based on Markov Decision Process Theory
Abstract
The goal of Artificial Intelligence is widely held to be the construction of agents that can decide on intelligent behavior, that is, agents able to reproduce intelligent human behavior in many respects. The Markov Decision Process (MDP) can be used to describe and solve agent decision problems in large-scale environments with uncertainty.
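     As background, the MDP model assumed here is the standard one: a tuple $(S, A, T, R, \gamma)$, where $S$ is the state set, $A$ the action set, $T(s, a, s') = \Pr(s' \mid s, a)$ the transition function, $R(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor. Its optimal value function satisfies the Bellman optimality equation

         V^*(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \, V^*(s') \Big],

     and an optimal policy acts greedily with respect to $V^*$.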
     RoboCup is an international competition and scientific initiative that promotes research in distributed Artificial Intelligence, intelligent robotics, and related fields. The 2D soccer simulation league is the branch of RoboCup that focuses on agent decision problems.
     This dissertation studies agent decision problems on the basis of MDP theory, using RoboCup 2D soccer simulation as its experimental platform. The three main contributions are as follows:
     We design and implement WE2009, a complete 2D soccer simulation team system. WE2009 takes Partially Observable Stochastic Games (POSG) as its theoretical model and consists of three modules: information processing, high-level decision making, and behavior execution. The high-level decision module adopts a structure built on independent behavior generators, which not only makes full use of the agent's decision time but also improves the efficiency of teamwork.
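     As an illustration of this structure (all names below are hypothetical; the actual WE2009 code is not shown in this abstract), the per-cycle decision loop can be sketched as follows, with each generator independently proposing scored candidate behaviors:

        # Illustrative sketch of a three-module agent cycle with independent
        # behavior generators; all names are hypothetical, not WE2009's API.
        from dataclasses import dataclass

        @dataclass
        class Behavior:
            name: str      # e.g. "pass", "dribble", "shoot"
            value: float   # the generator's evaluation of this candidate
            command: str   # low-level command for the soccer server

        class Agent:
            def __init__(self, parser, generators, executor):
                self.parser = parser          # information processing
                self.generators = generators  # independent behavior generators
                self.executor = executor      # behavior execution

            def run_cycle(self, raw_message):
                world = self.parser.parse(raw_message)
                # Generators are independent of one another, so any remaining
                # decision time can be spent running more of them.
                candidates = []
                for gen in self.generators:
                    candidates.extend(gen.propose(world))
                best = max(candidates, key=lambda b: b.value)
                self.executor.execute(best.command)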
     We propose a special class of MDPs, the Action-Driven Markov Decision Process (ADMDP). We analyze the theoretical model of ADMDP and propose a solution method for it that combines offline value iteration with online search. The method is applied to the proximal dribble problem (dribbling while keeping the ball close to the player) in 2D soccer simulation, and empirical results show that it improves the agent's dribbling performance considerably over our team's previous algorithm.
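     A minimal sketch of this offline/online combination on a generic finite MDP (the actual dribble model is not specified in this abstract, and the online search is shown in its simplest one-step-lookahead form):

        def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
            """Offline phase: apply the Bellman optimality operator until
            convergence. T[s][a] is a list of (prob, next_state) pairs;
            R[s][a] is the immediate reward."""
            V = {s: 0.0 for s in states}
            while True:
                delta = 0.0
                for s in states:
                    best = max(
                        R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                        for a in actions
                    )
                    delta = max(delta, abs(best - V[s]))
                    V[s] = best
                if delta < eps:
                    return V

        def online_action(s, actions, T, R, V, gamma=0.95):
            """Online phase: a one-step lookahead search against the
            precomputed value function picks the action to execute now."""
            return max(
                actions,
                key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]),
            )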
     We propose a special class of Markov games, the Formation-based Zero-Sum Markov Game (FZSMG). We analyze the theoretical model of FZSMG and use it to describe the anti-mark problem in 2D soccer simulation. To solve this problem, we propose a heuristic method based on formation change, which performs well in matches against opponents that rely on man-marking defense.
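     For context, FZSMG specializes the standard two-player zero-sum Markov game, whose state values satisfy Shapley's minimax recursion

         V(s) = \max_{\pi \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} \pi(a_1) \Big[ R(s, a_1, a_2) + \gamma \sum_{s'} T(s, a_1, a_2, s') \, V(s') \Big],

     where $\Delta(A_1)$ is the set of mixed strategies over the protagonist's actions; the formation-change heuristic above serves as a tractable, domain-specific alternative to solving this recursion exactly.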
     All of the above work is implemented in the WE2009 2D soccer simulation team, which competed at RoboCup 2009 and RoboCup China Open 2009 and won the championship at both.
