Research on a Model-Based Dynamic Hierarchical Reinforcement Learning Algorithm
Abstract
Reinforcement learning (RL) has become an important branch of machine learning owing to its capacity for self-learning and online learning. However, when an agent learns in a large-scale, high-dimensional decision environment, it is hampered by the "curse of dimensionality" (the number of parameters to be learned grows exponentially with the dimensionality of the variables): learning becomes so inefficient that the agent may finish its task late or not at all. An efficient RL method that mitigates the curse of dimensionality in unknown, large-scale, complex environments would therefore offer an effective way to improve the adaptability of agents in practical applications, and would be of real significance to the development of machine learning theory and technology.
     To mitigate the curse of dimensionality in unknown large-scale environments and to improve learning efficiency, this thesis combines dynamic hierarchical techniques with model-based self-learning and proposes a dynamic hierarchical reinforcement learning algorithm with adaptive clustering based on exploration information (DHRL-ACEI). Within a model-based RL process, the algorithm dynamically generates a MAXQ hierarchy that integrates state abstraction and temporal abstraction (also called action abstraction), and thereby accelerates learning markedly by restricting the policy search space of every subtask in the hierarchy.
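     For background, the value-function decomposition on which a MAXQ hierarchy rests (the standard formulation due to Dietterich, not anything specific to DHRL-ACEI) can be written as

$$Q(i, s, a) = V(a, s) + C(i, s, a)$$

$$V(i, s) = \begin{cases} \max_{a'} Q(i, s, a') & \text{if } i \text{ is a composite subtask} \\ \bar{R}(s, i) & \text{if } i \text{ is a primitive action} \end{cases}$$

where C(i, s, a) is the completion function, i.e. the expected cumulative discounted reward for finishing subtask i after its child a terminates, and \bar{R}(s, i) is the expected immediate reward of executing primitive action i in state s. Restricting each subtask to its own state region and set of valid actions is what shrinks the policy search space.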
     First, during model-based RL the whole state space is partitioned into several state subspaces (regions) by an adaptive clustering algorithm driven by exploration information; this realizes the automatic decomposition of the task through state abstraction, and an improved action-selection strategy is presented based on the sets of terminal states of these regions. Second, temporal abstraction is performed according to how frequently each action is executed effectively, which automatically yields a MAXQ-like hierarchy, and each region is then assigned to the corresponding MAXQ subtask according to its set of valid actions, so that a MAXQ hierarchy combining state abstraction and temporal abstraction is generated automatically. Third, the recursively optimal policy of the task is derived within this MAXQ framework, and the hierarchy is adjusted dynamically during subsequent learning so as to reduce the impact of an unreasonable initial decomposition.
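     As a way to picture the region-to-subtask assignment step, the following Python sketch groups state regions by their sets of valid actions, estimated from execution statistics recorded by the learned model. The data structures and the threshold are hypothetical placeholders used for illustration only; they are not taken from the thesis.

# Illustrative sketch only: grouping state regions into candidate MAXQ
# subtasks by their "valid" actions, i.e. actions whose observed
# effective-execution frequency within the region exceeds a threshold.
# attempt_counts, success_counts and VALID_THRESHOLD are hypothetical
# stand-ins for whatever statistics the learned model actually keeps.
from collections import defaultdict

VALID_THRESHOLD = 0.2  # assumed cutoff on effective-execution frequency

def valid_actions(region, attempt_counts, success_counts):
    """Return the frozenset of actions executed effectively often enough
    over all states of the given region."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for s in region:
        for a, n in attempt_counts.get(s, {}).items():
            attempts[a] += n
            successes[a] += success_counts.get(s, {}).get(a, 0)
    return frozenset(a for a, n in attempts.items()
                     if n > 0 and successes[a] / n >= VALID_THRESHOLD)

def group_regions_into_subtasks(regions, attempt_counts, success_counts):
    """Map each distinct valid-action set to the regions that share it;
    each (valid-action set, regions) pair is a candidate MAXQ subtask."""
    subtasks = defaultdict(list)
    for region in regions:
        key = valid_actions(region, attempt_counts, success_counts)
        subtasks[key].append(region)
    return dict(subtasks)

if __name__ == "__main__":
    # Toy data: in one region only "up" ever succeeds, in the other only "pick".
    regions = [["s0", "s1"], ["s2", "s3"]]
    attempts = {"s0": {"up": 5, "pick": 5}, "s1": {"up": 4, "pick": 3},
                "s2": {"up": 6, "pick": 4}, "s3": {"up": 2, "pick": 6}}
    successes = {"s0": {"up": 5, "pick": 0}, "s1": {"up": 4, "pick": 0},
                 "s2": {"up": 0, "pick": 4}, "s3": {"up": 0, "pick": 5}}
    for acts, regs in group_regions_into_subtasks(regions, attempts, successes).items():
        print(sorted(acts), "->", regs)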
     Simulation results show that the proposed DHRL-ACEI algorithm significantly improves the learning efficiency of an agent in unknown large-scale environments and effectively alleviates the curse of dimensionality, which demonstrates the effectiveness of the algorithm. Finally, the thesis summarizes the work and raises several issues for further research.