Research on Reinforcement Learning and Its Applications
Abstract
Reinforcement learning is an important machine learning method whose most distinctive feature is that an agent adjusts and improves its behavior by interacting with the environment and exploiting the rewards and punishments, i.e., the reinforcement signal, fed back by the environment, eventually obtaining an optimal policy. Because the method requires little prior knowledge of the environment and can learn online in real-time settings, it has attracted many researchers and has been widely applied in fields such as intelligent control and sequential decision making.
The fundamental task of reinforcement learning is to learn a mapping from the state space to the action space; in essence, a parameterized function is used to approximate this state-action mapping, which can be determined by a state value function or a state-action value function. Classical reinforcement learning methods are built on small-scale, discrete state and action spaces in which the value function is represented by a look-up table. To improve the performance of reinforcement learning in large-scale discrete state-action spaces and in continuous state or action spaces, researchers have introduced hierarchical learning and generalization techniques.
As for hierarchical techniques, the typical approaches are OPTION, HAM (including PHAM), and MAXQ. The key issue in hierarchical reinforcement learning is automatic task decomposition. The OPTION method is particularly suited to the automatic partitioning of sub-tasks by region or stage, and the granularity of its sub-tasks is easy to control, so it is the most widely used approach for task decomposition and automatic sub-task construction based on bottleneck states in the state space. As for generalization techniques, neural networks and fuzzy inference, both of which generalize well, are usually introduced into reinforcement learning. Q-learning is widely used because it is simple to implement and easy to understand. Existing methods that approximate the Q-value function with a neural network or fuzzy inference system all adopt indirect approximation: the network or fuzzy system takes the state as input and approximates the Q-values of only a few preselected discrete actions, and the output action is generated from these preselected "seed" actions. There is no prior knowledge to guide the selection of seed actions, yet this choice directly affects the learning performance of the reinforcement learning system. On the basis of an overview of the research background, related theory, and literature on reinforcement learning, this dissertation studies automatic OPTION construction based on bottleneck states in hierarchical reinforcement learning, as well as Q-value function approximation with neural networks and fuzzy inference systems.
Wheeled mobile robots are intelligent systems that can move autonomously in an environment and accomplish assigned tasks; they have broad application prospects in industry, agriculture, civilian use, and the military. Among the research topics and applications of wheeled mobile robots, navigation is the most fundamental and important problem. Because reinforcement learning offers strong online adaptability and the ability to learn complex systems by itself, it has received wide attention in robot navigation research. Taking wall-following navigation control of a wheeled mobile robot as its main subject, this dissertation studies reactive navigation of mobile robots based on reinforcement learning.
The main contents and contributions of the dissertation are as follows:
1. An automatic OPTION construction method based on taboo states is proposed. By introducing taboo states into the automatic hierarchical decomposition based on bottleneck states, the agent automatically constructs OPTIONs whose sub-goals are bottleneck states while interacting with the environment. Compared with related work, the main feature of this method is that it not only discovers bottleneck states in the environment automatically, but also finds the starting states of an OPTION and constructs its initiation set automatically, while learning the internal policy of the OPTION during the search. Simulation experiments in grid-world environments verify that the method can construct all three elements of an OPTION automatically.
2. To avoid selecting seed actions, the dissertation studies direct approximation of the action value function in Q-learning. Although an RBF network is relatively large, it combines global and local approximation and learns quickly, so the dissertation investigates direct approximation of the action value function with an RBF network and proposes the RBFQ reinforcement learning system, in which the network input is a state-action pair and the output is its Q-value. The network structure and parameters are adapted using the TD error and the distance between the current state-action pair and the basis functions, and optimization techniques are introduced into reinforcement learning so that greedy actions are found by function optimization. The classical cart-pole balancing simulation verifies the effectiveness of the RBFQ method.
3. Because a fuzzy inference system is a universal approximator and is also interpretable, which makes it convenient to embed existing experience and knowledge, the dissertation also studies direct approximation of the action value function with a fuzzy inference system and proposes the AFQL reinforcement learning system. Fuzzy rules are constructed automatically using the TD error and the distance between the current state-action pair and the fuzzy basis functions, and the premise and consequent parts of the rules are adapted online. As in RBFQ, the output action is obtained by function optimization. Cart-pole balancing simulations verify the effectiveness of the AFQL method.
4. The proposed AFQL method is applied to simulated wall-following navigation of an indoor mobile robot. The simulation results verify that the method enables a mobile robot to follow a wall in an unknown environment and further show that it has good learning efficiency and generalization performance.
Reinforcement learning (RL) is an important machine learning framework that obtains an optimal policy through interaction with the environment. The policy is updated according to rewards and punishments, namely the reinforcement signal, given by the environment. Reinforcement learning requires little prior knowledge about the environment and can learn online in real-time settings, so it has attracted many researchers and is widely used in intelligent control and sequential decision making.
The main aim of reinforcement learning is to learn the mapping from the state space to the action space. This mapping is determined by a value function, either a state value function or a state-action value function, which is in essence approximated by a parameterized function. Classical reinforcement learning considers only small-scale discrete state and action spaces, with the value function represented by a look-up table (LUT). To improve the performance of reinforcement learning in large-scale discrete spaces and in continuous state or action spaces, hierarchical learning and generalization methods have been introduced.
In terms of hierarchical learning, hierarchical reinforcement learning (HRL) frameworks such as Options, HAM, and MAXQ have been presented. The key issue in hierarchical reinforcement learning is to decompose the task automatically into appropriate sub-tasks. The Options framework is widely used because sub-tasks are easy to generate automatically, especially by partitioning the state space into regions or stages around bottleneck states. To generalize RL to continuous state or action spaces, generalization techniques such as neural networks and fuzzy inference systems are introduced. Q-learning is widely used because it is easy to understand and implement. In the related literature, a neural network or fuzzy inference system approximates the action value function indirectly: its inputs are the states, its outputs are the Q-values of several corresponding discrete actions, and the action applied to the environment is generated from these discrete "seed" actions. The choice of seed actions plays an important role in such methods: a bad choice may degrade learning performance, yet no prior knowledge is available to guide it. This dissertation first summarizes the background and theory of reinforcement learning, then focuses on automatic OPTION construction based on bottleneck states in the state space and on direct approximation of the action value function in continuous state and action spaces with neural networks and fuzzy inference systems.
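To make the contrast concrete, the following minimal Python sketch (an illustration under assumed names and constants, not code from the dissertation) shows the indirect scheme described above: the approximator takes only the state as input and outputs Q-values for a few preselected seed actions, so the applied action can never leave that set.

```python
import numpy as np

# Minimal sketch of the *indirect* scheme: the approximator takes only the
# state as input and outputs one Q-value per preselected discrete "seed"
# action.  All names and constants here are illustrative assumptions, not the
# dissertation's actual networks or parameters.

STATE_DIM = 4                                  # e.g. a cart-pole state
SEED_ACTIONS = np.array([-10.0, 0.0, 10.0])    # hand-picked discrete actions

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((16, STATE_DIM))    # fixed random feature projection
W = np.zeros((len(SEED_ACTIONS), 16))          # one weight row per seed action


def features(state):
    """Map a continuous state to a fixed-length feature vector."""
    return np.cos(PROJ @ np.asarray(state, dtype=float))


def q_values(state):
    """Q-values are produced only for the preselected seed actions."""
    return W @ features(state)


def q_update(state, a_idx, reward, next_state, done, alpha=0.1, gamma=0.95):
    """One-step Q-learning update restricted to the seed-action outputs."""
    phi = features(state)
    target = reward + (0.0 if done else gamma * np.max(q_values(next_state)))
    td_error = target - W[a_idx] @ phi
    W[a_idx] += alpha * td_error * phi
    return td_error
```

The direct schemes studied later (RBFQ and AFQL) instead feed the state-action pair into the approximator, so the action becomes a free continuous variable and no seed actions are needed.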
Wheeled mobile robots can move and work autonomously in their environments and have been widely used in areas such as industry, agriculture, daily life, and military affairs. Navigation is the most fundamental and important function of a wheeled mobile robot. Reinforcement learning is widely used in the navigation of wheeled mobile robots for its online adaptability and its ability to learn complex systems by itself. This dissertation focuses on reactive wall-following navigation of a wheeled mobile robot based on reinforcement learning.
The main contents and contributions of this dissertation include:
1. An automatic construction method for Options based on taboo states is presented. Taboo states are introduced so that the agent can construct Options automatically while interacting with the environment: the learning agent discovers bottleneck states automatically, chooses an appropriate bottleneck as the sub-goal of an Option, obtains the initiation set of the Option, and learns the Option's policy simultaneously. Several grid-world tasks illustrate that the agent can construct useful Options automatically online (a minimal illustrative sketch follows this list).
2. The RBFQ method is presented. Although a radial basis function (RBF) network can be large, it combines local and global approximation and learns and converges quickly. To avoid choosing seed actions, an RBF network is used to approximate the action value function directly. The structure and parameters of the RBF network are identified automatically and simultaneously in a self-organizing way, according to the TD error and the distance between the current state-action pair and the centres of the radial basis functions, and an optimization method is used to search for the greedy action. Experimental results on cart-pole balancing control demonstrate the applicability of the proposed method (see the RBFQ sketch after this list).
3. Q-learning based on a fuzzy inference system is presented. A fuzzy inference system (FIS) approximates the action value function directly; the number of rules grows adaptively, and the parameters of the premise and consequent parts are updated online. As in RBFQ, an optimization method is used to search for the greedy action. Experimental results on cart-pole balancing control demonstrate the applicability of the proposed method (see the fuzzy sketch after this list).
4. Navigation of mobile robots based on reinforcement learning is studied. Simulation results demonstrate that the proposed AFQL method, with good generalization and efficiency, can accomplish the wall-following task for a mobile robot (a reward-shaping sketch follows this list).
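For contribution 1, the following Python sketch illustrates one common way to construct an Option around a bottleneck state in a grid world: states that appear on many successful trajectories are treated as bottleneck candidates, and the initiation set and a crude internal policy are assembled from the trajectory prefixes. The taboo-state mechanism itself is only summarized in the abstract, so the trajectory format, the counting heuristic, and all thresholds here are assumptions.

```python
from collections import Counter

# Illustrative sketch of bottleneck-based Option construction in a grid world.
# Trajectories are lists of hashable states (e.g. grid cells); the visit-count
# heuristic below is one common bottleneck detector, not the dissertation's
# taboo-state method.

def find_bottlenecks(successful_trajectories, top_k=1):
    """States that occur on many successful trajectories are bottleneck candidates."""
    counts = Counter()
    for traj in successful_trajectories:
        for s in set(traj):                        # count each state once per trajectory
            counts[s] += 1
    return [s for s, _ in counts.most_common(top_k)]


def build_option(successful_trajectories, bottleneck):
    """Assemble the three Option elements: initiation set, internal policy, termination."""
    initiation_set, policy = set(), {}
    for traj in successful_trajectories:
        if bottleneck not in traj:
            continue
        cut = traj.index(bottleneck)
        initiation_set.update(traj[:cut])          # states from which the bottleneck was reached
        for s, s_next in zip(traj[:cut], traj[1:cut + 1]):
            policy.setdefault(s, s_next)           # crude policy: step toward the bottleneck
    terminate = lambda s: s == bottleneck          # terminate once the sub-goal is reached
    return initiation_set, policy, terminate
```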
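For contribution 2, the sketch below gives a minimal reading of direct action-value approximation with an RBF network in the spirit of RBFQ: the input is the state-action pair, a new basis function is allocated when the TD error is large and the input is far from the existing centres, and the greedy action is found by a crude search over the continuous action. Widths, thresholds, and the grid search are assumptions, not the dissertation's settings.

```python
import numpy as np

# Illustrative RBFQ-style approximator: Q(s, a) is a weighted sum of Gaussian
# basis functions over the joint (state, action) input.

class RBFQ:
    def __init__(self, width=0.5, err_thresh=0.5, dist_thresh=1.0,
                 alpha=0.05, gamma=0.95, action_grid=np.linspace(-10, 10, 41)):
        self.centres, self.weights = [], []
        self.width, self.err_thresh, self.dist_thresh = width, err_thresh, dist_thresh
        self.alpha, self.gamma, self.action_grid = alpha, gamma, action_grid

    def _phi(self, x):
        """Activations of all basis functions for a joint (state, action) input."""
        if not self.centres:
            return np.zeros(0)
        d = np.linalg.norm(np.asarray(self.centres) - x, axis=1)
        return np.exp(-(d / self.width) ** 2)

    def q(self, state, action):
        x = np.append(state, action)
        return float(np.dot(self.weights, self._phi(x))) if self.centres else 0.0

    def greedy_action(self, state):
        """Optimization over the continuous action, here by a simple grid search."""
        return max(self.action_grid, key=lambda a: self.q(state, a))

    def update(self, state, action, reward, next_state, done):
        x = np.append(state, action)
        target = reward + (0.0 if done else
                           self.gamma * self.q(next_state, self.greedy_action(next_state)))
        td_error = target - self.q(state, action)
        dist = (np.min(np.linalg.norm(np.asarray(self.centres) - x, axis=1))
                if self.centres else np.inf)
        if abs(td_error) > self.err_thresh and dist > self.dist_thresh:
            self.centres.append(x)                 # allocate a new basis function
            self.weights.append(td_error)          # rough initial weight
        else:
            phi = self._phi(x)                     # gradient step on existing weights
            self.weights = list(np.asarray(self.weights) + self.alpha * td_error * phi)
        return td_error
```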
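For contribution 3, the sketch below shows a zero-order TSK fuzzy approximator for Q(s, a) in the spirit of AFQL: each rule has Gaussian memberships over the joint state-action input and a scalar consequent, the output is the firing-strength-weighted average of the consequents, and the consequents are nudged by the TD error. The dissertation's rule-growth criterion and premise adaptation are omitted; all constants are assumptions.

```python
import numpy as np

# Illustrative zero-order TSK fuzzy approximator for Q(s, a).

class FuzzyQ:
    def __init__(self, sigma=0.6):
        self.centres, self.consequents, self.sigma = [], [], sigma

    def _firing(self, x):
        """Product of per-dimension Gaussian memberships, one value per rule."""
        c = np.asarray(self.centres)
        return np.exp(-np.sum(((c - x) / self.sigma) ** 2, axis=1))

    def add_rule(self, state, action, consequent=0.0):
        """Create a new rule centred on the current state-action pair."""
        self.centres.append(np.append(state, action))
        self.consequents.append(consequent)

    def q(self, state, action):
        """Firing-strength-weighted average of the rule consequents."""
        if not self.centres:
            return 0.0
        w = self._firing(np.append(state, action))
        return float(np.dot(w, self.consequents) / (np.sum(w) + 1e-9))

    def update_consequents(self, state, action, td_error, alpha=0.05):
        """TD-driven step on the consequents, weighted by normalized firing strength."""
        if not self.centres:
            return
        w = self._firing(np.append(state, action))
        w = w / (np.sum(w) + 1e-9)
        self.consequents = list(np.asarray(self.consequents) + alpha * td_error * w)
```

Because an RBF network and a TSK fuzzy system of this form are closely related, the growth and adaptation machinery of the RBFQ sketch carries over, with rule firing strengths taking the place of basis activations.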
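For contribution 4, the snippet below sketches one plausible reward shaping for the wall-following task: keep the side-facing range reading inside a band around the target distance while penalizing collisions. The sensor layout, the band, and the numeric rewards are assumptions for illustration only.

```python
# Hypothetical reward shaping for wall-following; not the dissertation's reward.

TARGET = 0.5      # desired distance to the wall (m), assumed
BAND = 0.1        # tolerance band around the target, assumed

def wall_following_reward(side_distance, front_distance, collided):
    """Reward keeping the side sensor inside the band while avoiding collisions."""
    if collided or front_distance < 0.2:
        return -1.0                                  # collision or imminent collision
    if abs(side_distance - TARGET) <= BAND:
        return 0.1                                   # inside the desired corridor
    return -0.02 * abs(side_distance - TARGET)       # mild penalty for drifting away
```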
