Research on Reinforcement Learning Based on Spectral Graph Theory
Abstract
As an effective class of methods for solving sequential decision-making problems, reinforcement learning suffers from the curse of dimensionality when applied to problems with large-scale or continuous state spaces. Overcoming the curse of dimensionality and improving algorithmic efficiency are the main challenges currently facing reinforcement learning. Spectral graph theory is a mathematical tool that can reveal the intrinsic topological structure of high-dimensional data spaces; in recent years it has been used widely and with great success in fields such as complex networks, image and vision, and manifold learning, so introducing it into reinforcement learning is of considerable research value.
     To improve the efficiency of reinforcement learning algorithms, this dissertation studies how spectral graph theory can be applied to reinforcement learning in three areas: hierarchical reinforcement learning, heuristic reinforcement learning based on manifold distance, and transfer learning for reinforcement learning. In hierarchical reinforcement learning, the dissertation draws on the theory and methods of multiway spectral clustering to propose a new method for computing subtask policies and two improved task decomposition methods. In heuristic reinforcement learning, for tasks defined by a goal location, a framework of heuristic reinforcement learning based on distance metric learning is established; within this framework, the computationally most efficient Laplacian eigenmap is applied to heuristic reward function design, heuristic policy selection, and heuristic Dyna planning, yielding three classes of heuristic reinforcement learning algorithms. In transfer reinforcement learning, to address the shortcomings of basis-function transfer methods based on spectral graph theory, a hybrid transfer method combining basis functions with subtask optimal policies is proposed. The main contributions of this dissertation are as follows:
     1. The Option approach to hierarchical reinforcement learning generally consists of two parts: task decomposition and subtask policy computation. For task decomposition, Option methods based on spectral graph partitioning commonly require the number of subtasks to be set by hand and have a limited range of application. This dissertation analyzes the causes of these drawbacks and, by introducing ideas from multiway spectral clustering together with the eigengap method, proposes two improved algorithms for automatic Option decomposition. For subtask policy computation, existing methods usually treat it as a new reinforcement learning problem; exploiting the fact that the Laplacian eigenmap preserves the local topological structure of the state space, the dissertation proposes a new policy computation method, the virtual value function method.
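     As a concrete illustration of the eigengap idea behind automatic Option decomposition, the sketch below builds the normalized Laplacian of a state-transition graph, picks the number of subtasks at the largest gap between consecutive eigenvalues, and clusters the spectral embedding with a plain k-means loop. This is a minimal sketch of the general multiway spectral clustering technique, not the dissertation's implementation; the adjacency matrix A and the cap k_max are assumed inputs.

    import numpy as np

    def normalized_laplacian(A):
        """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
        d = A.sum(axis=1).astype(float)
        d_inv_sqrt = np.zeros_like(d)
        d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
        return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    def estimate_num_subtasks(A, k_max=10):
        """Choose the number of subtasks k at the largest eigengap of the Laplacian spectrum."""
        eigvals = np.linalg.eigvalsh(normalized_laplacian(A))
        gaps = np.diff(eigvals[:k_max + 1])      # gaps between consecutive small eigenvalues
        return int(np.argmax(gaps)) + 1

    def spectral_partition(A, k, iters=50, seed=0):
        """Cluster states in the k-dimensional spectral embedding with plain k-means."""
        _, vecs = np.linalg.eigh(normalized_laplacian(A))
        X = vecs[:, :k]                          # k smoothest eigenvectors as coordinates
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels                            # each cluster is a candidate subtask region

    # Usage on a symmetric 0/1 adjacency matrix A of the observed state-transition graph:
    # k = estimate_num_subtasks(A); labels = spectral_partition(A, k)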
     2. In learning tasks defined by a goal location, a generalized distance is often used as the heuristic function in heuristic reward function design, heuristic action selection, and heuristic Dyna planning. How to define this generalized distance from the structure and properties of the task is the key to the success of such methods. For the case where the value function is discontinuous in Euclidean space but continuous on a manifold, this dissertation establishes a framework of heuristic reinforcement learning based on distance metric learning.
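     A minimal worked example of why the framework relies on manifold rather than Euclidean distance: in a two-room gridworld, two cells on opposite sides of the wall are close in Euclidean space but far apart along the state manifold, and it is the geodesic distance that tracks the optimal value function. The 5x5 layout below is a made-up illustration, not one of the dissertation's benchmark tasks.

    import numpy as np
    from collections import deque

    # 5x5 gridworld; a wall occupies column 2 except for a doorway at row 4 (assumed layout).
    H, W = 5, 5
    blocked = {(r, 2) for r in range(4)}

    def neighbors(s):
        r, c = s
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            t = (r + dr, c + dc)
            if 0 <= t[0] < H and 0 <= t[1] < W and t not in blocked:
                yield t

    def geodesic(src):
        """Shortest-path (manifold) distance from src to every reachable cell, by BFS."""
        dist, queue = {src: 0}, deque([src])
        while queue:
            s = queue.popleft()
            for t in neighbors(s):
                if t not in dist:
                    dist[t] = dist[s] + 1
                    queue.append(t)
        return dist

    a, b = (0, 1), (0, 3)                        # adjacent columns, opposite sides of the wall
    print("Euclidean:", np.hypot(a[0] - b[0], a[1] - b[1]))   # 2.0 -- looks close
    print("Geodesic :", geodesic(a)[b])                       # 10  -- far along the manifold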
     3. Methods for designing heuristic reward functions generally fall into two categories: generalized distance methods and abstract model methods. For the generalized distance approach, working within the framework of heuristic reinforcement learning based on distance metric learning, the dissertation uses the simplest Laplacian eigenmap to obtain a new heuristic reward function design method. For the abstract model approach, the improved Option generation algorithms described above are used to produce the abstract model, yielding two heuristic reward function design methods that automatically decompose the potential function within subtasks.
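     One plausible instantiation of the Laplacian-eigenmap shaping idea, given here as a hedged sketch rather than the dissertation's exact algorithm: embed the states with the smoothest non-constant eigenvectors of the graph Laplacian, take the negative embedding distance to the goal as a potential, and add the standard potential-based shaping term F(s, s') = gamma * phi(s') - phi(s), which is known to leave the optimal policy unchanged. The adjacency matrix A, the embedding dimension dim, and the discount gamma are assumptions.

    import numpy as np

    def laplacian_eigenmap(A, dim=3):
        """Embed states with the `dim` smoothest non-constant eigenvectors of L = D - A."""
        L = np.diag(A.sum(axis=1)) - A
        _, vecs = np.linalg.eigh(L)              # eigenvectors sorted by ascending eigenvalue
        return vecs[:, 1:dim + 1]                # drop the constant eigenvector

    def shaping_term(A, goal, gamma=0.95, dim=3):
        """Potential-based shaping F(s, s') = gamma * phi(s') - phi(s), with
        phi(s) = -||embedding(s) - embedding(goal)|| as a manifold-distance heuristic."""
        X = laplacian_eigenmap(A, dim)
        phi = -np.linalg.norm(X - X[goal], axis=1)
        return lambda s, s_next: gamma * phi[s_next] - phi[s]

    # Inside a tabular Q-learning update (r is the environment reward; names are assumed):
    # F = shaping_term(A, goal_state)
    # Q[s, a] += alpha * (r + F(s, s_next) + gamma * Q[s_next].max() - Q[s, a])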
     4. Still within the heuristic reinforcement learning framework based on distance metric learning, a new heuristic action selection mechanism and an improved Dyna-Q planning algorithm are proposed for policy selection and Dyna planning, respectively. Both methods improve the initial learning performance of Q-learning.
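     The sketch below shows one way such heuristics can enter both action selection and Dyna-Q planning: the greedy choice is biased by a heuristic table H(s, a), and simulated planning updates are drawn preferentially from transitions whose successor states have high potential. This is a speculative illustration under stated assumptions (a tabular Q, integer state indices, a deterministic learned model with model[(s, a)] = (r, s_next), and a state potential phi), not the dissertation's specific mechanism.

    import numpy as np

    def heuristic_greedy(Q, H, s, eps=0.1, xi=1.0, rng=None):
        """Epsilon-greedy action choice biased by a heuristic table H(s, a);
        xi controls how strongly the heuristic steers early exploration."""
        rng = rng or np.random.default_rng()
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s] + xi * H[s]))

    def heuristic_dyna_planning(Q, model, phi, n_steps=20, alpha=0.5, gamma=0.95, rng=None):
        """One Dyna-Q planning round in which simulated updates are sampled
        preferentially from transitions whose successor has high potential phi."""
        rng = rng or np.random.default_rng()
        keys = list(model)                                # model[(s, a)] = (r, s_next)
        weights = np.array([np.exp(phi[model[k][1]]) for k in keys])
        probs = weights / weights.sum()
        for _ in range(n_steps):
            s, a = keys[rng.choice(len(keys), p=probs)]
            r, s_next = model[(s, a)]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])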
     5. In transfer tasks where the state space is scaled up proportionally, the proto-value function method based on spectral graph theory can effectively transfer only the basis functions associated with the smaller eigenvalues, and using them for value function approximation in the target task produces errors in the values of some states. The dissertation analyzes the cause of these approximation errors and proposes a hybrid transfer method that combines basis functions with subtask optimal policies. The proposed method determines the optimal policy directly for part of the target task's state space, reduces the minimum number of basis functions needed for value function approximation, and lowers the number of policy iterations; it is well suited to transfer tasks whose state spaces have a clear hierarchical structure.
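     For reference, a compact sketch of the proto-value function machinery that the hybrid transfer method builds on: the smoothest eigenvectors of the graph Laplacian serve as basis functions, and a least-squares fit approximates the value function from sampled values; in the hybrid scheme, states covered by a transferred subtask policy would bypass this approximation altogether. The function names and the sample-based fit are illustrative assumptions, not the dissertation's code.

    import numpy as np

    def proto_value_functions(A, k=8):
        """Proto-value functions: the k smoothest eigenvectors of the graph Laplacian,
        used as basis functions for linear value-function approximation."""
        L = np.diag(A.sum(axis=1)) - A
        _, vecs = np.linalg.eigh(L)
        return vecs[:, :k]                       # shape (n_states, k)

    def fit_value_function(Phi, sampled_states, sampled_values):
        """Least-squares fit V ~= Phi @ w from value samples on a subset of states."""
        w, *_ = np.linalg.lstsq(Phi[sampled_states], sampled_values, rcond=None)
        return Phi @ w                           # approximate value for every state

    # Hybrid transfer, schematically: states covered by a transferred subtask's optimal
    # policy keep that policy directly; only the remaining states fall back to the
    # value-function approximation built from the transferred low-order basis.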
     The work of this dissertation centers on four elements of reinforcement learning: the model, the immediate reward, the value function, and the policy. Several reinforcement learning algorithms based on spectral graph theory are proposed, and their ranges of application and computational complexities are analyzed. Simulation results verify the effectiveness and applicability of the proposed algorithms.
