As an effective method for solving sequential decision-making problems, reinforcement learning suffers from the curse of dimensionality when it is applied to problems with large-scale or continuous spaces. Overcoming the curse of dimensionality and improving algorithmic efficiency are the main challenges for reinforcement learning at the present stage. In recent years, spectral graph theory, a mathematical tool that can uncover the topological structure of high-dimensional data spaces, has been applied with great success in fields such as complex networks, image processing and vision, and manifold learning. Introducing spectral graph theory into reinforcement learning therefore has considerable research value.
     To improve algorithmic efficiency, this dissertation studies how to apply spectral graph theory in three areas: hierarchical reinforcement learning, heuristic reinforcement learning based on manifold distance, and transfer learning for reinforcement learning. First, for hierarchical reinforcement learning, a new method for computing subtask policies and two modified task decomposition methods are proposed. Second, for tasks that involve reaching a target location, a framework for heuristic reinforcement learning based on distance metric learning is established. Within this framework, the Laplacian Eigenmap, chosen for its efficiency, is applied to three aspects, namely reward shaping, heuristic action selection, and heuristic Dyna planning, which yields three categories of heuristic reinforcement learning algorithms. Finally, to address the shortcomings of basis function transfer based on spectral graph theory, a hybrid transfer method that combines basis functions with optimal subtask policies is designed for transfer learning in reinforcement learning. The main contributions of this dissertation include:
     1. The Option method comprises task decomposition and subtask policy computation. For task decomposition, existing Option methods based on spectral graph partitioning require the number of subtasks to be set by hand and have a limited range of application. The dissertation analyzes the cause and proposes two modified automatic Option decomposition algorithms by introducing multiway spectral clustering and the eigengap method. For subtask policy computation, existing methods generally treat each subtask as a new reinforcement learning problem. Exploiting the fact that the Laplacian Eigenmap preserves the local topological structure of the state space, the dissertation proposes a new method for computing subtask policies, namely the virtual value function method.
     2. For learning tasks defined by a target location, a generalized distance is often used as the heuristic function in the design of heuristic reward functions, in heuristic action selection, and in heuristic Dyna planning. The key is how to define the generalized distance according to the properties and structure of the task. For tasks whose value functions are discontinuous in Euclidean space but continuous on some manifold, a framework for heuristic reinforcement learning based on distance metric learning is built.
     3. There are two types of design method for heuristic reward functions: the generalized distance method and the abstract model method. Following the generalized distance method, and within the established heuristic reinforcement learning framework, the dissertation uses the Laplacian Eigenmap, the simplest of the methods based on spectral graph theory, to obtain a new design method for heuristic reward functions. Building on the abstract model method and the two modified automatic Option decomposition algorithms above, the dissertation proposes two improved reward function design methods, both of which can adaptively decompose the subtask potential function.
     4. Within the established heuristic reinforcement learning framework, a new heuristic action selection method and an improved Dyna-Q planning algorithm are proposed for action selection and Dyna planning, respectively. Both methods improve the early-stage learning performance of Q-learning.
     5. When transferring to a scaled-up state space under the proto-value function framework based on spectral graph theory, only the basis functions corresponding to the smaller eigenvalues transfer effectively. Having so few effective basis functions, however, leads to errors in approximating the value function of the target task. The dissertation analyzes the cause of this approximation error and designs a hybrid transfer method that combines basis function transfer with the transfer of optimal subtask policies. The proposed method directly yields the optimal policy for some states, and reduces both the number of iterations and the minimum number of basis functions needed to approximate the value function. The method is suited to scaled-up state space transfer tasks with hierarchical control structures.
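Contributions 1 and 5 both draw on the spectrum of the graph Laplacian of the state graph. The following is a minimal sketch under stated assumptions, not the dissertation's algorithms. It builds a hypothetical two-room state graph, applies the eigengap heuristic to suggest a number of subtasks, splits the states by the sign of the second eigenvector (a simple two-way stand-in for multiway spectral clustering), and then measures how well the first m Laplacian eigenvectors, used as proto-value basis functions, approximate a value function. The graph, the names `two_room_graph`, `eigengap_k`, and `projection_error`, the `hops_to_goal` table, and `gamma = 0.9` are all illustrative assumptions. The combinatorial Laplacian L = D - W is used for simplicity.

```python
import numpy as np

def two_room_graph():
    """Adjacency matrix: two fully connected 4-state rooms joined by one doorway edge (3-4)."""
    W = np.zeros((8, 8))
    for room in (range(0, 4), range(4, 8)):
        for i in room:
            for j in room:
                if i != j:
                    W[i, j] = 1.0
    W[3, 4] = W[4, 3] = 1.0
    return W

def laplacian_spectrum(W):
    """Eigenvalues (ascending) and eigenvectors of the combinatorial Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)

def eigengap_k(eigvals, max_k):
    """Eigengap heuristic: pick k where the gap between eigenvalue k and k+1 is largest."""
    gaps = np.diff(eigvals[: max_k + 1])
    return int(np.argmax(gaps)) + 1

def projection_error(V, basis):
    """Least-squares error of approximating V in the span of the basis columns."""
    w, *_ = np.linalg.lstsq(basis, V, rcond=None)
    return float(np.linalg.norm(V - basis @ w))

W = two_room_graph()
eigvals, eigvecs = laplacian_spectrum(W)

# Suggested number of subtasks, then a two-way split by the sign of the second eigenvector.
k = eigengap_k(eigvals, max_k=4)
labels = (eigvecs[:, 1] > 0).astype(int)

# Hypothetical target value function: discounted by hop count to a goal at state 7.
gamma = 0.9
hops_to_goal = np.array([3, 3, 3, 2, 1, 1, 1, 0])
V = gamma ** hops_to_goal

# Approximation error when only the m smoothest eigenvectors are kept as basis functions.
errors = [projection_error(V, eigvecs[:, :m]) for m in range(1, 9)]
print("suggested subtasks:", k)
print("room labels:", labels.tolist())
print("errors by basis size:", [round(e, 4) for e in errors])
```

In this toy graph the error shrinks as more eigenvectors are kept and vanishes once all eight are used, which illustrates why transferring only a few low-order basis functions can leave a residual approximation error in the target task.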
     Reinforcement learning has four elements: model, reward, value function, and policy. Centered on these four elements, the dissertation studies several algorithms based on spectral graph theory, analyzes their ranges of application and computational complexity, and verifies their effectiveness and applicability through simulation experiments.
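The four elements named above can be seen together in a small tabular example. The sketch below is an illustration under stated assumptions, not the dissertation's improved Dyna-Q: it runs plain Dyna-Q (learned model, reward, Q-value table) on a hypothetical six-state corridor, and biases greedy action selection with a heuristic derived from a one-dimensional spectral embedding of the state graph. The corridor, the function names, the bonus table `H`, and its weight `xi` are assumptions made for illustration. The embedding uses the second eigenvector of the combinatorial Laplacian as a simplified stand-in for a Laplacian Eigenmap, and `alpha = 1.0` is chosen only because the environment is deterministic.

```python
import random
import numpy as np

N_STATES, GOAL = 6, 5          # corridor 0-1-2-3-4-5 with the goal at the right end
ACTIONS = (-1, +1)             # action index 0 = left, 1 = right

def step(s, a):
    """Deterministic corridor dynamics; reward 1 only on entering the goal."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def embedding_distance_to_goal():
    """Distance to the goal in a 1-D spectral embedding of the corridor graph."""
    W = np.zeros((N_STATES, N_STATES))
    for i in range(N_STATES - 1):
        W[i, i + 1] = W[i + 1, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)        # eigenvalues come back in ascending order
    y = vecs[:, 1]                     # second eigenvector as the embedding coordinate
    return np.abs(y - y[GOAL])

def heuristic_dyna_q(episodes=20, planning_steps=10, gamma=0.9,
                     alpha=1.0, eps=0.1, xi=1.0, seed=0):
    rng = random.Random(seed)
    dist = embedding_distance_to_goal()
    # H[s][a] = 1 when action a moves s strictly closer to the goal in the embedding.
    H = [[1.0 if dist[step(s, a)[0]] < dist[s] else 0.0 for a in (0, 1)]
         for s in range(N_STATES)]
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}                         # learned model: (s, a) -> (r, s2)

    def backup(s, a, r, s2):
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < eps:     # occasional random exploration
                a = rng.randrange(2)
            else:                      # greedy on Q plus the heuristic bonus
                a = max((0, 1), key=lambda b: Q[s][b] + xi * H[s][b])
            s2, r = step(s, a)
            backup(s, a, r, s2)        # learn from real experience
            model[(s, a)] = (r, s2)
            for _ in range(planning_steps):   # Dyna planning from the learned model
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                backup(ps, pa, pr, ps2)
            s = s2
    return Q, H

Q, H = heuristic_dyna_q()
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(GOAL)]
print("greedy action per state (1 = right):", greedy)
```

The heuristic only affects which action is tried; the Q-value backups are unchanged, so the learned values are those of ordinary Q-learning while the early episodes waste fewer steps moving away from the goal.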
