Research on Reinforcement Learning Algorithms Based on Value Function Approximation
Abstract
In recent years, reinforcement learning has drawn wide attention from machine learning researchers. On reinforcement learning problems with small-scale state spaces, value-table-based algorithms not only achieve excellent experimental results but also enjoy complete convergence proofs.
     In practical applications, however, reinforcement learning algorithms usually face large-scale or continuous state spaces, and even continuous action spaces (e.g., the steering control problem in autonomous driving). Value-table-based algorithms can then neither store the value table nor traverse the entire state and action spaces; in other words, reinforcement learning runs into the "curse of dimensionality". The usual remedy is to combine classic reinforcement learning algorithms with function approximation, so as to strengthen the value function's ability to abstract and generalize over the state and action spaces.
     From the perspective of function approximation, the main work and contributions of this thesis are as follows:
     (1) The basic reinforcement learning model is briefly introduced, and reinforcement learning algorithms based on linear value function approximation and on kernel methods are surveyed.
     (2) Reinforcement learning with linear function approximation seeks a least-squares solution whose prediction error is bounded by the residual between the optimal value function and its projection, where the projection operator is Π = Φ(Φ^T D Φ)^{-1} Φ^T D. The projection operator is thus closely tied to the feature function and directly affects the prediction error bound. In practical problems, limited by the representational power of a linear value function, this bound can become very large when expert knowledge is insufficient or the features Φ are poorly designed.
     To address this problem, this thesis proposes Temporal Difference learning with a Piecewise Linear Value Function (PLVF-TD) to further reduce the error bound. The PLVF-TD framework consists of two steps: constructing a piecewise linear basis for problems of different dimensionality, and learning the parameters of the value function with a temporal difference algorithm whose complexity is O(n). Analysis shows that the error bound decreases as the number of piecewise linear basis functions increases, and tends to zero as that number tends to infinity. Experimental results confirm the effectiveness of PLVF-TD.
     (3) Unlike reinforcement learning with linear function approximation, kernel-based reinforcement learning possesses very strong representational power by the representer theorem. For realistic reinforcement learning problems, however, traditional kernel-based algorithms cannot meet the requirements of online learning because of issues with both accuracy and complexity.
     To address this problem, this thesis proposes Online Selective Kernel-based Temporal Difference learning (OSKTD). OSKTD comprises two online procedures: online sparsification and parameter updating of the value function. For online sparsification, drawing on selective ensemble learning, we propose a kernel-distance-based online sparsification method with complexity O(n), lower than that of other sparsification methods. For parameter updating, based on the principle of local validity, we propose a selective kernel-based value function and learn its parameters iteratively by combining classic temporal difference learning with gradient descent. Experimental results confirm the effectiveness of OSKTD.
     (4) Real-world problems usually involve continuous state spaces and continuous action spaces at the same time; to achieve accurate control, reinforcement learning with continuous action spaces has become a new research focus.
     To address this problem, this thesis combines the strength of Actor-Critic methods in handling continuous action spaces with the strength of kernel methods in handling continuous state spaces, and proposes Kernel-based Continuous-action Actor-Critic Learning (KCACL). The actor updates the action-selection probabilities according to the reward-inaction principle, and the critic updates the state value function with the OSKTD algorithm. Experimental results confirm the effectiveness of KCACL on continuous-action reinforcement learning problems.
In recent years, reinforcement learning has attracted increasing attention from machine learning researchers. On reinforcement learning problems with both a small-scale state space and a small-scale action space, classic value-table-based reinforcement learning algorithms are backed by mathematical convergence proofs and perform well in experiments.
     However, in practice, reinforcement learning problems usually involve large-scale and/or continuous state spaces, and even continuous action spaces, e.g., steering control in automatic driving. This brings the "curse of dimensionality", which challenges classic table-based reinforcement learning algorithms in both memory requirements and learning efficiency. A common solution is to combine classic reinforcement learning algorithms with function approximation methods in order to enhance the abstraction and generalization abilities over the state and action spaces.
     From the perspective of function approximation, the main work and contributions of this thesis are as follows:
     (1) A short introduction to the reinforcement learning model is given. Then, reinforcement learning algorithms based on linear function approximation and on kernel methods are surveyed.
     (2) The Temporal Difference (TD) learning family tries to learn a least-squares solution for an approximate Linear Value Function (LVF). Due to the limited representational ability of the features in an LVF, the predictive error of the learned LVF is bounded by the residual between the optimal value function and its projection onto the feature space, where the projection operator is Π = Φ(Φ^T D Φ)^{-1} Φ^T D. We find that this bound, and hence the predictive error, can be very large if the feature function Φ is not well designed.
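     For reference, a standard statement of this kind of bound for on-policy TD with linear function approximation (in the style of Tsitsiklis and Van Roy, 1997) can be written as follows; the exact constant used in the thesis may differ:

    \Pi = \Phi\,(\Phi^{\top} D\, \Phi)^{-1}\, \Phi^{\top} D, \qquad
    \| \Phi\theta_{TD} - V^{\pi} \|_{D} \;\le\; \frac{1}{\sqrt{1-\gamma^{2}}}\, \| \Pi V^{\pi} - V^{\pi} \|_{D},

     where V^π is the value function being approximated, D is the diagonal matrix of the stationary state distribution, and γ is the discount factor. When the columns of Φ can represent V^π well, the projection residual on the right-hand side is small, and so is the predictive error.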
     To deal with this problem, Temporal Difference learning with a Piecewise Linear Value Function (PLVF-TD) is proposed to further decrease the error bound. PLVF-TD has two steps: (i) building the piecewise linear basis for problems of different dimensions; (ii) learning the parameters via temporal difference learning, whose complexity is O(n). The error bound is proved to decrease to zero as the size of the piecewise basis goes to infinity.
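     A minimal sketch of these two steps, assuming a one-dimensional state space and a "hat" (tent-shaped) construction for the piecewise linear basis; the basis construction used in the thesis for higher-dimensional problems may differ:

    import numpy as np

    def hat_basis(num_pieces, lo, hi):
        """Piecewise linear ('hat') basis on [lo, hi]; an assumed construction."""
        knots = np.linspace(lo, hi, num_pieces)
        width = knots[1] - knots[0]
        def phi(s):
            # Each feature is a tent centered at a knot; at most two are non-zero at any s.
            return np.maximum(0.0, 1.0 - np.abs(s - knots) / width)
        return phi

    def td0_update(theta, phi, s, r, s_next, alpha=0.1, gamma=0.95):
        """One O(n) TD(0) step on the piecewise linear value function V(s) = phi(s) . theta."""
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta   # TD error
        theta += alpha * delta * phi(s)                            # gradient-style parameter update
        return theta

    # Hypothetical usage on a single transition (s, r, s_next):
    phi = hat_basis(num_pieces=20, lo=-1.0, hi=1.0)
    theta = np.zeros(20)
    theta = td0_update(theta, phi, s=0.3, r=1.0, s_next=0.35)

     Increasing num_pieces corresponds to enlarging the piecewise linear basis, which is what drives the error bound toward zero in the analysis above.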
     (3) Different from reinforcement learning with linear approximation, kernel-based reinforcement learning has a strong representational ability because of the Representer Theorem. However, because of issues with both accuracy and complexity, typical kernel-based reinforcement learning algorithms cannot meet the requirements of online learning.
     To deal with this problem, an algorithm named Online Selective Kernel-based Temporal Difference (OSKTD) learning is proposed. OSKTD includes two online procedures: online sparsification and parameter updating for the selective kernel-based value function. A new sparsification method, a kernel-distance-based online sparsification method, is proposed based on selective ensemble learning; it is computationally less complex than other sparsification methods. Based on this method, the sparsified dictionary of samples is constructed online by checking whether each incoming sample needs to be added to the dictionary. In addition, based on local validity, a selective kernel-based value function is proposed that selects the best samples from the sample dictionary for the value function approximator. Its parameters are updated iteratively by temporal difference learning combined with the gradient descent technique. The complexity of the online sparsification procedure in OSKTD is O(n).
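     A sketch of the two online procedures, assuming a Gaussian kernel, a nearest-sample kernel-distance test for sparsification, and a similarity cutoff for local validity; the exact criteria used by OSKTD may differ:

    import numpy as np

    def gaussian_kernel(x, y, sigma=0.5):
        x, y = np.atleast_1d(x).astype(float), np.atleast_1d(y).astype(float)
        return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

    class OSKTDSketch:
        """Online sparsification plus a selective kernel-based value function (a sketch)."""
        def __init__(self, mu=0.3, cutoff=0.1, alpha=0.1, gamma=0.95):
            self.samples, self.w = [], []          # sparsified dictionary and its weights
            self.mu, self.cutoff = mu, cutoff      # sparsification / local-validity thresholds
            self.alpha, self.gamma = alpha, gamma

        def _maybe_add(self, s):
            # O(n) scan: kernel distance from s to its nearest dictionary sample.
            if self.samples:
                d2 = min(gaussian_kernel(s, s) - 2.0 * gaussian_kernel(s, si)
                         + gaussian_kernel(si, si) for si in self.samples)
                if d2 <= self.mu ** 2:
                    return
            self.samples.append(s)
            self.w.append(0.0)

        def _features(self, s):
            # Local validity: only dictionary samples similar enough to s contribute.
            k = np.array([gaussian_kernel(s, si) for si in self.samples])
            return np.where(k > self.cutoff, k, 0.0)

        def value(self, s):
            return float(self._features(s) @ np.array(self.w)) if self.w else 0.0

        def td_step(self, s, r, s_next):
            # Online sparsification followed by a gradient-style TD update of the weights.
            self._maybe_add(s)
            phi = self._features(s)
            delta = r + self.gamma * self.value(s_next) - float(phi @ np.array(self.w))
            self.w = list(np.array(self.w) + self.alpha * delta * phi)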
     (4) Real-world problems often require learning algorithms to deal with both continuous state and continuous action spaces in order to achieve accurate control. Thus, reinforcement learning problems with continuous action spaces have become a hot research topic.
     To deal with this problem, Actor-Critic methods are combined with kernel methods, because Actor-Critic methods are good at dealing with continuous action spaces while kernel methods are good at dealing with continuous state spaces. Thus, Kernel-based Continuous-action Actor-Critic Learning (KCACL) is proposed, where the actor updates the probability of each action based on reward-inaction, and the critic updates the state value function based on Online Selective Kernel-based Temporal Difference (OSKTD) learning. The empirical results demonstrate the effectiveness of all the proposed algorithms.
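     A sketch of the reward-inaction update for the actor. Two simplifications are assumed here: the continuous action space is represented by a finite set of candidate actions, and the "reward" signal for the update is taken to be the sign of the critic's TD error; KCACL's actual handling of continuous actions and of the reward signal may differ:

    import numpy as np

    class RewardInactionActor:
        """Linear reward-inaction update over a set of candidate actions (a sketch)."""
        def __init__(self, actions, lr=0.05):
            self.actions = list(actions)
            self.p = np.full(len(self.actions), 1.0 / len(self.actions))  # action probabilities
            self.lr = lr

        def sample(self):
            idx = np.random.choice(len(self.actions), p=self.p)
            return idx, self.actions[idx]

        def update(self, idx, rewarded):
            if rewarded:
                # Reward: move probability mass toward the action that was just executed.
                onehot = np.zeros_like(self.p)
                onehot[idx] = 1.0
                self.p += self.lr * (onehot - self.p)
            # Inaction: when the feedback is not a reward, leave the probabilities unchanged.

    # Hypothetical interaction step, with `critic` being an OSKTD-style value function
    # (e.g. the OSKTDSketch above) and `env` a hypothetical environment interface:
    #
    #   idx, a = actor.sample()
    #   s_next, r = env.step(a)
    #   delta = r + critic.gamma * critic.value(s_next) - critic.value(s)
    #   critic.td_step(s, r, s_next)
    #   actor.update(idx, rewarded=(delta > 0))   # positive TD error treated as 'reward'
    #   s = s_next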