Analysis and Research on Off-Policy Algorithms in Reinforcement Learning
Abstract
Reinforcement learning is a learning method that searches, through trial-and-error interaction with the environment, for a policy that maximizes the expected cumulative reward. Depending on whether the behavior policy and the target policy are identical during learning, reinforcement learning methods are divided into on-policy and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms have a wider range of applications and have become a research hotspot in reinforcement learning. This thesis analyzes the main problems in current off-policy research, namely non-convergence, slow convergence and low convergence accuracy, and proposes a series of solutions. The main research content comprises the following four parts:
     (1) An off-policy Q(λ) algorithm based on linear function approximation is proposed. By introducing an associated importance factor, the algorithm unifies the on-policy and off-policy settings as the number of iterations gradually increases, which ensures its convergence. Moreover, under the premise that the on-policy and off-policy sample data are consistent, a theoretical proof of convergence is given.
     (2) Starting from the TD error, the concept of the n-order TD error is given and applied to the classical Q(λ) learning algorithm, and a fast Q(λ) learning algorithm based on the second-order TD error, SOE FQ(λ), is proposed. The algorithm corrects the Q-value function with the second-order TD error and propagates the TD error to the whole state-action space through eligibility traces, which speeds up convergence. On this basis, the convergence and convergence efficiency of the algorithm are analyzed: when only one-step updates are considered, the number of iterations T that the algorithm needs to perform depends mainly exponentially on 1/(1-γ) and 1/ε.
     (3) Transferring value function information during learning is proposed to reduce the number of samples the algorithm needs to converge and to speed up its convergence. Based on the learning framework of the classical off-policy Q-Learning algorithm, combined with a value function transfer method that optimizes the initialization of the value function, a new fast Q-Learning algorithm based on value function transfer, VFT-Q-Learning, is proposed. In its early stage, the algorithm introduces a bisimulation metric to measure, given identical state and action spaces, the distance between states in the target task and states in historical tasks; the value functions of states that are similar and satisfy a given condition are transferred, and learning then proceeds with the learning algorithm.
     (4) For the problem of balancing exploration and exploitation in large-scale or continuous state spaces with deterministic environments, an off-policy approximate policy iteration algorithm based on Gaussian processes is proposed. The algorithm models the parameterized value function with a Gaussian process, constructs a generative model combined with the associated importance factor, and derives the posterior distribution of the value function by Bayesian inference. During learning, the information value gain of each action is computed from the probability distribution of the value function and, combined with the expected value of the value function, the corresponding action is selected. To a certain extent, the algorithm resolves the exploration-exploitation trade-off and accelerates convergence.
Reinforcement learning is a learning method that interacts with the environment in order to find the optimal policy with the maximal expected accumulated reward. According to whether the behavior policy and the target policy are identical in the learning process, reinforcement learning algorithms can be divided into two main classes: on-policy algorithms and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms have a much wider application range, and research related to off-policy algorithms has become more and more popular. With respect to the main problems of off-policy algorithms, such as non-convergence, slow convergence rate and low convergence accuracy, this thesis provides a series of solutions, which mainly include the following four parts:
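To make the on-policy/off-policy distinction concrete, the sketch below (an illustration only, not taken from the thesis; tabular setting, made-up names) contrasts the classical SARSA and Q-Learning updates: SARSA bootstraps on the action actually taken by the behavior policy, whereas Q-Learning bootstraps on the greedy target policy, so its behavior and target policies may differ.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy TD(0): bootstrap on the action a_next actually chosen
    by the behavior policy (behavior policy == target policy)."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Off-policy TD(0): bootstrap on the greedy target policy,
    regardless of which action the behavior policy takes next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```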
     (1) A novel off-policy Q(λ) algorithm based on linear function approximation is proposed. It introduces an associated importance factor to unify the on-policy and off-policy sample data distributions over the iterations, which assures convergence. Under the premise of sample data consistency, a proof of convergence of the algorithm is given.
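The abstract does not spell out the associated importance factor; as a hedged sketch, the following generic off-policy Q(λ) step with linear features uses a per-decision importance weight rho in the eligibility-trace update to play a comparable role. The names and the exact trace handling are illustrative assumptions, not the thesis's formulation.

```python
import numpy as np

def off_policy_q_lambda_step(theta, e, phi_sa, phi_next_greedy, r, rho,
                             alpha=0.05, gamma=0.95, lam=0.8):
    """One generic off-policy Q(lambda) step with linear function
    approximation, Q(s, a) ~= theta @ phi(s, a).

    theta           : weight vector
    e               : eligibility-trace vector (same shape as theta)
    phi_sa          : features of the current state-action pair
    phi_next_greedy : features of the next state with its greedy action
    rho             : importance weight (target prob. / behavior prob.)
    """
    delta = r + gamma * theta @ phi_next_greedy - theta @ phi_sa  # TD error
    e = rho * (gamma * lam * e) + phi_sa                          # decay and accumulate the trace
    theta = theta + alpha * delta * e                             # gradient-style weight update
    return theta, e
```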
     (2) From the perspective of the TD error, the n-order TD error is defined and used in the traditional Q(λ) algorithm, and a fast Q(λ) algorithm based on the second-order TD error is put forward. The algorithm adjusts the Q-value with the second-order TD error and broadcasts the TD error to the whole state-action space, which speeds up convergence. In addition, the convergence rate is analyzed; under the condition of one-step updates, the result shows that the number of iterations depends mainly on 1/(1-γ) and 1/ε.
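The precise definition of the n-order TD error is given in the thesis itself; the sketch below assumes, purely for illustration, an n-order TD error built from the n-step bootstrapped return, so that n = 1 recovers the ordinary TD error and n = 2 a second-order variant.

```python
import numpy as np

def n_order_td_error(Q, transitions, n, gamma=0.95):
    """Hypothetical n-order TD error based on the n-step bootstrapped
    return (an assumption for illustration, not the thesis's definition).

    transitions: list of at least n consecutive (s, a, r, s_next) tuples,
                 starting from the state-action pair being updated.
    """
    s0, a0, _, _ = transitions[0]
    n_step_return = sum(gamma ** k * transitions[k][2] for k in range(n))
    s_boot = transitions[n - 1][3]                   # state reached after n steps
    n_step_return += gamma ** n * np.max(Q[s_boot])  # greedy bootstrap
    return n_step_return - Q[s0, a0]
```

A SOE FQ(λ)-style update would then, as the abstract describes, propagate such an error backwards over the state-action space through eligibility traces.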
     (3) Transferring the value function between similar learning tasks with the same state space and action space is proposed, which reduces the number of samples needed in the target task and speeds up convergence. Based on the framework of the off-policy Q-Learning algorithm, combined with a value function transfer method, a novel fast Q-Learning algorithm based on value function transfer, VFT-Q-Learning, is put forward. At the beginning, the algorithm uses a bisimulation metric to measure the distance between states in the target task and states in a historical task, on the condition that the two tasks have the same state space and action space; it transfers the value function if the distance meets a given condition, and finally executes the learning algorithm.
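A minimal sketch of that initialization step, assuming a precomputed bisimulation distance function and tabular value functions; the function names, the threshold epsilon and the nearest-neighbour matching are illustrative assumptions.

```python
def transfer_init_q(Q_source, bisim_distance, states_target, states_source,
                    actions, epsilon=0.1, default=0.0):
    """Initialize the target task's Q table by value function transfer:
    for each target state, find the closest historical state under the
    given bisimulation metric and copy its action values when the
    distance is below a threshold; otherwise keep a default value.

    bisim_distance(s_target, s_source) -> float is assumed to be given.
    """
    Q_target = {s: {a: default for a in actions} for s in states_target}
    for s_t in states_target:
        s_best = min(states_source, key=lambda s_s: bisim_distance(s_t, s_s))
        if bisim_distance(s_t, s_best) < epsilon:
            for a in actions:
                Q_target[s_t][a] = Q_source[s_best][a]   # value function transfer
    return Q_target
```

Ordinary Q-Learning would then start from Q_target instead of an arbitrary initial value function.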
     (4) For the problem of balancing exploration and exploitation in large or continuous state spaces, a novel off-policy approximate policy iteration algorithm based on Gaussian processes is put forward. The algorithm uses a Gaussian process to model the action-value function and, combined with the associated importance factor, constructs a generative model and obtains the posterior distribution of the parameter vector of the action-value function by Bayesian inference. During learning, the value of perfect information is computed from the posterior distribution and, combined with the expected value of the action-value function, the appropriate action is selected. To a certain extent, the algorithm balances exploration and exploitation during learning and accelerates convergence.
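As a rough illustration of posterior-based action selection (not the thesis's algorithm), the sketch below keeps one Gaussian process per action, assumed to be fitted elsewhere on observed returns, and replaces the information-value term with a simple posterior-uncertainty bonus; kappa and the bonus form are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def make_gp_models(n_actions):
    """One GP per action, to be fitted elsewhere on
    (state-feature, observed-return) pairs for that action."""
    return [GaussianProcessRegressor(normalize_y=True) for _ in range(n_actions)]

def gp_action_selection(gp_models, state, kappa=1.0):
    """Choose an action from the GP posterior: posterior mean plus an
    uncertainty bonus standing in for the information-value term of the
    abstract (kappa and the bonus form are illustrative assumptions)."""
    scores = []
    for gp in gp_models:
        mean, std = gp.predict(np.atleast_2d(state), return_std=True)
        scores.append(mean[0] + kappa * std[0])
    return int(np.argmax(scores))
```

Selecting by the posterior mean alone would be purely exploitative; the uncertainty-dependent term is what lets the value function's probability distribution, rather than a fixed ε-greedy schedule, drive exploration.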
