The Sarsa Reinforcement Learning Method Based on Simulated Annealing Strategy

Original title (Chinese): 基于模拟退火策略的Sarsa强化学习方法
Authors: WANG Xian-lei (王现磊); HAO Wen-ning (郝文宁); CHEN Gang (陈刚); YU Xiao-han (余晓晗)
Keywords: reinforcement learning; algorithm; simulated annealing; maze simulation
Journal: Computer Simulation (计算机仿真; journal code JSJZ)
Institution: College of Command Information Systems, Army Engineering University of PLA (中国人民解放军陆军工程大学指挥信息系统学院)
Publication date: 2019-04-15
Publisher: Computer Simulation (计算机仿真)
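Year: 2019
Volume: 36
Funding: National Natural Science Foundation of China, Young Scientists Fund (71501186)
Language of original: Chinese
File ID: JSJZ201904046
Page count: 5
Issue: 04
CN: 11-3724/TP
Pages: 225-228+234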

Abstract

To address the slow convergence of traditional reinforcement learning algorithms such as Sarsa, a Sarsa algorithm based on a simulated annealing strategy (SA-Sarsa) is proposed. For action selection, the simulated annealing strategy replaces the ε-greedy strategy, and the annealing rate is used to control the convergence speed of the algorithm. This overcomes the tendency of Sarsa to fall into a local optimum, which arises because Sarsa selects actions directly by comparing a random number with the greedy value, and it preserves the optimal solution while improving convergence speed. In simulations of a maze path planning problem, SA-Sarsa was compared with two traditional algorithms, Q-Learning and Sarsa. The experiments show that, while reaching the same optimal solution, SA-Sarsa explores more efficiently and converges faster.
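
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact selection rule, so the sketch assumes a Metropolis-style acceptance test, as is common in simulated annealing, together with a geometric cooling schedule. The function names (`sa_select_action`, `sarsa_update`, `run_corridor`), the parameters (`t0`, `cooling`, `t_min`), and the one-dimensional corridor that stands in for the maze are all hypothetical.

```python
import math
import random


def sa_select_action(q_row, temperature, rng=random):
    """Choose an action with an assumed Metropolis-style annealing rule."""
    greedy = max(range(len(q_row)), key=lambda a: q_row[a])
    candidate = rng.randrange(len(q_row))
    # delta <= 0, so the acceptance probability exp(delta / T) lies in (0, 1].
    delta = q_row[candidate] - q_row[greedy]
    if rng.random() < math.exp(delta / temperature):
        return candidate  # explore: accept the random candidate
    return greedy  # exploit: fall back to the greedy action


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Standard on-policy Sarsa temporal-difference update."""
    q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])


def run_corridor(episodes=200, n_states=6, t0=1.0, cooling=0.95,
                 t_min=1e-3, seed=0):
    """Train on a hypothetical corridor: start at state 0, goal at the end."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    temperature = t0
    for _ in range(episodes):
        s = 0
        a = sa_select_action(q[s], temperature, rng)
        while s != n_states - 1:
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else -0.01
            a_next = sa_select_action(q[s_next], temperature, rng)
            sarsa_update(q, s, a, r, s_next, a_next)
            s, a = s_next, a_next
        # Assumed geometric cooling: a smaller factor anneals faster.
        temperature = max(t_min, temperature * cooling)
    return q
```

In this sketch the `cooling` factor plays the role of the annealing rate: early episodes accept random actions often, and as the temperature falls the rule behaves more and more like pure greedy selection.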

