Two-Loop Autopilot Design for Miniature Munitions Based on Reinforcement Learning
  • English title: Design of a Two-Loop Autopilot Based on Reinforcement Learning for Miniature Munitions
  • Authors: Fan Junfang; Zhang Xin
  • Affiliation: Beijing Key Laboratory of High Dynamic Navigation Technology, Beijing Information Science and Technology University
  • Keywords: miniature munition; two-loop autopilot; policy iteration; algebraic Riccati equation; reinforcement learning; Bellman optimality principle
  • Journal: Tactical Missile Technology (战术导弹技术), journal code ZSDD
  • Publication date: 2019-07-15
  • Year/Issue: 2019, No. 196
  • Funding: Beijing Nova Program (xxjh2015B041); Beijing Municipal Party Committee Organization Department Young Top Talent Program (2015000026833ZK03); Beijing Municipal Education Commission Young Top Talent Project (CIT&TCD201504055); Open Project of the Beijing Key Laboratory of High Dynamic Navigation Technology (HDN2018002)
  • Language: Chinese
  • Record ID: ZSDD201904008
  • Pages: 52-58 (7 pages)
  • CN: 11-1771/TJ
Abstract
Using reinforcement learning and adaptive dynamic programming, a two-loop autopilot for miniature munitions is designed, and a linear control model of the longitudinal channel is established. Because the optimal solution of the tracking problem is difficult to obtain directly, the system matrix is augmented with the desired output signal and a discount factor is introduced, converting the tracker design problem into a regulator design problem. Based on the Bellman optimality principle, policy iteration is used to solve the algebraic Riccati equation, and the convergence of the iterative algorithm is proved. Simulation results show that the two-step iteration of policy evaluation and policy improvement converges to the optimal tracking solution.
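The tracker-to-regulator conversion and the policy-iteration solution of the Riccati equation described in the abstract can be sketched numerically. The following is a minimal NumPy sketch, not the paper's implementation: the plant matrices `A`, `B`, `C` and the weights are illustrative placeholders, not the munition's longitudinal-channel model. The state is augmented with a reference generator, a discount factor keeps the augmented problem well-posed, and Kleinman-style policy iteration alternates policy evaluation (a Lyapunov equation) with policy improvement.

```python
import numpy as np

# Illustrative stable 2-state plant (NOT the paper's munition model)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])
C = np.array([[1.0, 0.0]])

# Reference generator r_{k+1} = F r_k (constant command)
F = np.array([[1.0]])

gamma = 0.9              # discount factor for the augmented problem
Q = np.array([[10.0]])   # tracking-error weight
R = np.array([[1.0]])    # control weight

# Augmented system: X = [x; r]; tracking error e = C x - r = C1 X
n, m, p = A.shape[0], B.shape[1], F.shape[0]
T = np.block([[A, np.zeros((n, p))],
              [np.zeros((p, n)), F]])
B1 = np.vstack([B, np.zeros((p, m))])
C1 = np.hstack([C, -np.eye(p)])
Q1 = C1.T @ Q @ C1

def dlyap(Ac, M):
    """Solve P = Ac^T P Ac + M by vectorization (vec(Ac^T P Ac) = (Ac^T kron Ac^T) vec P)."""
    N = Ac.shape[0]
    vecP = np.linalg.solve(np.eye(N * N) - np.kron(Ac.T, Ac.T), M.flatten(order='F'))
    return vecP.reshape((N, N), order='F')

# Policy iteration; K = 0 is admissible here because sqrt(gamma)*T is stable
K = np.zeros((m, n + p))
for i in range(50):
    Ac = np.sqrt(gamma) * (T - B1 @ K)           # discounted closed loop
    P = dlyap(Ac, Q1 + K.T @ R @ K)              # policy evaluation
    K_new = gamma * np.linalg.solve(R + gamma * B1.T @ P @ B1, B1.T @ P @ T)
    if np.linalg.norm(K_new - K) < 1e-10:        # policy improvement converged
        break
    K = K_new

# Verify: P satisfies the discounted discrete-time algebraic Riccati equation
resid = (Q1 + gamma * T.T @ P @ T
         - gamma**2 * T.T @ P @ B1
           @ np.linalg.solve(R + gamma * B1.T @ P @ B1, B1.T @ P @ T)
         - P)
print(np.linalg.norm(resid))  # near machine precision after convergence
```

Starting from any stabilizing gain, each policy-evaluation step solves only a linear Lyapunov equation rather than the quadratic Riccati equation, which is what makes the two-step iteration attractive; the gain sequence converges monotonically to the optimal tracking gain.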
