Research on Relational Reinforcement Learning Based on Logical Markov Decision Processes
Abstract
It is now widely accepted that an intelligent agent should be able to learn, so that it can capture and adapt to changes in a dynamic environment. Without any teacher, reinforcement learning (RL) lets the agent try actions and learn by trial and error through interaction with the environment: the feedback collected from the environment is used to evaluate the actions tried, and the agent eventually captures the environment and learns a decision policy adapted to it. Building on the Markov decision process (MDP), many RL methods have been proposed in recent years with great progress, and states represented by feature (attribute-value) vectors have been studied thoroughly.
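     To make the trial-and-error setting concrete, the sketch below shows standard tabular Q-learning on a toy five-state chain MDP. This is textbook RL rather than a method from this thesis; the ChainMDP class and its reset/actions/step interface are illustrative assumptions.

        import random
        from collections import defaultdict

        class ChainMDP:
            """Toy MDP: states 0..4, actions +1/-1, reward 1 on reaching state 4."""
            def reset(self):      self.s = 0; return self.s
            def actions(self, s): return [+1, -1]
            def step(self, a):
                self.s = min(4, max(0, self.s + a))
                done = self.s == 4
                return self.s, (1.0 if done else 0.0), done

        def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
            q = defaultdict(float)                    # Q[(state, action)] -> value estimate
            for _ in range(episodes):
                s, done = env.reset(), False
                while not done:
                    acts = env.actions(s)
                    a = (random.choice(acts) if random.random() < epsilon
                         else max(acts, key=lambda a: q[(s, a)]))   # explore or exploit
                    s2, r, done = env.step(a)         # feedback from the environment
                    best = 0.0 if done else max(q[(s2, b)] for b in env.actions(s2))
                    q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])  # TD update
                    s = s2
            return q

        q = q_learning(ChainMDP())
        print(round(q[(3, 1)], 2))                    # approaches 1.0 after learning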
     However, such propositional feature-vector representations can hardly express the relational information in an environment, especially in domains with many objects and many relations among them. To apply RL to such complex environments, relational reinforcement learning (RRL), based on relational representations, has recently been proposed. It studies how to learn when the states of the environment and the actions of the agent are represented in relational logic, and how to abstract over environment states in order to capture the environment. The space of ground states represented by ground atoms is huge, so a suitable abstract state representation with variables is needed.
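     The following sketch is one possible (assumed) encoding of this idea, not the thesis's formalism: ground states are sets of ground atoms, abstract states are conjunctions of atoms that may contain variables (written Prolog-style with an initial capital letter), and an abstract state covers a ground state if some substitution embeds it.

        def is_var(t):
            return isinstance(t, str) and t[:1].isupper()

        def thetas(pattern, atoms, theta=None):
            """Yield every substitution that embeds the pattern (a list of atoms,
            possibly containing variables) into the given collection of atoms."""
            theta = dict(theta or {})
            if not pattern:
                yield theta
                return
            head, rest = pattern[0], pattern[1:]
            for atom in atoms:
                if atom[0] != head[0] or len(atom) != len(head):
                    continue                          # predicate or arity mismatch
                t2, ok = dict(theta), True
                for p, a in zip(head[1:], atom[1:]):
                    if is_var(p):
                        if t2.setdefault(p, a) != a:  # variable already bound differently
                            ok = False; break
                    elif p != a:                      # constants must agree
                        ok = False; break
                if ok:
                    yield from thetas(rest, atoms, t2)

        # Blocks-world ground state: on(a,b), on(b,floor), clear(a).
        ground = {("on", "a", "b"), ("on", "b", "floor"), ("clear", "a")}
        # Abstract state "some clear block X sits on some Y" covers many ground states.
        abstract = [("clear", "X"), ("on", "X", "Y")]
        print(next(thetas(abstract, ground), None))   # e.g. {'X': 'a', 'Y': 'b'}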
     A number of RRL methods and models have been proposed recently, but the RRL problem itself is still not thoroughly understood and its theory remains insufficient. Starting from the simple logical Markov decision process (LOMDP), whose states are described by atoms only, this thesis proposes the logical Markov decision process with negation (nLMDP), and on top of this model proposes a substitution-learning method, Θ(λ)-learning, and a state-evolution method.
     In the nLMDP, logical negation is introduced for the first time to describe the environment and the task precisely. Generating and expanding methods for the abstract state space are then proposed: starting from a precisely described goal abstract state, applying the generating method once and the expanding method several times allows a designer to obtain, with little effort, a complementary abstract state space of moderate size, i.e. one in which every ground state is represented by exactly one abstract state and the abstract states together represent all ground states.
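     Below is a minimal sketch of coverage with negation and of the complementarity check, reusing thetas() from the sketch above. The three abstract states (the goal on(a,b) plus two expansions distinguished by clear(b)) are an assumed toy example; the thesis's actual generating and expanding operators are not reproduced here.

        def covers(abstract_state, ground_state):
            """Positive atoms must embed into the ground state; negated atoms must not be
            satisfiable under (an extension of) that substitution (negation as failure)."""
            pos, neg = abstract_state
            for theta in thetas(pos, ground_state):
                if not any(any(thetas([n], ground_state, theta)) for n in neg):
                    return True
            return False

        def is_complementary(abstract_states, ground_states):
            """Every ground state is covered by exactly one abstract state."""
            return all(sum(covers(A, s) for A in abstract_states) == 1 for s in ground_states)

        A_goal = ([("on", "a", "b")], [])                    # goal: a is on b
        A_2    = ([("clear", "b")], [("on", "a", "b")])      # not yet reached, b is clear
        A_3    = ([], [("on", "a", "b"), ("clear", "b")])    # not yet reached, b is covered
        S1 = {("on", "a", "b"), ("on", "b", "floor"), ("clear", "a")}
        S2 = {("on", "b", "a"), ("on", "a", "floor"), ("clear", "b")}
        S3 = {("on", "c", "b"), ("on", "a", "floor"), ("on", "b", "floor"),
              ("clear", "a"), ("clear", "c")}
        print(is_complementary([A_goal, A_2, A_3], [S1, S2, S3]))   # True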
     Prototype actions are also proposed to describe the basic ways in which the agent can act in the environment; they are a higher abstraction over abstract actions. Logical negation is likewise used in the preconditions of prototype actions. From the prototype actions and the complementary abstract state space, the applicable abstract actions of an abstract state can be obtained easily.
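     A hedged sketch of this step, again using the helpers above: the move prototype action, its precondition, the distinct-arguments filter, and the reading of "applicable" (match the precondition against the abstract state's positive atoms only) are all assumptions made for illustration, not the thesis's definitions.

        # Prototype action move(A, C): move block A onto C.
        MOVE = ("move", ("A", "C"),
                [("clear", "A"), ("clear", "C")],     # positive precondition
                [("on", "A", "C")])                   # negated precondition: A not already on C

        def applicable_abstract_actions(prototype, abstract_state):
            name, params, pre_pos, pre_neg = prototype
            pos, _neg = abstract_state
            actions = []
            for theta in thetas(pre_pos, pos):
                args = tuple(theta.get(p, p) for p in params)
                if len(set(args)) < len(args):        # assumed constraint: distinct arguments
                    continue
                if any(any(thetas([n], pos, theta)) for n in pre_neg):
                    continue                          # negated precondition violated
                actions.append((name, args))
            return actions

        # Abstract state: some clear block X, b is clear, and X is not on b.
        A = ([("clear", "X"), ("clear", "b")], [("on", "X", "b")])
        print(applicable_abstract_actions(MOVE, A))
        # -> [('move', ('X', 'b')), ('move', ('b', 'X'))]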
     An nLMDP is thus defined over a complementary abstract state space and a set of prototype actions. Based on the nLMDP, this thesis proposes a substitution-learning method, Θ(λ)-learning, with which the agent obtains abstract actions automatically and online and estimates the values of the valid substitutions from prototype actions to abstract states. Experiments show that Θ(λ)-learning is an efficient learning method.
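     The abstract does not spell out Θ(λ)-learning, so the sketch below is only a generic SARSA(λ)-style update over keys of the form (abstract state, prototype action, substitution), meant to make "estimating the values of valid substitutions" concrete; it is not the algorithm proposed in the thesis, and the key layout and the terminal "stop" key are assumptions.

        from collections import defaultdict

        def td_lambda_step(q, trace, key, reward, next_key, alpha=0.1, gamma=0.9, lam=0.8):
            """One on-policy TD(lambda) step with accumulating eligibility traces."""
            delta = reward + gamma * q[next_key] - q[key]   # temporal-difference error
            trace[key] += 1.0
            for k in list(trace):
                q[k] += alpha * delta * trace[k]
                trace[k] *= gamma * lam                     # decay the trace
            return q, trace

        q, trace = defaultdict(float), defaultdict(float)
        # Keys pair an abstract state index and a prototype action with a substitution.
        k1 = (1, "move", (("A", "X"), ("C", "b")))
        k2 = (0, "move", (("A", "b"), ("C", "X")))
        td_lambda_step(q, trace, k1, reward=0.0, next_key=k2)
        td_lambda_step(q, trace, k2, reward=1.0, next_key=(0, "stop", ()))
        print(round(q[k1], 3), round(q[k2], 3))             # 0.072 0.1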
     For complex domains it is difficult for a designer to provide a perfect complementary abstract state space, and equally difficult to evaluate a given one. This thesis therefore proposes a state-evolution method: based on the nLMDP and Θ(λ)-learning, the designer only needs to supply the goal abstract state of the task and the agent's set of prototype actions; during learning the agent organizes the abstract state space by itself, evaluates the abstract states, and learns the policy. Experiments show that during state evolution the agent captures the essence of the task and that the self-organized complementary abstract state space is reasonable.
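     The skeleton below captures only the shape of this idea: start from the goal abstract state and grow the space whenever a visited ground state is not covered by exactly one abstract state. The refine step shown (adding a catch-all state that negates the positive literals seen so far) is a placeholder invented for this sketch, not the thesis's evolution operator; covers() comes from the earlier sketch and the sampled ground states are assumptions.

        import random

        def evolve(goal_abstract, episodes, sample_ground_state):
            abstract_states = [goal_abstract]
            for _ in range(episodes):
                s = sample_ground_state()
                if sum(covers(A, s) for A in abstract_states) != 1:
                    # placeholder refinement: add a catch-all abstract state that
                    # negates every positive literal introduced so far
                    seen = [atom for (pos, _neg) in abstract_states for atom in pos]
                    abstract_states.append(([], seen))
                # ... value/policy updates (e.g. the TD sketch above) would go here ...
            return abstract_states

        goal = ([("on", "a", "b")], [])
        sample = lambda: random.choice([
            {("on", "a", "b"), ("on", "b", "floor"), ("clear", "a")},
            {("on", "a", "floor"), ("on", "b", "floor"), ("clear", "a"), ("clear", "b")},
        ])
        print(evolve(goal, episodes=20, sample_ground_state=sample))
        # typically ends with two abstract states: the goal and its negation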
     The main contributions and innovations of this thesis are:
     1. Logical negation is introduced into abstract states to describe the environment and the task precisely, and generating and expanding methods for the abstract state space are proposed, giving RRL a simple way to construct a complementary abstract state space.
     2. Prototype actions with logical negation in their preconditions are proposed, and the applicable abstract action space is formally defined, providing the basis for the agent to obtain abstract actions automatically in RRL.
     3. Based on the complementary abstract state space and the set of prototype actions, the logical Markov decision process with negation (nLMDP) is proposed as a theoretical model of RRL.
     4. Θ(λ)-learning is proposed, which obtains abstract actions online and learns a value function over the valid substitutions from prototype actions to abstract states.
     5. The theory and method of state evolution are proposed: while learning an optimal policy, the agent also learns how to organize the environment states and finally obtains a complementary abstract state space. This provides RRL with a framework in which the agent self-organizes the environment states, strengthening the agent's autonomy and simplifying the designer's work.
