Research on Reinforcement Learning Methods for Urban Adaptive Traffic Signal Control
Abstract
With the rapid development of urban traffic, urban roads have taken on more functions and the road network has become denser; research on adaptive traffic signal control began abroad in the 1960s. Adaptive traffic signal control is a promising approach to alleviating urban traffic congestion. However, because urban traffic systems are nonlinear, dynamic, uncertain, fuzzy, and complex, traditional adaptive signal control systems and intelligent control methods, despite certain achievements, cannot fully adapt to rapidly changing traffic flow and depend heavily on traffic models. Reinforcement learning requires no mathematical model of the external environment and little prior knowledge about it, and it can achieve good learning performance in large-scale, complex nonlinear systems; agent-based reinforcement learning methods proposed by many researchers in recent years therefore have broad prospects in adaptive traffic signal control. This thesis first defines one agent for each signalized intersection, i.e., an intersection traffic signal control agent, analyzes the process and effectiveness of standard reinforcement learning for adaptive traffic signal control, and studies the application of several typical reinforcement learning algorithms to adaptive traffic signal control, including a distributed Nash Q-learning method, a multi-interaction history learning method, and a policy gradient ascent method. The main contributions and innovations of the thesis are as follows:
(1) Construction of an architecture model for the intersection traffic signal control agent
To cope with the disturbance-prone, dynamic, and uncertain nature of intersection traffic flow, a hybrid architecture model of the intersection traffic signal control agent was constructed on the basis of the BDI (belief-desire-intention) agent model, fusing cognitive and reactive agent structures according to a "perception-cognition-behavior" pattern.
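Purely as an illustration of such a "perception-cognition-behavior" hybrid structure, the following minimal Python sketch shows one possible way to organize the agent; the class name IntersectionSignalAgent, its methods (perceive, react, deliberate, act), and the queue threshold are hypothetical and are not taken from the thesis.

    class IntersectionSignalAgent:
        """Hypothetical hybrid (reactive + BDI-style deliberative) signal control agent."""

        def __init__(self, intersection_id, phases):
            self.id = intersection_id
            self.phases = phases          # candidate signal phases
            self.beliefs = {}             # perceived traffic state (queues per phase)
            self.desires = ["minimize_delay"]
            self.intention = None         # currently committed phase

        def perceive(self, detector_data):
            # Perception layer: update beliefs from detector measurements,
            # here assumed to be queue lengths keyed by phase.
            self.beliefs["queues"] = detector_data.get("queues_by_phase", {})

        def react(self):
            # Reactive layer: handle urgent situations without deliberation,
            # e.g. extend the current phase if some queue exceeds a threshold.
            queues = self.beliefs.get("queues", {})
            if queues and max(queues.values()) > 30:
                return "extend_current_phase"
            return None

        def deliberate(self):
            # Cognitive (BDI) layer: choose the phase that best serves the desires,
            # here simply the phase with the longest queue.
            queues = self.beliefs.get("queues", {})
            if not queues:
                return self.phases[0]
            self.intention = max(queues, key=queues.get)
            return self.intention

        def act(self, detector_data):
            # One "perception-cognition-behavior" cycle: react first, deliberate otherwise.
            self.perceive(detector_data)
            return self.react() or self.deliberate()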
(2) Implementation of standard reinforcement learning algorithms for adaptive traffic signal control
Standard reinforcement learning is used to control intersection traffic signals. An independent standard reinforcement learning algorithm was first designed to control the signals of a single intersection and compared with fixed-time control, verifying the effectiveness of independent standard reinforcement learning control. To address the curse of dimensionality of the independent algorithm, it was extended by introducing a coordination mechanism, yielding a coordination-based standard reinforcement learning algorithm whose convergence and effectiveness were analyzed in comparison with the independent algorithm.
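For concreteness, a minimal tabular Q-learning sketch for a single intersection is given below, assuming the state is a discretized queue pattern, the action is the next green phase, and the reward is the negative total queue length; these modelling choices and the class and method names are illustrative assumptions rather than the thesis's exact formulation.

    import random
    from collections import defaultdict

    class SingleIntersectionQLearning:
        """Minimal tabular Q-learning for one signalized intersection (illustrative)."""

        def __init__(self, phases, alpha=0.1, gamma=0.9, epsilon=0.1):
            self.phases = phases                  # action set: candidate green phases
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            self.q = defaultdict(float)           # Q[(state, action)] -> value

        def choose_phase(self, state):
            # Epsilon-greedy selection over signal phases.
            if random.random() < self.epsilon:
                return random.choice(self.phases)
            return max(self.phases, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # One-step Q-learning update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(self.q[(next_state, a)] for a in self.phases)
            td_target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

    # Example step: state is a (queue_NS, queue_EW) pair, reward the negative total queue.
    agent = SingleIntersectionQLearning(phases=["NS_through", "EW_through"])
    a = agent.choose_phase(state=(3, 5))
    agent.update(state=(3, 5), action=a, reward=-8, next_state=(2, 6))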
(3) Design of a distributed Nash Q-learning method for adaptive traffic signal control
Considering the interdependence of traffic flows between intersections, an n-player non-zero-sum Markov game was used to build a mathematical model of the interaction among intersection traffic signal control agents, and a distributed Nash Q-learning algorithm was proposed to solve this model. In the proposed algorithm, each agent's choice of signal-timing action depends not only on its own Q-function but also on the Q-functions of the other agents; the selected timing action is a Nash equilibrium of the current Q-functions of all intersection agents, so each agent updates its Q-values under joint timing actions and incomplete information. Theoretical analysis and simulation experiments demonstrate the convergence of the algorithm, and comparisons with signal control based on independent reinforcement learning, fixed-time control, and algorithms from the related literature verify its effectiveness.
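The following simplified Python sketch conveys the flavor of such a Nash Q-update for two adjacent intersections; it restricts attention to pure-strategy equilibria found by enumeration (general-sum stage games may admit only mixed-strategy equilibria), and the class, method, and parameter names are illustrative assumptions, not the thesis's algorithm.

    import itertools
    from collections import defaultdict

    class NashQLearner:
        """Illustrative two-agent Nash Q-learning sketch (pure-strategy equilibria only)."""

        def __init__(self, actions_0, actions_1, alpha=0.1, gamma=0.9):
            self.actions = (actions_0, actions_1)  # timing-plan choices of the two agents
            self.alpha, self.gamma = alpha, gamma
            # As in Nash Q-learning, a Q-table over joint actions is kept
            # both for the agent itself and for the other agent.
            self.q = (defaultdict(float), defaultdict(float))

        def pure_nash_value(self, state):
            # Enumerate joint actions and return the Q-values of a pure-strategy Nash
            # equilibrium of the stage game defined by both Q-tables.
            for a0, a1 in itertools.product(self.actions[0], self.actions[1]):
                best0 = max(self.q[0][(state, (b, a1))] for b in self.actions[0])
                best1 = max(self.q[1][(state, (a0, b))] for b in self.actions[1])
                if (self.q[0][(state, (a0, a1))] >= best0
                        and self.q[1][(state, (a0, a1))] >= best1):
                    return self.q[0][(state, (a0, a1))], self.q[1][(state, (a0, a1))]
            return 0.0, 0.0                        # fallback if no pure equilibrium exists

        def update(self, state, joint_action, rewards, next_state):
            # Q_i(s, a) <- (1 - alpha) * Q_i(s, a) + alpha * (r_i + gamma * NashQ_i(s'))
            nash_next = self.pure_nash_value(next_state)
            for i in (0, 1):
                old = self.q[i][(state, joint_action)]
                self.q[i][(state, joint_action)] = (
                    (1 - self.alpha) * old
                    + self.alpha * (rewards[i] + self.gamma * nash_next[i]))

    # Example: two adjacent intersections each choosing between two timing plans.
    learner = NashQLearner(["plan_A", "plan_B"], ["plan_C", "plan_D"])
    learner.update(state="peak", joint_action=("plan_A", "plan_C"),
                   rewards=(-12.0, -9.0), next_state="peak")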
(4) Design of a multi-interaction history learning method for adaptive traffic signal control
To overcome the assumptions of complete knowledge and single-encounter interaction in existing multi-agent learning coordination mechanisms for adaptive traffic signal control, game theory was used to build a repeated-interaction mathematical model among urban intersection traffic signal control agents, and a multi-interaction history learning coordination algorithm was designed by introducing a memory factor. In this model and algorithm, each intersection agent interacts with its neighboring intersection agents and updates its mixed strategy according to the payoff obtained from the chosen strategy; coordination is reached by each agent remembering and learning from the past interaction behavior of its neighbors, with recent behavior weighted most heavily. The convergence of the algorithm was analyzed theoretically. Taking coordinated signal control of an arterial with several connected intersections as an example, the influence of the memory factor, learning probability, traffic-flow change rate, and other parameters on performance was analyzed, and a comparison with methods from the related literature shows that the method is effective and has a certain ability to adapt to dynamic environments and to coordinate.
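As a rough illustration of how a memory factor can weight recent interactions more heavily, the sketch below estimates a neighbor's strategy from its action history and, with some learning probability, shifts the agent's mixed strategy toward a best response; the payoff matrix, the memory_factor and learn_prob parameters, and all function names are hypothetical, not the thesis's formulation.

    import random

    def weighted_history_estimate(history, memory_factor, actions):
        """Estimate a neighbor's mixed strategy from its action history,
        weighting recent interactions more heavily (illustrative)."""
        weights = {a: 0.0 for a in actions}
        w = 1.0
        for past_action in reversed(history):      # most recent interaction first
            weights[past_action] += w
            w *= memory_factor                     # 0 < memory_factor < 1: older counts less
        total = sum(weights.values()) or 1.0
        return {a: weights[a] / total for a in actions}

    def update_strategy(my_strategy, neighbor_history, payoff, actions,
                        memory_factor=0.8, learn_prob=0.3):
        """With probability learn_prob, shift the mixed strategy toward the best
        response against the estimated neighbor strategy (illustrative rule)."""
        if random.random() >= learn_prob:
            return my_strategy                     # keep the current mixed strategy
        belief = weighted_history_estimate(neighbor_history, memory_factor, actions)
        # Expected payoff of each own action against the estimated neighbor strategy.
        expected = {a: sum(belief[b] * payoff[(a, b)] for b in actions) for a in actions}
        best = max(expected, key=expected.get)
        # Move a small amount of probability mass toward the best response.
        return {a: 0.9 * my_strategy[a] + (0.1 if a == best else 0.0) for a in actions}

    # Example: two coordination choices (e.g. signal offsets) for adjacent intersections.
    actions = ["offset_A", "offset_B"]
    payoff = {("offset_A", "offset_A"): 1.0, ("offset_A", "offset_B"): 0.0,
              ("offset_B", "offset_A"): 0.0, ("offset_B", "offset_B"): 1.0}
    strategy = {"offset_A": 0.5, "offset_B": 0.5}
    strategy = update_strategy(strategy, ["offset_A", "offset_A", "offset_B"],
                               payoff, actions)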
(5) Design of a policy gradient ascent method for adaptive traffic signal control
Because the state of the urban traffic environment can hardly be fully perceived by the control system, adaptive traffic signal control is treated as a POMDP (partially observable Markov decision process), and a POMDP environment model of adaptive intersection signal control was established. Building on the GPOMDP algorithm and addressing the shortcomings of general policy gradient estimation, an online NAC (natural actor-critic) algorithm that combines the advantages of natural policy gradients and value-function methods was designed for adaptive traffic signal control. Simulation experiments analyzed how the relevant parameters affect the convergence of the two algorithms; comparisons with signal control based on a saturation-balancing strategy, fixed-time control, and methods from the related literature demonstrate the effectiveness of the policy gradient ascent reinforcement learning approach and its applicability to adaptive traffic signal control.
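The sketch below is a plain (vanilla) actor-critic operating on partial observations, shown only to illustrate the actor-critic structure on which such policy gradient methods build; it uses an ordinary (not natural) gradient and a tabular critic, so it is not the thesis's online NAC algorithm, and the observation encoding, names, and parameters are illustrative assumptions.

    import math
    import random
    from collections import defaultdict

    class ActorCriticSignalController:
        """Simplified actor-critic over observations (not the thesis's online NAC):
        softmax policy with parameters theta, tabular critic V, TD(0) updates."""

        def __init__(self, phases, alpha_actor=0.01, alpha_critic=0.1, gamma=0.95):
            self.phases = phases
            self.alpha_actor, self.alpha_critic, self.gamma = alpha_actor, alpha_critic, gamma
            self.theta = defaultdict(float)        # policy parameters theta[(obs, phase)]
            self.v = defaultdict(float)            # critic: value estimate V[obs]

        def policy(self, obs):
            # Softmax (Gibbs) distribution over phases given the partial observation.
            prefs = [math.exp(self.theta[(obs, a)]) for a in self.phases]
            z = sum(prefs)
            return [p / z for p in prefs]

        def choose_phase(self, obs):
            probs = self.policy(obs)
            return random.choices(self.phases, weights=probs, k=1)[0]

        def update(self, obs, action, reward, next_obs):
            # Critic: TD(0) error; actor: gradient ascent using the TD error
            # as an estimate of the advantage.
            td_error = reward + self.gamma * self.v[next_obs] - self.v[obs]
            self.v[obs] += self.alpha_critic * td_error
            probs = self.policy(obs)
            for a, p in zip(self.phases, probs):
                grad_log = (1.0 if a == action else 0.0) - p   # d log pi / d theta[(obs,a)]
                self.theta[(obs, a)] += self.alpha_actor * td_error * grad_log

    # Example step: the observation could be a coarse queue-level code per approach.
    controller = ActorCriticSignalController(phases=["NS_green", "EW_green"])
    a = controller.choose_phase(obs="low_high")
    controller.update(obs="low_high", action=a, reward=-5.0, next_obs="low_low")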
Due to the rapid development of urban traffic and the increase in the functions and density of urban roads, research on adaptive traffic signal control began abroad in the 1960s. Adaptive traffic signal control is a promising approach to alleviating congestion. Urban transportation systems are nonlinear, dynamic, uncertain, fuzzy, and complex, so traditional adaptive traffic signal control systems and intelligent control methods, despite certain achievements, cannot fully adapt to varying traffic flow and rely heavily on traffic models. Reinforcement learning (RL) requires neither a mathematical model of the external environment nor much prior knowledge about it, so it can achieve good learning performance in large-scale, complex nonlinear systems; agent-based RL methods proposed by many scholars therefore have broad prospects for development in adaptive traffic signal control. This study employs a traffic signal control agent for each signalized intersection. Based on an analysis of the process and effectiveness of standard reinforcement learning for adaptive traffic signal control, the application of several typical reinforcement learning algorithms to adaptive traffic control is studied, including the distributed Nash Q-learning algorithm, the multi-interaction history learning algorithm, and the policy gradient ascent algorithm. The focus and innovative achievements of the thesis are as follows:
(1) Construction of the architecture model for the intersection traffic signal control agent
Given the disturbance-prone, dynamic, and uncertain nature of intersection traffic flow, a hybrid architecture model for the intersection traffic signal control agent was established by fusing cognitive and reactive agent structures, based on the agent BDI theoretical model and following the "perception-cognition-behavior" pattern.
(2) Realization of standard reinforcement learning algorithms for adaptive traffic signal control
An independent standard reinforcement learning method, Q-learning, was used for intersection traffic signal control, and the realization process of the Q-learning algorithm was analyzed. Compared with the traditional fixed-time control method, Q-learning proved effective. To address the curse of dimensionality of the independent standard reinforcement learning algorithm, the algorithm was extended by introducing a coordination mechanism. The convergence and effectiveness of the coordination-based standard reinforcement learning algorithm were analyzed in comparison with independent standard reinforcement learning.
(3) Design of the distributed Nash Q-learning algorithm for adaptive traffic signal control
Considering the interdependence of traffic flow between intersections, a mathematical model of the interaction among intersection traffic signal control agents was built based on an n-player non-zero-sum Markov game, and a distributed Nash Q-learning algorithm was put forward to solve the model. In the proposed algorithm, each intersection agent selects its action according to not only its own Q-values but also the Q-values of the other intersection agents; the selected action is a Nash equilibrium of the current Q-values of all intersection agents. This lets each agent update its Q-values under joint actions and imperfect information. Theoretical analysis and simulation results show that the method is convergent, and comparisons with an independent reinforcement learning algorithm, fixed-time control, and algorithms from the related literature verify its effectiveness.
(4) Design of the multi-interaction history learning coordination algorithm for adaptive traffic signal control
In view of the deficiencies of the complete-knowledge and single-interaction assumptions in existing multi-agent learning coordination mechanisms for adaptive traffic signal control, a multi-interaction mathematical model of intersection traffic signal control agents was built based on game theory, and a multi-interaction history learning algorithm was constructed by introducing a memory factor. In the proposed model and algorithm, each intersection agent plays a coordination game with its neighbors and updates its mixed strategy according to the payoff obtained, taking into account the interaction history of its neighboring intersection agents; the learning rule assigns greater significance to recent payoff information than to older information. The convergence of the approach was analyzed theoretically, and the influence of parameters such as the memory factor, learning probability, and local traffic change probability on the algorithm's performance was examined. A comparison with a method from the related literature, in an experiment on coordinated control of the main intersections of an arterial, indicates that the method is effective.
(5) Design of the policy gradient approach for adaptive traffic signal control
As the state of the urban traffic environment can hardly be completely perceived by the control system, adaptive traffic signal control is treated as a POMDP (Partially Observable Markov Decision Process) problem, and a POMDP environment model of adaptive intersection signal control was established. Building on the GPOMDP algorithm and addressing the shortcomings of general policy gradient estimation, the online NAC (natural actor-critic, OLNAC) algorithm for adaptive traffic signal control was designed by combining the natural gradient with value-function methods. How the related parameters affect the convergence of the two algorithms was analyzed through simulation experiments. Comparisons with the saturation-balancing technique (SAT), the uniform technique, the random technique, and methods from the related literature show that the proposed algorithms are effective and have a certain applicability to adaptive traffic signal control.
