Research and Implementation of Computer Game-Playing Strategies Based on Reinforcement Learning
Abstract
As an important branch of artificial intelligence, computer game playing has developed extremely rapidly. It studies problems of strategy and contests of wits, and belongs to the problem-solving and search techniques of artificial intelligence. The core idea of game playing is, in essence, the combination of evaluating the nodes of the game tree with searching the game tree. Evaluation is the hardest problem to handle in all kinds of games, and the accuracy of the position evaluation largely determines the playing strength of a game program.

Based on reinforcement learning, this thesis studies several key techniques of computer game playing. A static evaluation function depends on human knowledge of the game, and its assessments are not accurate enough. To address this problem, the TD(λ) algorithm is combined with a BP neural network into the BP-TD(λ) algorithm. The algorithm uses a BP neural network as the position evaluation function and applies TD(λ) to learn directly from raw experience, automatically adjusting the parameters of the evaluation function. This converts the supervised learning of the BP neural network into unsupervised learning and avoids the defect that, under supervised learning, the adjusted parameter values are easily influenced by human experience. To further improve the performance of game training, a strategy of setting the parameter values in separate stages for the opening and the middlegame is proposed: when the opening-stage parameters are being set, moves are chosen with a random move-selection strategy; when the middlegame-stage parameters are being set, moves are chosen with a minimax selection strategy.
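The weight update behind such a BP-TD(λ) combination can be made concrete with a small example. The following is a minimal sketch, assuming a one-hidden-layer network over a flat board-feature vector; the layer sizes, learning rate, and trace-decay values here are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of a TD(lambda) update applied to a BP (feedforward) network
# used as a position evaluator. All hyperparameters are illustrative assumptions.
import numpy as np

class BPTDLambdaEvaluator:
    def __init__(self, n_inputs, n_hidden=40, alpha=0.01, gamma=1.0, lam=0.7):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_inputs))  # input -> hidden weights
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)               # hidden -> output weights
        self.b2 = 0.0
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight, cleared at the start of each game.
        self.eW1 = np.zeros_like(self.W1)
        self.eb1 = np.zeros_like(self.b1)
        self.eW2 = np.zeros_like(self.W2)
        self.eb2 = 0.0

    def value(self, x):
        # Forward pass: V(s) in (0, 1), read as a win estimate for the side to move.
        h = np.tanh(self.W1 @ x + self.b1)
        v = 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))
        return v, h

    def update(self, x, reward, next_value, terminal):
        # One TD(lambda) step: delta = r + gamma*V(s') - V(s), with gradient traces.
        v, h = self.value(x)
        target = reward if terminal else reward + self.gamma * next_value
        delta = target - v

        # Backpropagate dV/dw through the sigmoid output and tanh hidden layer.
        dv_dz2 = v * (1.0 - v)
        grad_W2 = dv_dz2 * h
        grad_b2 = dv_dz2
        dz1 = dv_dz2 * self.W2 * (1.0 - h * h)
        grad_W1 = np.outer(dz1, x)
        grad_b1 = dz1

        # e <- gamma*lambda*e + grad V(s);  w <- w + alpha*delta*e
        self.eW1 = self.gamma * self.lam * self.eW1 + grad_W1
        self.eb1 = self.gamma * self.lam * self.eb1 + grad_b1
        self.eW2 = self.gamma * self.lam * self.eW2 + grad_W2
        self.eb2 = self.gamma * self.lam * self.eb2 + grad_b2
        self.W1 += self.alpha * delta * self.eW1
        self.b1 += self.alpha * delta * self.eb1
        self.W2 += self.alpha * delta * self.eW2
        self.b2 += self.alpha * delta * self.eb2
        return delta
```

In such a setup the self-play loop computes the successor value V(s') itself and passes it to update(); the traces are cleared with reset_traces() at the start of every game, and a nonzero reward is only supplied at the terminal position.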
With the above methods and strategies, and taking Gomoku (Renju) as the model, TDRenju, a Gomoku playing system based on reinforcement learning, was implemented. Through the improvement and enhancement of the evaluation component, its playing strength is increased.
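The staged move-selection strategy can likewise be sketched. The code below assumes a 15x15 Gomoku board stored as a NumPy array (0 for empty, +1/-1 for the two players) and an evaluate(board, player) callable such as the network above; the opening cut-off of 8 plies and the search depth of 2 are hypothetical values, and a practical program would restrict candidate moves to the neighbourhood of existing stones rather than searching every empty point.

```python
# Minimal sketch of phase-dependent move selection for self-play training:
# random moves in the opening, shallow minimax on the learned evaluator afterwards.
import random
import numpy as np

OPENING_MOVES = 8   # assumed cut-off: the first 8 plies count as the opening
SEARCH_DEPTH = 2    # assumed minimax depth for the middlegame

def legal_moves(board):
    # Every empty intersection is a legal move in plain Gomoku.
    return [tuple(p) for p in np.argwhere(board == 0)]

def minimax(board, player, depth, evaluate):
    # Depth-limited search over the learned evaluation; returns (value, move).
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return evaluate(board, player), None
    best_value, best_move = -float("inf"), None
    for r, c in moves:
        board[r, c] = player
        value, _ = minimax(board, -player, depth - 1, evaluate)
        board[r, c] = 0
        value = -value          # negamax convention: the opponent's value is negated
        if value > best_value:
            best_value, best_move = value, (r, c)
    return best_value, best_move

def select_move(board, player, ply, evaluate):
    # Opening: random exploration so training sees diverse positions.
    # Middlegame: exploit the learned evaluator through shallow minimax search.
    if ply < OPENING_MOVES:
        return random.choice(legal_moves(board))
    _, move = minimax(board, player, SEARCH_DEPTH, evaluate)
    return move
```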
