一种自适应的多臂赌博机算法 (An Adaptive Algorithm in Multi-Armed Bandit Problem)
  • Title (English): An Adaptive Algorithm in Multi-Armed Bandit Problem
  • Authors: 章晓芳 (Zhang Xiaofang); 周倩 (Zhou Qian); 梁斌 (Liang Bin); 徐进 (Xu Jin)
  • Affiliations: School of Computer Science and Technology, Soochow University; State Key Laboratory for Novel Software Technology (Nanjing University)
  • Keywords: reinforcement learning; multi-armed bandit; exploration and exploitation; adaptation; contextual
  • Journal: 计算机研究与发展 (Journal of Computer Research and Development); CNKI journal code: JFYZ
  • Publication date: 2019-03-15
  • Year: 2019
  • Volume / Issue: Vol. 56, No. 03
  • Pages: 193-204 (12 pages)
  • Record ID: JFYZ201903019
  • CN: 11-1777/TP
  • Language: Chinese
  • Funding: National Natural Science Foundation of China (61772263, 61772014, 61572375); Suzhou Science and Technology Development Plan (SYG201807)
Abstract
As an important and active field in machine learning, reinforcement learning has received extensive attention in recent years. The multi-armed bandit (MAB) problem is a typical instance of the exploration-exploitation dilemma in reinforcement learning, and the stochastic multi-armed bandit (SMAB) problem is the classical MAB setting on which many newer MAB variants are built. To address the insufficient use of feedback information and the poor generalization ability of existing MAB methods, this paper presents an adaptive SMAB algorithm that balances exploration and exploitation based on the chosen number of the arm with minimal estimation, CNAME for short. CNAME uses both the chosen counts and the value estimates of the actions, so that an action is selected according to an exploration probability that is updated adaptively. To control the decline rate of the exploration probability, a parameter w is introduced to adjust how strongly feedback influences the selection process. Furthermore, CNAME does not depend on contextual information and therefore generalizes better across different bandit scenarios. An upper bound on CNAME's regret is proved and analyzed theoretically. Experimental results in different scenarios show that CNAME efficiently obtains higher reward and lower regret than commonly used methods, and that its generalization ability is strong.
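The abstract sketches how CNAME adapts its exploration rate: the exploration probability is driven by how often the arm with the smallest current value estimate has been chosen, and a parameter w controls how fast that probability declines. Since the full paper is not reproduced here, the snippet below is only a minimal Python sketch of one plausible reading of that idea; the class name CNAMEBandit, the specific schedule w / (w + n_min), and the toy Bernoulli setup are illustrative assumptions, not the authors' exact method.

```python
import random


class CNAMEBandit:
    """Minimal sketch of the adaptive exploration idea from the abstract.

    Assumed (hypothetical) form: the exploration probability is
    w / (w + n_min), where n_min is the number of times the arm with the
    currently lowest value estimate has been chosen, and w > 0 controls
    how quickly exploration decays.
    """

    def __init__(self, n_arms, w=1.0):
        self.n_arms = n_arms
        self.w = w
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def exploration_prob(self):
        # n_min: pull count of the arm whose current estimate is minimal
        worst = min(range(self.n_arms), key=lambda a: self.values[a])
        n_min = self.counts[worst]
        return self.w / (self.w + n_min)

    def select(self):
        if random.random() < self.exploration_prob():
            return random.randrange(self.n_arms)  # explore uniformly
        # exploit: arm with the highest current estimate
        return max(range(self.n_arms), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental update of the sample mean
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Toy usage on three Bernoulli arms (illustrative only).
if __name__ == "__main__":
    probs = [0.2, 0.5, 0.8]
    bandit = CNAMEBandit(n_arms=3, w=2.0)
    for _ in range(10_000):
        arm = bandit.select()
        reward = 1.0 if random.random() < probs[arm] else 0.0
        bandit.update(arm, reward)
    print(bandit.values, bandit.counts)
```

Under this reading, a larger w keeps the exploration probability high for longer, while pulling the worst-estimated arm more often drives the probability toward zero, which matches the abstract's description of feedback-controlled decay; the paper's actual update rule and regret analysis should be consulted for the precise algorithm.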
References
[1]Sutton R, Barto G. Reinforcement Learning: An Introduction[M]. Cambridge, MA: MIT Press, 1998: 1- 20
    [2]Gao Yang, Zhou Ruyi, Wang Hao, et al. Study on an average reward reinforcement learning algorithm[J]. Chinese Journal of Computers, 2007, 30(8): 1372- 1378 (in Chinese)(高阳, 周如益, 王皓, 等. 平均奖赏强化学习算法研究[J]. 计算机学报, 2007, 30(8): 1372- 1378)
    [3]Zhao Fengfei, Qin Zheng. A multi-motive reinforcement learning framework[J]. Journal of Computer Research and Development, 2013, 50(2): 240- 247 (in Chinese)(赵凤飞, 覃征. 一种多动机强化学习框架[J]. 计算机研究与发展, 2013, 50(2): 240- 247)
    [4]Liu Quan, Fu Qiming, Yang Xudong, et al. A scalable parallel reinforcement learning method based on intelligent scheduling[J]. Journal of Computer Research and Development, 2013, 50(4): 843- 851 (in Chinese)(刘全, 傅启明, 杨旭东, 等. 一种基于智能调度的可扩展并行强化学习方法[J]. 计算机研究与发展, 2013, 50(4): 843- 851)
    [5]Liu Zhibin, Zeng Xiaoqin, Liu Huiyi, et al. A heuristic two-layer reinforcement learning algorithm based on BP neural networks[J]. Journal of Computer Research and Development, 2015, 52(3): 579- 587 (in Chinese)(刘智斌, 曾晓勤, 刘惠义, 等. 基于BP神经网络的双层启发式强化学习方法[J]. 计算机研究与发展, 2015, 52(3): 579- 587)
    [6]Robbins H. Some aspects of the sequential design of experiments[J]. Bulletin of the American Mathematical Society, 1952, 58(5): 527- 535
    [7]Devanur N R, Kakade S M. The price of truthfulness for pay-per-click auctions[C] //Proc of ACM Conf on Electronic Commerce. New York: ACM, 2009: 99- 106
    [8]Cheng Shi, Mao Luhong, Chang Peng. Multi-armed bandit recommender algorithm with matrix factorization[J]. Journal of Chinese Computer Systems, 2017, 38(12): 2754- 2758 (in Chinese)(成石, 毛陆虹, 常鹏. 融合矩阵分解的多臂赌博机推荐算法[J]. 小型微型计算机系统, 2017, 38(12): 2754- 2758)
    [9]Yu Yonghong, Gao Yang, Wang Hao, et al. Integrating user social status and matrix factorization for item recommendation[J]. Journal of Computer Research and Development, 2018, 55(1): 113- 124 (in Chinese)(余永红, 高阳, 王皓, 等. 融合用户社会地位和矩阵分解的推荐算法[J]. 计算机研究与发展, 2018, 55(1): 113- 124)
    [10]Jain S, Gujar S, Zoeter O, et al. A quality assuring multi-armed bandit crowdsourcing mechanism with incentive compatible learning[C] //Proc of the 13th Int Conf on Autonomous Agents and Multi-Agent Systems. New York: ACM, 2014: 1609- 1610
    [11]Jain S, Narayanaswamy B, Narahari Y. A multi-armed bandit incentive mechanism for crowdsourcing demand response in smart grids[C] //Proc of the 28th AAAI Conf on Artificial Intelligence and the 26th Innovative Applications of Artificial Intelligence Conf. Menlo Park, CA: AAAI, 2014: 721- 727
    [12]Sun Xinxin. Research on task assignment technology in crowdsourcing environment[D]. Yangzhou: Yangzhou University, 2016 (in Chinese)(孙信昕. 众包环境下的任务分配技术研究[D]. 扬州: 扬州大学, 2016)
    [13]Zhou Baojian. Research on search engine keyword optimal selection strategy based on multi-armed bandit[D]. Harbin: Harbin Institute of Technology, 2014 (in Chinese)(周保健. 基于多臂赌博机的搜索引擎关键字最优选择策略研究[D]. 哈尔滨: 哈尔滨工业大学, 2014)
    [14]Chen Hongcui. Research on channel selection mechanism based on multi-armed bandit in cognitive network[D]. Chongqing: Chongqing University of Posts and Telecommu-nications, 2016 (in Chinese)(陈红翠. 认知网络中基于赌博机模型的信道选择机制研究[D]. 重庆: 重庆邮电大学, 2016)
    [15]Bubeck S, Cesa-Bianchi N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems[J]. Foundations and Trends in Machine Learning, 2012, 5(1): 1- 22
    [16]Slivkins A. Contextual bandits with similarity information[J]. Journal of Machine Learning Research, 2014, 15(1): 2533- 2568
    [17]Watkins C J C H. Learning from delayed rewards[J]. Robotics & Autonomous Systems, 1989, 15(4): 233- 235
    [18]Luce R D. Individual choice behavior[J]. American Economic Review, 1959, 67(1): 1- 15
    [19]Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multiarmed bandit problem[J]. Machine Learning, 2002, 47(2/3): 235- 256
    [20]Li Lihong, Chu Wei, Langford J, et al. A contextual-bandit approach to personalized news article recommendation[C] //Proc of ACM Int Conf on World Wide Web. New York: ACM, 2010: 661- 670
    [21]Chu Hongmin, Lin Hsuan Tien. Can active learning experience be transferred?[C] //Proc of IEEE Int Conf on Data Mining. Piscataway, NJ: IEEE, 2017: 841- 846
    [22]Krishnamurthy B, Wills C, Zhang Yin. On the use and performance of content distribution networks[C] //Proc of the 1st ACM SIGCOMM Internet Measurement Workshop. New York: ACM, 2001: 169- 182
    [23]Thathachar M A L, Sastry P S. A new approach to the design of reinforcement schemes for learning automata[J]. IEEE Transactions on Systems Man & Cybernetics, 1985, SMC-15(1): 168- 175
    [24]Even-Dar E, Mannor S, Mansour Y. PAC bounds for multi-armed bandit and Markov decision processes[C] //Proc of the 15th Annual Conf on Computational Learning Theory (COLT). Berlin: Springer, 2002: 255- 270
    [25]Cesa-Bianchi N, Fischer P. Finite-time regret bounds for the multi-armed bandit problem[C] //Proc of the Int Conf on Machine Learning. New York: ACM, 1998: 100- 108
    [26]Strens M. Learning Cooperation and feedback in pattern recognition[D]. London: Physics Department, King’s College London, 1999
    [27]Vermorel J, Mohri M. Multi-armed bandit algorithms and empirical evaluation[C] //Proc of the European Conf on Machine Learning. Berlin: Springer, 2005: 437- 448
    [28]Kirkpatrick S, Gelatt C D, Vecchi M P. Optimization by simulated annealing[J]. Science, 1983, 220(4598): 671- 680
    [29]Lai T L, Robbins H. Asymptotically efficient adaptive allocation rules[J]. Advances in Applied Mathematics, 1985, 6(1): 4- 22
    [30]Xia Yingce, Qin Tao, Ding Wenkui, et al. Finite budget analysis of multi-armed bandit problems[J]. Neurocomputing, 2017, 258: 13- 29
    [31]Szörényi B, Busa-Fekete R, Weng P. Qualitative multi-armed bandits: A quantile-based approach[C] //Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 1660- 1668
    ① https://webscope.sandbox.yahoo.com
