Reinforcement learning based order acceptance policy in make-to-order enterprises
  • Authors: WANG Xiao-huan; WANG Ning-ning; FAN Zhi-ping (School of Business Administration, Northeastern University)
  • Keywords: revenue management; order acceptance; SMART algorithm; average profit; reinforcement learning
  • Journal: Systems Engineering-Theory & Practice (系统工程理论与实践)
  • Institution: School of Business Administration, Northeastern University
  • Publication date: 2014-12-25
  • Year: 2014
  • Volume: v.34
  • Issue: 12
  • Pages: 115-123 (9 pages)
  • CN: 11-2267/N
  • Language: Chinese
  • Fund: National Natural Science Foundation of China (71201020); Fundamental Research Funds for the Central Universities (N120406002); China Postdoctoral Science Foundation (2013M540233)
  • Record ID: XTLL201412012
Abstract
        For make-to-order (MTO) enterprises facing uncertainty in order acceptance decisions, a semi-Markov decision process based order acceptance model (SMDP-OA model) is built from the perspective of revenue management, drawing on reinforcement learning. The model accounts for the production cost, tardiness penalty cost, and rejection cost of an incoming order, and introduces customer level as a decision factor. On this basis, a SMART-algorithm-based method for solving the optimal order acceptance policy is proposed, aiming to maximize the long-run profit of MTO enterprises. Simulation results show that the order acceptance policy obtained with the SMART algorithm outperforms the policy obtained under first-come-first-served (FCFS) acceptance; further simulation experiments and data analysis on customer level verify the necessity and importance of incorporating this factor.
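The SMART-based approach described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only: the cost and revenue figures, the load-dependent tardiness penalty, and the toy shop dynamics in `simulate_step` are hypothetical assumptions, not values from the paper. It shows the core SMART idea of semi-Markov average-reward learning: each state-action value is updated toward `reward - rho * sojourn_time + max future value`, with the average reward rate `rho` re-estimated from greedy (non-exploratory) steps only.

```python
import random

# Toy SMART (Semi-Markov Average Reward Technique) sketch for order acceptance.
# All parameters (revenue, costs, rates, load cap) are illustrative assumptions.

ACTIONS = (0, 1)  # 0 = reject the incoming order, 1 = accept it


def simulate_step(load, action, rng):
    """Return (reward, sojourn_time, next_load) for a toy MTO shop."""
    sojourn = rng.expovariate(1.0)            # time until the next order arrives
    if action == 1:
        revenue = 10.0                        # hypothetical order revenue
        prod_cost = 4.0                       # hypothetical production cost
        tardiness = 2.0 * max(0, load - 2)    # penalty grows with the backlog
        reward = revenue - prod_cost - tardiness
        next_load = min(load + 1, 4)
    else:
        reward = -1.0                         # hypothetical rejection cost
        next_load = load
    completed = int(rng.random() < 0.5)       # one job may finish meanwhile
    return reward, sojourn, max(next_load - completed, 0)


def smart(steps=20000, alpha=0.05, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    R = {(s, a): 0.0 for s in range(5) for a in ACTIONS}  # action values
    rho, total_r, total_t = 0.0, 0.0, 0.0                 # avg-reward estimate
    load = 0
    for _ in range(steps):
        greedy = max(ACTIONS, key=lambda a: R[(load, a)])
        explore = rng.random() < epsilon
        action = rng.choice(ACTIONS) if explore else greedy
        r, tau, nxt = simulate_step(load, action, rng)
        # SMART update: subtract the average reward earned over the sojourn.
        target = r - rho * tau + max(R[(nxt, a)] for a in ACTIONS)
        R[(load, action)] += alpha * (target - R[(load, action)])
        if not explore:                       # update rho on greedy steps only
            total_r += r
            total_t += tau
            rho = total_r / total_t
        load = nxt
    policy = {s: max(ACTIONS, key=lambda a: R[(s, a)]) for s in range(5)}
    return policy, rho


policy, rho = smart()
```

In this toy setting accepting is profitable at every load level, so the learned policy tends toward acceptance; with a revenue closer to the costs, the policy would start rejecting orders at high backlog, which is the rationing behavior the paper's model is designed to capture. Extending the state with a customer-level component, as the paper does, only enlarges the keys of `R`.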
References
[1] Carr S, Duenyas I. Optimal admission control and sequencing in a make-to-stock/make-to-order production system[J]. Operations Research, 2000, 48(5): 709-720.
[2] Wu A W D, Chiang D M H. The impact of estimation error on the dynamic order admission policy in B2B MTO environment[J]. Expert Systems with Applications, 2009, 36(9): 11782-11791.
[3] Slotnick S A. Order acceptance and scheduling: A taxonomy and review[J]. European Journal of Operational Research, 2011, 212(1): 1-11.
[4] Balakrishnan N, Patterson J W, Sridharan V. Rationing capacity between two product classes[J]. Decision Sciences, 1996, 27(2): 185-214.
[5] Barut M, Sridharan V. Revenue management in order-driven production systems[J]. Decision Sciences, 2005, 36(2): 287-316.
[6] Zhang Xin, Ma Shihua. Order acceptance with limited capacity and finite output buffers in MTO environment[J]. Industrial Engineering and Management, 2008, 13(2): 34-38.
[7] Zhang Renqian. Comparative study of order selection decision considering time series association[J]. Journal of Management Sciences in China, 2009, 12(3): 44-55.
[8] Rom W O, Slotnick S A. Order acceptance using genetic algorithms[J]. Computers and Operations Research, 2009, 36(6): 1758-1767.
[9] Oguz C, Salman F S, Yalcin Z B. Order acceptance and scheduling in make-to-order systems[J]. International Journal of Production Economics, 2010, 125(1): 200-211.
[10] Fan Lifan, Chen Xu. Order acceptance policy based on EMSR model[J]. Management Review, 2010, 22(4): 109-113.
[11] Hung Y F, Lee T Y. Capacity rationing decision procedures with order profit as a continuous random variable[J]. International Journal of Production Economics, 2010, 125(1): 125-136.
[12] Fan Lifan, Chen Xu. Order pricing and acceptance policy in make-to-order firm based on revenue management[J]. Systems Engineering, 2011, 29(2): 87-93.
[13] Li X P, Wang J, Sawhney R. Reinforcement learning for joint pricing, lead-time and scheduling decisions in make-to-order systems[J]. European Journal of Operational Research, 2012, 221(1): 99-109.
[14] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. Cambridge, MA: MIT Press, 1998.
[15] Cheng Xiangjun, Chang Xinshi, Yang Zhaoxia. A traffic signal control method based on Q-learning[J]. Systems Engineering-Theory & Practice, 2006, 26(8): 136-140.
[16] Ni Jianjun, Liu Minghua, Ren Li, et al. Reinforcement learning for DSS based on multi-agent model: A case of lake water environment DSS[J]. Systems Engineering-Theory & Practice, 2012, 32(8): 1777-1783.
[17] Huang Bingqiang, Cao Guangyi, Li Jianhua. Average reward reinforcement learning theory, algorithms and its application[J]. Computer Engineering, 2007, 33(18): 18-19.
[18] Gao Yang, Zhou Ruyi, Wang Hao, et al. Study on an average reward reinforcement learning algorithm[J]. Chinese Journal of Computers, 2007, 30(8): 1372-1378.
[19] Gosavi A, Bandla N, Das T K. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking[J]. IIE Transactions, 2002, 34(9): 729-742.
[20] Huang Bingqiang. Research on the reinforcement learning and its applications[D]. Shanghai: Shanghai Jiaotong University, 2007.
[21] Darken C, Moody J. Learning rate schedules for faster stochastic gradient search[M]. IEEE Press, 1992.
[22] Das T K, Gosavi A, Mahadevan S, et al. Solving semi-Markov decision problems using average reward reinforcement learning[J]. Management Science, 1999, 45(4): 560-574.