Research on Sequential Decision-Making Methods and Applications for Agents in Dynamic Uncertain Environments
Abstract
Online planning and learning under uncertainty have received significant attention in the scientific community over the past few years, and it is now well recognized that accounting for uncertainty during planning and decision-making is imperative for designing robust systems. Partially observable Markov decision processes (POMDPs) provide a rich framework for modeling a wide range of sequential decision-making problems under uncertainty: the POMDP model can optimize sequential policies that are robust to sensor noise, missing information, and partial observability.
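     For concreteness, the Bayes-filter belief update at the core of the POMDP model can be written in a few lines; the sketch below is a generic textbook formulation (the array names b, T, and O are illustrative, not the dissertation's notation).

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Posterior belief after taking action a and observing o.

    b : (|S|,) current belief over hidden states
    T : (|A|, |S|, |S|) transition probabilities T[a, s, s']
    O : (|A|, |S|, |O|) observation probabilities O[a, s', o]
    """
    predicted = b @ T[a]                 # sum_s b(s) * T(s' | s, a)
    posterior = predicted * O[a][:, o]   # weight by P(o | s', a)
    norm = posterior.sum()
    if norm == 0.0:                      # observation has zero likelihood under the model
        return np.full_like(b, 1.0 / b.size)
    return posterior / norm
```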
     However, online planning and learning in POMDPs are often intractable because the belief state space suffers from two curses: the curse of dimensionality and the curse of history. To date, the use of POMDPs in real-world problems has been limited by the poor scalability of existing solution algorithms, which can only handle small-scale problems. To address these problems, this dissertation focuses on belief state space compression, online planning, and online learning, and applies the proposed methods to energy efficiency in wireless sensor networks. The main results and innovations are as follows:
     (1) A belief state space compression algorithm for factored POMDPs based on non-negative matrix factorization (NMF) update rules.
     To tackle the curse of dimensionality encountered when planning over the belief state space of factored POMDPs, the dissertation presents an algorithm that compresses the belief state space using NMF update rules in two steps. First, the algorithm adopts factored representations of states, observations, and actions by exploiting the structure of factored POMDPs, decomposes and compresses the transition functions by exploiting the conditional independence and context-specific independence encoded in dynamic Bayesian networks (DBNs), and removes zero-probability entries to reduce the sparsity of the belief state space. Second, it applies value-directed compression, so that the compressed belief states induce decisions that are as close to optimal as possible, and it uses NMF update rules instead of Krylov iterations to speed up the dimensionality reduction. The algorithm leaves the value function and reward function of the belief states unchanged after compression and preserves the piecewise linear and convex property needed to compute the optimal policy by dynamic programming. Simulation results show that the algorithm achieves a low error rate and good convergence, reducing the cost of computing policies while retaining policy quality.
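     The NMF update rules referred to above are, in their standard Lee-Seung multiplicative form, easy to state. The sketch below factorizes a non-negative belief matrix B (beliefs stacked as columns) as B ≈ WH; it illustrates only the update rules themselves, under assumed random initialization, and not the dissertation's full value-directed compression pipeline.

```python
import numpy as np

def nmf_compress(B, k, iters=200, eps=1e-9):
    """Factor a non-negative belief matrix B (|S| x n) as W (|S| x k) @ H (k x n),
    so each belief is summarized by k << |S| coefficients (a column of H)."""
    m, n = B.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ B) / (W.T @ W @ H + eps)   # multiplicative update for the coefficients
        W *= (B @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for the basis
    return W, H
```

     Because the updates are multiplicative, non-negativity is preserved automatically and no Krylov-subspace iteration is required.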
     (2) A point-based online value iteration algorithm for POMDPs.
     To address the curse of history in sequential decision-making with POMDPs, the dissertation proposes a point-based online value iteration (PBOVI) algorithm. To speed up POMDP solving, the algorithm performs value backups at specific reachable belief points rather than over the entire belief simplex. It uses branch-and-bound pruning to prune the AND/OR tree of belief states online, and it introduces the idea of reusing belief points computed at the previous decision step to avoid repeated computation. Simulation results show that the algorithm achieves a low error rate and fast convergence, reducing the cost of computing policies while retaining policy quality, so it can meet the requirements of real-time systems.
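     The value backup performed at each reachable belief point has the generic form used by point-based value iteration methods. The sketch below shows that standard backup over a current set of alpha-vectors (Gamma); it is an illustration of the family of methods, not the exact PBOVI procedure with branch-and-bound pruning and belief reuse.

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma):
    """One value backup at belief b over the alpha-vector set Gamma.

    T : (|A|, |S|, |S|) transitions, O : (|A|, |S|, |O|) observations,
    R : (|A|, |S|) rewards, gamma : discount factor.
    Returns the alpha-vector (and action) that is optimal at b.
    """
    nA = T.shape[0]
    nO = O.shape[2]
    best_alpha, best_val, best_action = None, -np.inf, None
    for a in range(nA):
        g = R[a].astype(float)
        for o in range(nO):
            # back-project every alpha-vector through (a, o) ...
            projections = [T[a] @ (O[a][:, o] * alpha) for alpha in Gamma]
            # ... and keep the one that scores best at this belief point
            g = g + gamma * max(projections, key=lambda p: p @ b)
        if g @ b > best_val:
            best_alpha, best_val, best_action = g, g @ b, a
    return best_alpha, best_action
```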
     (3) A model-based Bayesian reinforcement learning algorithm for factored POMDPs.
     To cope with the enormous number of parameters and the slow convergence of model-based Bayesian reinforcement learning in POMDPs, the dissertation presents a model-based Bayesian reinforcement learning algorithm for factored partially observable domains. First, factored representations are used to describe the dynamics with few parameters. Then, combining the agent's prior knowledge with observed data, Bayesian model-based reinforcement learning provides a principled solution to the exploration-exploitation trade-off. Finally, a point-based incremental pruning algorithm is presented to speed up convergence. Theoretical and numerical results show that the resulting discrete POMDPs approximate the underlying Bayesian reinforcement learning task well with guaranteed performance, meeting the requirements of real-time systems.
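     Model-based Bayesian reinforcement learning of this kind typically maintains Dirichlet pseudo-counts over the dynamics and balances exploration and exploitation by planning against models drawn from the posterior. The sketch below shows that generic count update and posterior sampling for a flat (non-factored) discrete model with an assumed uniform prior; the dissertation's factored learner and point-based incremental pruning are not reproduced here.

```python
import numpy as np

class DirichletDynamicsModel:
    """Dirichlet posterior over the transition probabilities T(s' | s, a)."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[a, s, s'] are Dirichlet pseudo-counts (uniform prior by default)
        self.alpha = np.full((n_actions, n_states, n_states), prior)

    def update(self, s, a, s_next):
        """Bayesian update: observing the transition (s, a, s') adds one count."""
        self.alpha[a, s, s_next] += 1.0

    def posterior_mean(self):
        """Expected transition model under the current posterior."""
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)

    def sample_model(self, rng=None):
        """Draw one transition model from the posterior (Thompson-style sampling),
        which planning can exploit while still exploring uncertain dynamics."""
        if rng is None:
            rng = np.random.default_rng()
        n_actions, n_states, _ = self.alpha.shape
        return np.array([[rng.dirichlet(self.alpha[a, s]) for s in range(n_states)]
                         for a in range(n_actions)])
```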
     (4) POMDP-based energy-efficient solutions for wireless sensor networks.
     Energy efficiency is a challenging problem in wireless sensor networks (WSNs). To tackle it, the dissertation develops energy-efficient algorithms built on the POMDP methods above. First, an energy-efficient communication algorithm for WSNs based on generalized-inverse non-negative matrix factorization is proposed: NMF further reduces the dimensionality of the feature space obtained by singular value decomposition (SVD), and multiplicative update rules quickly produce the final low-dimensional representation. Second, an energy-efficient tracking algorithm for WSNs based on belief reuse is proposed. To reduce the large error of existing tracking algorithms, a maximum-reward heuristic search approximates the optimal tracking performance; to curb excessive sensor energy consumption, belief reuse avoids repeatedly acquiring belief states, which reduces the sensors' communication energy, further lowers the POMDP value-function error, and improves tracking performance. Numerical results show that the proposed algorithms effectively optimize the trade-off between tracking performance and energy consumption, meeting the requirement of high tracking performance at low energy cost. The dissertation contains 41 figures, 11 tables, and 172 references.
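     The belief-reuse idea in the tracking algorithm can be pictured as memoizing belief updates, so that a belief already computed (or already communicated) at the previous step is not produced again. The sketch below is a minimal cache keyed by (belief, action, observation), assuming the same generic belief update as above; it illustrates the reuse mechanism only, not the dissertation's sensor-scheduling policy.

```python
import numpy as np

def make_belief_cache(T, O):
    """Return a belief-update function that reuses previously computed beliefs."""
    cache = {}

    def update(b, a, o):
        key = (b.tobytes(), a, o)
        if key not in cache:                  # compute only on a cache miss
            predicted = b @ T[a]
            posterior = predicted * O[a][:, o]
            cache[key] = posterior / posterior.sum()
        return cache[key]                     # reuse avoids recomputation and retransmission

    return update
```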