分布式系统自愈调控关键技术研究

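English title: Research on Key Technologies for Self-healing Scheduling in Distributed Systems
Author: 卢旭 (Lu Xu)
Degree level: Master's
Discipline: Computer Architecture (计算机系统结构)
Chinese keywords: 自愈调控 ; POMDP ; 失效预测 ; 流形学习 ; 任务调度
English keywords: Self-healing Scheduling ; POMDP ; Failure Prediction ; Manifold Learning ; Task Scheduling
Degree year: 2009
Supervisor: 王慧强 (Wang Huiqiang)
Discipline code: 081201
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Submission date: 2009-03-02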

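Abstract (Chinese version)

Self-healing scheduling is an essential means of building dependable computer systems and an important guarantee of high system availability. Traditional failure recovery techniques for distributed systems rely mainly on costly redundancy and manual administration. As manual repair after a failure grows more difficult and expensive, how to repair failures autonomously without human intervention while maintaining high availability has become an active research topic. This thesis addresses the main problems in self-healing scheduling for distributed systems; its contributions are as follows:
     A partially observable Markov decision process (POMDP) model of failure recovery in distributed systems is proposed and solved with the Fast Informed Bound (FIB) value iteration method, which addresses the generation of recovery policies under inaccurate failure detection. To measure how close the recovery policy is to optimal, the FIB iteration is first proved to converge and to be a contraction mapping; bounds on the optimal solution of the POMDP model are then given, and the Maximum Bound Difference of the optimal solution is used to estimate the error of the near-optimal solution. Simulation experiments on a network security situation awareness system show that, compared with other recovery policies, the POMDP policy performs better at failure recovery under inaccurate detection.
     A fast recovery model for task failures in distributed systems, based on microreboot, is proposed. Unlike existing recovery models, it considers not only recovery time but also the reliability cost of recovery, and is therefore more realistic and precise. Building on this model, a real-time recovery algorithm based on extended Bayesian analysis is proposed. The algorithm aims to improve system availability and takes reliability cost into account when time priorities are equal. Finally, the recoverability of the algorithm is proved, providing a theoretical basis for automatic failure recovery.
     To address failure prediction, a failure prediction method based on manifold learning is proposed. Nonlinear dimensionality reduction is applied to failure feature extraction through a feature extraction method based on supervised Hessian locally linear embedding, which enables automatic extraction of intrinsic failure features and failure prediction without human intervention. A failure prediction testbed was built around a small LAN instant messaging system, and preliminary experimental results indicate that the manifold-learning approach is feasible.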
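
For readers unfamiliar with the Fast Informed Bound, the following is a minimal Python sketch of the standard FIB value iteration applied to a toy recovery problem. It is not the thesis's model: the two states (healthy, failed), the two actions (wait, reboot), the detector error rates, the rewards, and the names `fib_alpha_vectors` and `greedy_action` are all assumptions chosen for illustration.

```python
def fib_alpha_vectors(T, O, R, gamma, iterations=200):
    """Fast Informed Bound value iteration (one alpha-vector per action).

    T[a][s][s2]: probability of moving from s to s2 under action a
    O[a][s2][o]: probability of observing o after action a lands in s2
    R[a][s]:     immediate reward for taking action a in state s
    Returns alpha[a][s], an upper bound on the optimal POMDP value.
    """
    n_actions, n_states, n_obs = len(T), len(T[0]), len(O[0][0])
    alpha = [[0.0] * n_states for _ in range(n_actions)]
    for _ in range(iterations):
        new_alpha = []
        for a in range(n_actions):
            row = []
            for s in range(n_states):
                total = 0.0
                for o in range(n_obs):
                    # FIB picks the best follow-up action per observation,
                    # but from a known current state s (hence an upper bound).
                    total += max(
                        sum(T[a][s][s2] * O[a][s2][o] * alpha[a2][s2]
                            for s2 in range(n_states))
                        for a2 in range(n_actions)
                    )
                row.append(R[a][s] + gamma * total)
            new_alpha.append(row)
        alpha = new_alpha
    return alpha


def greedy_action(alpha, belief):
    """Pick the action whose alpha-vector scores highest at this belief."""
    values = [sum(b * v for b, v in zip(belief, row)) for row in alpha]
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical toy model. States: 0 = healthy, 1 = failed.
# Actions: 0 = wait, 1 = reboot. Observations: 0 = "ok", 1 = "alarm".
T = [
    [[0.9, 0.1], [0.0, 1.0]],   # wait: healthy may fail, failed stays failed
    [[1.0, 0.0], [1.0, 0.0]],   # reboot: always returns to healthy
]
O = [
    [[0.9, 0.1], [0.2, 0.8]],   # inaccurate detector: misses 20% of failures
    [[0.9, 0.1], [0.2, 0.8]],
]
R = [
    [0.0, -10.0],               # waiting while failed is costly
    [-2.0, -2.0],               # a reboot has a fixed cost
]

alpha = fib_alpha_vectors(T, O, R, gamma=0.9)
for belief in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    print(belief, "->", ["wait", "reboot"][greedy_action(alpha, belief)])
```

Because the discount factor is below 1, each sweep is a contraction and the vectors settle to a fixed point; in this toy model the greedy policy waits when the system is believed healthy and reboots once the belief in a failure is high enough.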

Abstract (English version)

Self-healing scheduling is critical to the dependability of computer systems and is also a guarantee of high availability. Traditional failure recovery techniques depend heavily on redundancy and on administrators' domain knowledge. Because manual failure recovery is costly and difficult, self-healing has become an important research topic in dependable computing. This thesis studies the main problems in this field. Our main contributions are summarized as follows:
     First, to overcome the challenge of generating recovery policies in the presence of inaccurate failure detection, we present a failure recovery model for microrebootable distributed systems based on a discounted partially observable Markov decision process (POMDP). Reasonable recovery policies are generated by solving the POMDP model. Because an exact solution is computationally expensive, a value-function approximation called the fast informed bound (FIB) is used to obtain near-optimal policies. In addition, lower and upper bounds on the optimal value function are derived, and the maximum bound difference is used to estimate the error of the near-optimal value function. Simulation experiments on a realistic network security situation awareness system show that the proposed model can be solved effectively and that the resulting policies outperform the other recovery policies considered.
     Second, we present a microreboot-based model of task failure recovery in distributed systems. Unlike other models, it considers not only recovery time but also the reliability cost of recovery, and is therefore more realistic and precise. Building on this model, we present a real-time task failure recovery algorithm based on extended Bayesian analysis, which takes reliability cost into account when recovery-time priorities are equal. To provide a theoretical foundation for failure recovery, we prove the recoverability of the algorithm.
     Finally, we present a failure prediction method based on manifold learning. To extract failure features for prediction, we apply a nonlinear dimensionality reduction algorithm, supervised Hessian locally linear embedding. We then use a k-nearest-neighbors classifier for classification. The experimental results show that the manifold learning approach can effectively find the intrinsic features of failures, which makes failure prediction based on manifold learning feasible.
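
The classification step named in the last contribution can be sketched as follows. This is a minimal, assumption-laden illustration of a k-nearest-neighbors vote only: it presumes the failure features have already been reduced to a few dimensions, it does not implement supervised Hessian locally linear embedding, and the sample points, labels, and the name `knn_predict` are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embedded features (not data from the thesis).
train_points = [
    (0.1, 0.2), (0.0, -0.1), (0.3, 0.1),   # samples from normal operation
    (2.9, 3.1), (3.2, 2.8), (3.0, 3.3),    # samples preceding a failure
]
train_labels = ["normal"] * 3 + ["pre-failure"] * 3

print(knn_predict(train_points, train_labels, (0.2, 0.0)))
print(knn_predict(train_points, train_labels, (2.8, 3.0)))
```

With two classes, an odd `k` avoids tied votes; the quality of the prediction depends mostly on how well the embedding separates normal and pre-failure samples.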

