Research on Autonomic Recovery Mechanisms for Distributed Mission-Critical Systems

Chinese title: 面向分布式关键任务系统的自律恢复机制研究
Author: Ye Haizhi (叶海智)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: Distributed Mission-Critical System ; Autonomic Computing ; Detection ; Decision-making ; Recovery
Degree year: 2010
Supervisor: Wang Huiqiang (王慧强)
Discipline code: 081203
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Thesis submission date: 2009-12-01

Abstract

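With the continuing development of technology and changing application requirements, the availability demanded of distributed mission-critical systems keeps rising: a system is expected not only to preserve the integrity of critical business data, but also to run without interruption or, when a failure does occur, to recover automatically in the shortest possible time. However, as the variety of system functions and the structural complexity keep growing, and with malicious attacks and software defects present, failures occur frequently and failure scenarios are diverse and unpredictable. This makes tracing, analyzing, and recovering from the root cause of a failure extremely difficult, and creates an urgent need for systems that can detect failures themselves, make recovery decisions intelligently for different failure scenarios, and carry out self-recovery. Against this background, this research on autonomic recovery mechanisms for distributed mission-critical systems aims to combine the recently proposed autonomic computing technology with detection techniques, recovery techniques, and decision-making methods, so that through sound design the system can recover by itself with little human intervention, ensuring the availability and continuity of its application services.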
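     At present, autonomic computing is still in its infancy, and research applying it to failure recovery in distributed mission-critical systems remains scarce. Many questions are still open, such as how to build an autonomic recovery framework for the system and how to perform failure detection, recovery decision-making, and recovery execution. Given this situation, this thesis takes improving the system's self-recovery capability as its goal, focuses on the recovery of application services, and follows the failure detection, decision-making, and recovery methods for application components and the runtime environment as its main thread, studying the system's autonomic recovery mechanism in depth.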
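     First, in line with the research goal and the characteristic requirements of the system, and starting from the basic ideas of autonomic computing, a framework model for system autonomic recovery, DARA (DMCS Autonomic Recovery Architecture), is constructed. The model is divided in turn into a knowledge layer, a management layer, and a target layer, and its overall structure forms an autonomic recovery control loop of failure detection, recovery decision, and recovery execution. Supported by recovery management knowledge consisting of the system entity model, the state model, and recovery policies, it can effectively reduce the complexity of system recovery management. The model is also formally described and verified using the π-calculus, which demonstrates that it is sound.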
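     Second, failure detection for system autonomic recovery is studied from two aspects: the detection method and the message transfer mechanism. For the detection method, to meet the needs for accurate runtime-environment failure detection and for locating the root cause of a failure, a hybrid-mode detection method, A-Hybrid, is proposed. The method uses information such as the server model and the host model to detect and locate failed objects. For message transfer, in view of the need for loosely coupled message exchange between detectors and monitored objects, a publish/subscribe-based detection message transfer mechanism is given. Experimental results show that A-Hybrid not only detects failed objects with high accuracy but also locates the root cause of the failure, providing a reliable basis for the subsequent recovery decision for the runtime environment.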
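     Third, decision-making methods for system autonomic recovery are studied for both application components and the runtime environment. For application components, to address the inefficient decision-making caused by strongly correlated failures, a recovery decision method based on reboot tree optimization is given. The method first computes the failure relevancy degree (FRD) between components, merges highly correlated components into a single reboot group to optimize the reboot tree, and then produces a recovery plan for the suspected failed components from the reboot tree and the detection results. Application results show that the method makes decisions efficiently and favors fast recovery of application components. For the runtime environment, given the diversity of its failure scenarios, a decision method based on AI planning is proposed. The dependencies among entities in the environment are used to describe the domain, the initial state and goal state are determined from the detection results and the target policy, and a planner then generates the recovery plan. Experimental results show that the method can intelligently produce a suitable recovery plan for different failure scenarios, laying the foundation for recovering the environment.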
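     Finally, failure recovery methods are studied for both application components and the runtime environment. For application components, focusing on recovery from transient failures and targeting the high-availability requirement of application services, a multi-granularity microreboot recovery method is proposed. By partitioning reboot objects and wrapping them as restartable elements of different granularities, the method enables more effective reboot-based recovery. Experimental results show that, compared with the ordinary microreboot method, it reduces reboot recovery time by 48%, significantly improving the availability of application services. For the runtime environment, a script-based recovery method is given, with emphasis on the correspondence between recovery plans and scripts. The generation time of recovery plans and their scripts under different degrees of runtime-environment failure is studied experimentally, so as to provide flexible environment recovery schemes for the differing needs of specific mission-critical systems.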

Abstract (English)

With the development of information technology and changing requirements, the availability demands on distributed mission-critical systems keep rising: not only must the integrity of critical mission data be guaranteed, but the system should also run without interruption or recover automatically within a short time when failures happen. However, with increasing system scale and complexity, as well as inevitable faults such as malicious attacks and software bugs, system failures happen frequently, and the tracking, analysis, and recovery of failures have become extremely difficult. Therefore, the abilities of self-monitoring, self-diagnosis, intelligent decision-making according to different failure scenarios, and self-recovery are urgently needed. Autonomic computing technology provides a new research direction for addressing this issue. By combining autonomic computing with detection technologies, recovery technologies, and decision-making methods, a distributed mission-critical system can recover from failures automatically, and its high availability is guaranteed.
     However, autonomic computing is still in its infancy, and its application to failure recovery in distributed mission-critical systems is still lacking. Many basic issues, such as how to build an autonomic recovery architecture for the system and how to implement autonomic failure detection, decision-making, and recovery, still need careful study. The autonomic recovery mechanism of the system is therefore studied in depth in order to improve the system's self-recovery abilities.
     Firstly, to meet the system's specific requirements, the architecture DARA (DMCS Autonomic Recovery Architecture) is proposed based on the autonomic computing concept. In this architecture, autonomic recovery is divided into a knowledge level, a management level, and a target level, and the functionality of each level is analyzed. From the architectural perspective, a failure "detection, decision-making, recovery" control loop is formed, which can lower the complexity of autonomic recovery. A system recovery management knowledge base, including the system entity dependency model, the state model, and the management strategy, is then built to support failure detection and recovery. The architecture is formalized and validated using the π-calculus to demonstrate its soundness.
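
The control loop described above can be pictured with a minimal sketch. This is not the DARA implementation: the `Knowledge` class, the state labels ("ok", "failed"), and the action names are hypothetical, chosen only to show how one pass of detection, decision-making, and recovery might share a common knowledge store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Knowledge:
    """Hypothetical stand-in for the knowledge level."""
    state: Dict[str, str] = field(default_factory=dict)   # entity -> "ok" | "failed"
    policy: Dict[str, str] = field(default_factory=dict)  # entity -> recovery action


def detect(kb: Knowledge) -> List[str]:
    # Detection: report every entity whose recorded state is "failed".
    return sorted(e for e, s in kb.state.items() if s == "failed")


def decide(kb: Knowledge, failed: List[str]) -> List[Tuple[str, str]]:
    # Decision: look up an action per failed entity, defaulting to "restart".
    return [(e, kb.policy.get(e, "restart")) for e in failed]


def execute(kb: Knowledge, plan: List[Tuple[str, str]]) -> List[str]:
    # Recovery: pretend each action succeeds and record what was done.
    log = []
    for entity, action in plan:
        kb.state[entity] = "ok"
        log.append(f"{action}:{entity}")
    return log


def control_loop_once(kb: Knowledge) -> List[str]:
    # One pass of the detection -> decision -> recovery loop.
    return execute(kb, decide(kb, detect(kb)))
```

Calling `control_loop_once` on a store where only `db` is marked failed and has the policy `failover` returns `["failover:db"]`; a second call finds nothing to do and returns an empty list.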
     Secondly, failure detection for distributed mission-critical systems is studied from two aspects: the detection method and the message transfer mechanism. On the detection side, to meet the high accuracy required of runtime environment failure detection, the A-Hybrid detection method is proposed. This method detects and locates failed objects by applying the application configuration model, the server model, and the host model. On the message transfer side, in line with the requirement for loosely coupled messaging, a detection message transfer mechanism based on publish/subscribe is proposed. Experimental results show that, compared with other detection methods, the A-Hybrid method can accurately detect failures and identify the specific failed objects.
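
To make the loose coupling of publish/subscribe concrete, here is a small in-process sketch. It does not reproduce A-Hybrid: a plain heartbeat timeout is swapped in as the detection rule, and the `Broker` class, the `"heartbeat"` topic, and the message fields are all assumptions made for illustration. The point is only that monitored objects publish without knowing who listens, and the detector subscribes without knowing who publishes.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-process publish/subscribe broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for the topic.
        for handler in self._subscribers[topic]:
            handler(message)


class HeartbeatDetector:
    """Suspects any source whose last heartbeat is older than `timeout`."""

    def __init__(self, broker, timeout):
        self.timeout = timeout
        self.last_seen = {}  # source -> timestamp of latest heartbeat
        broker.subscribe("heartbeat", self._on_heartbeat)

    def _on_heartbeat(self, message):
        self.last_seen[message["source"]] = message["time"]

    def suspects(self, now):
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)
```

Timestamps are passed in explicitly rather than read from a clock, so the sketch stays deterministic; a real detector would more likely use a monotonic clock.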
     Thirdly, decision-making methods for autonomic recovery are studied for both application components and the runtime environment. For application components, to solve the problem of low decision-making efficiency under strongly correlated failures, a recovery decision-making method based on reboot tree optimization is proposed. To optimize the reboot tree, components with high failure correlation are united into a single reboot group by computing the failure relevancy degree, and a recovery plan is then made based on the reboot tree and the detection results. Examples show that, compared with the method without reboot tree optimization, this method achieves higher efficiency and shorter recovery time. In addition, considering the diversity of runtime environment failure scenarios, a decision-making method based on AI planning is put forward. The domain description is built from the dependencies among objects in the runtime environment, the initial state and goal state are determined by the detection results and the target policy, and the recovery plan is then produced by a planner. Experimental results show that the AI-planning-based decision-making method can effectively generate an appropriate recovery plan.
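
The grouping step for application components can be sketched as follows. The abstract does not say how the failure relevancy degree is computed or what cut-off is used, so this sketch simply assumes pairwise degrees in [0, 1] are already available and merges any pair at or above a hypothetical `threshold`; the function names and the union-find merging are illustrative choices, not the thesis's algorithm.

```python
def reboot_groups(components, frd, threshold):
    """Merge components whose pairwise degree meets the threshold.

    components: list of component names
    frd: dict mapping (a, b) pairs to an assumed relevancy degree in [0, 1]
    """
    parent = {c: c for c in components}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), degree in frd.items():
        if degree >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in components:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


def recovery_plan(groups, suspects):
    # Reboot every group that contains at least one suspected component.
    return [g for g in groups if any(c in suspects for c in g)]
```

With degrees `{("A", "B"): 0.9, ("B", "C"): 0.2, ("C", "D"): 0.8}` and a threshold of 0.7, the components fall into the groups `[["A", "B"], ["C", "D"]]`, and a suspected failure in `B` yields a plan that reboots `A` and `B` together.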
     Finally, the implementation of autonomic recovery for mission-critical systems is studied from two aspects: application components and the runtime environment. For application components, a multi-granularity microreboot method is proposed for transient failure recovery: reboot objects are partitioned and wrapped as microreboot elements of different granularities in order to achieve high availability. Experimental results show that this method needs 48% less reboot time than the general microreboot method, which helps to achieve high availability. For the runtime environment, a recovery method based on scripts is put forward, focusing on the relationship between recovery plans and scripts. Moreover, the generation time of scripts under different degrees of failure in the runtime environment is studied, so that flexible recovery plans can be provided according to the requirements of different application environments.
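
One way to read the multi-granularity idea is as an escalation over nested restartable elements: try the smallest element that covers the failed components, and move to a coarser one only if that restart does not help. The sketch below is an assumption-laden illustration, not the method evaluated in the thesis; the level names, the `try_restart` callback, and the escalation policy are all hypothetical.

```python
def microreboot(levels, failed, try_restart):
    """Restart the finest element covering `failed`, escalating on failure.

    levels: list of (name, members) ordered from finest to coarsest,
            where members is a set of component names
    failed: set of component names believed to have failed
    try_restart: callable taking an element name and returning True
                 if restarting that element cured the failure
    Returns (recovered, attempted_element_names).
    """
    attempts = []
    for name, members in levels:
        if failed <= members:          # element covers every failed component
            attempts.append(name)
            if try_restart(name):
                return True, attempts
    return False, attempts
```

For example, with levels `component:Cart`, `module:shop`, and `application`, a failure in `Cart` that is only cured at module level results in two attempts, the component first and then the module, without ever restarting the whole application.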

