Research on Checkpointing and Rollback Recovery Fault-Tolerant Techniques for Mobile Computing Environments

Original title (Chinese): 移动计算环境下检查点回卷恢复容错技术研究
Author: 徐振朋 (Xu Zhenpeng)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Mobile Computing; Fault Tolerance; Event Log; Rollback Recovery; Checkpoint Interval
Degree year: 2011
Advisor: 门朝光 (Men Chaoguang)
Discipline code: 081203
Degree-granting institution: 哈尔滨工程大学 (Harbin Engineering University)
Thesis submission date: 2011-09-20

Abstract

Rapid progress in high-performance computing, the Internet, wireless communication, distributed computing, pervasive computing, and cloud computing has greatly advanced mobile computing technology. Compared with traditional wired distributed environments, mobile computing systems offer rapid ad hoc deployment, autonomy, node mobility, a flexible network topology, and equivalence, which gives them broad application prospects. However, processes in a mobile computing environment fail with a far higher probability than in a traditional wired distributed system, and the checkpoint and rollback recovery mechanisms designed for wired distributed systems are not suitable for mobile computing. Designing an efficient checkpoint and rollback recovery fault-tolerant mechanism for mobile computing systems is therefore worthwhile. Starting from the state of the art and the open problems of such mechanisms, this dissertation studies checkpoint and rollback recovery fault-tolerant techniques for mobile computing environments. Its contributions are:
     (1) Storage and maintenance of process logs in mobile computing. A low-overhead process event logging mechanism is proposed, based on the m-MSS-m model and the piecewise-deterministic execution model assumption. In this mechanism, the mobile support station (MSS) uniformly stores and maintains the checkpoints, event logs, and happened-before partial-order dependencies of the mobile host processes in its cell. Each process's checkpoint information and event log entries are recorded as determinants in a one-dimensional array, and the happened-before relation among the events a process experiences is represented by the order of the array elements. Process logs are first written synchronously to the high-speed memory of the MSS and are flushed asynchronously to reliable storage only when specific events trigger it.
     (2) Rollback recovery of failed processes in mobile computing. A rollback recovery mechanism for failed processes is proposed to match the proposed logging mechanism; together, the two form an event-log-based checkpoint and rollback recovery fault-tolerant mechanism. When the fault-tolerance log is complete, a failed process can recover to a consistent state independently and asynchronously. When the log is incomplete, consistent recovery is still achieved in coordination with the other computing processes in the same cell.
     (3) Handoff maintenance of process fault-tolerance information in mobile computing. To balance system performance between failure-free execution and post-failure rollback recovery, a weak handoff management mechanism based on partitioning redundant information is proposed. The mobile support station logically splits the fault-tolerance information of a mobile host process into a core part and a non-core part, and during a handoff the two parts are maintained at different times and scheduled in different ways. Constraints that determine the sizes of the core and non-core parts are also derived.
     (4) Determination of the process checkpoint interval in the fault-tolerant mechanism. For Poisson-distributed failures, an analytical model based on the Laplace transform is proposed for solving the equidistant checkpoint interval, so as to safeguard the overall performance of the mechanism. For other failure distributions, an expression for the average system utilization (the average effective computation rate of a process) is derived based on a simple checkpoint timing scheme. From this expression an optimality constraint on the checkpoint interval is obtained, and a general scheduling algorithm is developed that determines a quasi-optimal checkpoint sequence.
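     To make the trade-off behind item (4) concrete, the following minimal sketch compares a numerically searched equidistant checkpoint interval with Young's well-known first-order approximation, sqrt(2C/λ), for Poisson failures. It is not the dissertation's Laplace-transform model. It assumes a simplified setting that is not taken from the source: every failure restarts the current segment from its beginning, the restart itself costs nothing, and the function names and parameter values (`ckpt_cost`, `failure_rate`) are hypothetical.

```python
import math


def expected_segment_time(work: float, ckpt_cost: float, failure_rate: float) -> float:
    """Expected wall-clock time to finish `work` seconds of computation plus one
    checkpoint, when failures form a Poisson process with rate `failure_rate`
    and each failure restarts the segment from scratch (assumed zero restart cost)."""
    span = work + ckpt_cost
    # Standard result for exponential failures: E[T] = (exp(rate * span) - 1) / rate
    return math.expm1(failure_rate * span) / failure_rate


def utilization(work: float, ckpt_cost: float, failure_rate: float) -> float:
    """Fraction of wall-clock time spent on useful work (between 0 and 1)."""
    return work / expected_segment_time(work, ckpt_cost, failure_rate)


def young_interval(ckpt_cost: float, failure_rate: float) -> float:
    """Young's first-order approximation of the optimum interval."""
    return math.sqrt(2.0 * ckpt_cost / failure_rate)


def best_interval_by_search(ckpt_cost: float, failure_rate: float,
                            upper: float, step: float = 1.0) -> float:
    """Brute-force search for the interval that maximizes utilization."""
    best, best_u = step, utilization(step, ckpt_cost, failure_rate)
    t = step
    while t <= upper:
        u = utilization(t, ckpt_cost, failure_rate)
        if u > best_u:
            best, best_u = t, u
        t += step
    return best


if __name__ == "__main__":
    C, lam = 60.0, 1e-4  # hypothetical: 60 s checkpoint cost, one failure per 10,000 s
    t_young = young_interval(C, lam)
    t_best = best_interval_by_search(C, lam, upper=5000.0)
    print(f"Young: {t_young:.1f} s, searched: {t_best:.1f} s")
    print(f"utilization at searched interval: {utilization(t_best, C, lam):.4f}")
```

     Under these assumptions the two intervals lie within a few percent of each other, which illustrates why a closed-form equidistant interval is attractive for the Poisson case, while other failure distributions call for a more general search over checkpoint sequences.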
     Performance analysis shows that the proposed event-log-based checkpoint and rollback recovery mechanism performs well in recording and maintaining process checkpoints and logs, recording and maintaining the happened-before relation among process states, rolling back and recovering failed processes, maintaining fault-tolerance information across handoffs, and determining optimized checkpoint interval sequences. The results provide an effective fault-tolerance measure for improving the reliability of mobile computing systems.
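     The logging and recovery ideas summarized above (determinants kept in a one-dimensional array whose order encodes the happened-before relation, synchronous writes to fast memory, and asynchronous flushes to reliable storage) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the dissertation's mechanism: the class names, the two determinant kinds, and the `lose_volatile` failure model are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Determinant:
    kind: str               # "checkpoint" or "receive" (illustrative kinds only)
    sender: Optional[str]   # sending process for a "receive", else None
    payload: Any            # saved state for a checkpoint, message for a receive


@dataclass
class ProcessLog:
    """Log of one mobile host process, kept at a hypothetical support station."""
    volatile: List[Determinant] = field(default_factory=list)  # fast memory
    stable: List[Determinant] = field(default_factory=list)    # reliable storage

    def record(self, det: Determinant) -> None:
        # Synchronous write to memory; append order is the happened-before order.
        self.volatile.append(det)

    def flush(self) -> None:
        # Triggered by some event (assumed here: a checkpoint or a handoff).
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def lose_volatile(self) -> None:
        # Models a station failure that wipes the unflushed entries.
        self.volatile.clear()

    def replay_plan(self) -> Tuple[Optional[Determinant], List[Determinant]]:
        """Return the latest checkpoint and the events to replay after it."""
        entries = self.stable + self.volatile  # stable entries are always older
        last_ckpt, start = None, 0
        for i, det in enumerate(entries):
            if det.kind == "checkpoint":
                last_ckpt, start = det, i + 1
        return last_ckpt, [d for d in entries[start:] if d.kind == "receive"]


if __name__ == "__main__":
    log = ProcessLog()
    log.record(Determinant("checkpoint", None, {"counter": 0}))
    log.record(Determinant("receive", "mh2", "m1"))
    log.flush()
    log.record(Determinant("receive", "mh3", "m2"))
    ckpt, events = log.replay_plan()
    print(ckpt.payload, [e.payload for e in events])
```

     With a complete log, the failed process can restore the checkpoint and replay every logged receive in array order on its own. If the unflushed tail is lost, the replay plan is shorter, which corresponds to the incomplete-log case in which other processes in the cell have to take part in recovery.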
