分布式系统后向恢复容错技术研究

英文题名：Study on Backward Recovery of Fault Tolerant Technology in Distributed Systems
作者：刘翠英
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：分布式系统 ; 容错 ; 检查点开销 ; 非阻塞 ; 回卷恢复
英文关键词：distributed systems ; fault-tolerance ; checkpoint overhead ; non-blocking ; rollback recovery
学位年度：2011
导师：高胜法
学科代码：081201
学位授予单位：山东大学
论文提交日期：2011-04-20

摘要

近年来,对高可靠性和高可用性的分布式计算系统的应用需求一直在稳定地增长,比如全球个人以及军用通信系统、航空控制系统、网络管理平台、金融系统等。随着分布式计算系统中应用范围的扩展以及节点数量的增加,网络异构问题也日益突出,基于分布式系统设计的软件系统也越来越庞大、复杂,系统中出现故障的概率越来越高,如果不采取容错措施,一旦分布式应用被故障中断,就要重新启动系统、重新执行应用,那么所要执行的任务可能需要很长时间才能完成,甚至根本完成不了。因此,研究分布式系统应用中的容错技术具有重大的理论指导意义和实际应用价值。后向恢复技术是当今容错技术研究领域的热点,包括以下几个研究方向：检查点算法(包括提高检查点设置的效率、降低检查点的开销、有效地控制回卷的距离等)：容错回卷回复的系统模型；算法的性能评估和优化策略；分布式计算系统的故障特征和检测；捕获和恢复进程状态等。
     本课题的提出来源于山东省自然科学基金项目“基于后向恢复的异构分布式系统容错技术的研究与实现”。本文介绍了分布式系统容错技术的研究现状、分布式系统中的常见故障以及容错技术涉及的相关概念、定义；指出了分布式容错系统存在的必须解决的各种问题,如：孤儿消息、传输中消息、检查点开销,多米诺效应等问题；介绍了如何消除非全局一致的检查点状态的条件和定理；分析了分布式系统容错技术中各种检查点技术和各种消息日志技术的原理、性能和优缺点；分析了影响检查点算法性能的瓶颈因素,研究了分布式系统容错检查点算法设置的原则,比如减少检查点设置和回卷回复时进程的阻塞,提高检查点设置效率,减少控制消息的数量等。本文所做的主要工作有以下几个方面：
     1)分析研究了有限状态机扩展模型及其算法,并对该模型进行了改进,使得该模型的功能更强大,适应范围更广范。
     2)提出了一种高效的异步存储非阻塞的协调检查点算法ASNB,从三个方面考虑降低检查点设置时的开销：允许多个进程并发的在进程状态信息量较小的时候设置检查点；在稳固存储器空闲的时候异步存储检查点；设置检查点的过程中不需要阻塞进程的基本执行。
     3)给出ASNB算法的改进算法,使得进程在设置检查点时只卷入有依赖关系的最少的进程设置强制检查点,非常适用于进程对计算损失敏感度有较大差异的系统,使不同的进程可以采用不同的间隔设置检查点,对于每个进程设置检查点频率差别较大的系统,大大减少了其设置检查点时的开销。
In recent years, high reliability and high availability applications demand of distributed computing systems has been steadily growing, such as the global personal and military communication systems, air traffic control systems, network management platform, and financial system. With the expansion of the scope of the application and the increase of the number of nodes in distributed computing system, the issues of network heterogeneous are increasingly prominent; the software systems are increasingly large and complex in the design of distributed computing system, failure probability of system are getting more and more high; If there is no fault tolerance, the calculation is interrupted, the previous calculation done will be lost, calculations need to begin from scratch. This catastrophic phenomenon is intolerable. Without fault tolerance in the distributed computing system, computing task may take a long time to complete, not even complete. Therefore, the research of fault-tolerant technology in distributed application has great theoretical significance and practical value. Backward recovery technology is a hot research field in current research of fault-tolerant technology, including the following research directions:checkpointing algorithm( including the improvement of the efficiency of the taking the checkpoints, the reduce of the overhead of the checkpointing, and effectively control of the volume of recovery, etc.); the fault-tolerant system model of fault-tolerant rollback recovery; performance evaluation and optimization strategies of the algorithm; fault characteristics and detection of the distributed computing system; and capture and recovery of the process status.
     The proposed project comes from the Natural Science Foundation of Shandong Province, "The research and implementation of fault-tolerant technology based backward recovery in heterogeneous distributed systems". This paper describes the current research situations of fault-tolerant in distributed systems, common faults in distributed system and relevant concepts and definitions in fault-tolerant technology; the various problems to be resolved in distributed fault-tolerant system, such as: orphan message, in-transmit message, checkpoint overhead, and the problem of domino effect; introducing the conditions and theorems of how to eliminate the non-global consistent checkpoint state; analysising the principles, performance and advantages and disadvantages of various checkpointing technology and messages logging in distributed fault-tolerant system; analysising the bottlenecks of impacting the performance of checkpointing algorithm, and studying the principles of the taking of the distributed fault-tolerant checkpointing algorithm, such as reducing the number of checkpoints, improving the efficiency of taking checkpoints, and reducing the number of the control messages. The main work of this paper are the following:
     First, this paper analyzes the limitations of Extended Finite State Machine and Fault Tolerant Mechanism. The model was improved.
     Second, this paper proposes an efficient non-blocking coordinated checkpoint algorithm. On the algorithm, more processes can concurrently take consistent global checkpoints. The algorithm reduces the overhead by saving the state asynchronously and taking checkpoint when the amount of state information to be saved is small. The algorithm greatly lowers the overhead of checkpoint and improves system's performance.
     Third, an algorithm improved upon ASNB is proposed which has better adapt to the different sensitivity process on the faulty, so different processes can be used to set different intervals checkpoint.

引文

[1]Elnozahy E N, Alvisi L, Wang Y M, Johnson D B. A survey of rollback-recovery protocols in message-passing systems[J].ACM Computing Surveys,2002,34 (3): 375～408.
    [2]L. Lamport. Using time instead of timeout for fault-tolerant distributed systems.' In ACM Transactions on Programming Languages,6(2):254—280, Apr.1984.
    [3]Schlichting R D, Schneider F B. Fail-stop processors:An approach to designing fault-tolerant computing systems[J].ACM Transactions on Computer Systems. 1983,1 (3),222-238.
    [4]Yi-Min Wang,Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints. IEEE Transactions on Computers Volume 46, Issue 4 (April 1997) Pages:456-468
    [5]Qiangfeng Jiang,Yi Luo,D.Manivannan, An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems. Journal of Parallel and Distributed Computing Volume 68,2008. Issue 12 Pages:1575-1589
    [6]Lyu M R. Software Fault Tolerance. USA:John Wiley & Sons Ltd. 1995, ISBN:0471950688,1-337.
    [7]Shengfa Gao,Xin Li,Ruihua Zhang, The Extended Finite State Machine and Fault Tolerant Mechanism in Distributed Systems,2009 Seventh ACIS International Conference on Software Engineering Research, Management and Applications, Haikou, China,December 02-December 04
    [8]D.Manivannan,Robert H.B.Netzer,Mukesh Singhal, Finding Consistent Global Checkpoints in a Distributed Computation. IEEE Transactions on Parallel and Distributed Systems, Volume 8, Issue 6 (June 1997), Pages:623-627
    [9]罗元盛,闵应骅,张大方,基于索引的准同步检查点重新计时策略.计算机工程与科学,2005,Vol.28 No.10
    [10]张宇；洪炳熔；基于检测点设置依赖图和属性表的卷回恢复算法.计算机研究与发展,Journal of Computer Research and Development,2001年02期
    [11]H. Zhong and J. Nieh. CRAK:Linux checkpoint/restart as a kernel module. Technical Report CUCS-014·01, Department of Computer Science, Columbia University, 2001.
    [12]J. Leon, A.L. Ficher, et al. "Fail-safe PVM:A portable package for distributed programming with transparent recovery." Technical Report CMU-CS-93-124, School of Computer Science, Camegie Mellon University,Feb.1993.
    [13]汪东升,裴丹,张悠慧,ChaRM:并行程序运行回卷恢复与进程迁移系统,软件学报,1998.
    [14]V. Strumpen, B. Ramkumar. Portable Checkpointing for Heterogeneous Architectures. Fault-Tolerant Parallel and Distributed Systems,1998,4:73-92
    [15]Chung P E, Lee W J, Huang Y, et al. Winckp:A Transparent Checkpointing and Rollback Recovery Tool for Windows NT Applications. In 29th Annual International Symposium on Fault-tolerant Computing, Wisconsin,1999, pp. 220-223.
    [16]T. Tannenbaum and M. Litzkow, The Condor Distributed Processing System,Dr. Dobb's Journal, (2):pp40. -48, feb.1995
    [17]Plank J S, Beck M, Kingsley G, et al. Libckpt: Transparent Checkpointing under Unix. In USENIX 1995 Technical Conference Proceedings, USENIX Press, New Orleans,1995, pp.213-223.
    [18]G-M. Chiu and C-R. Young. "Efficient rollback-recovery technique in distributed computing systems." In IEEE Transactions, vol.7, no.6, Jun.1996.
    [19]O.P. Damani and V. K. Garg. "How to recover efficiently and asynchronously when optimism fails." In Proceedings of the 16th International Conference on Distributed Computing, pp.108—115, May 1996.
    [20]A. Duda. "The effects of checkpointing on program execution time." In Information Processing Letters, no.16, pp.221—229,1983.
    [21]Cao G and Singhal M. On coordinated checkpointing in distributed systems. IEEE Transactions,1998,9 (12):1213-1225.
    [22]Ni W, Vrbsky S V, Ray S. Low-cost Coordinated Nonblocking Checkpointing in Mobile Computing Systems. In 8th IEEE International Symposium on Computers and Communications, Turkey,2003, pp.1427-1434
    [23]O. Babaoglu and K. Marzullo. "Consistent global states of distributed systems: Fundamental concepts and mechanisms."Distributed Systems, Ed. S. Mullender, Addison-Wesley, pp.55—96,1993.
    [24]G. Deconinck, J. Vounckx, R. Lauwereins, and J. Peperstraete. "Survey of backward error recovery techniques for multicomputers based on checkpointing and rollback." International Journal of Modeling and Simulation, 18(1):66—71,1998.
    [25]N. S. Bowen and D. K. Pradhan. "Processor- and memory-based checkpoint and rollback recovery." In IEEE Computer,26(2):22—32, Feb.1993.
    [26]G. Barigazzi and L. Strigini. "Application-transparent setting of recovery points." In Proceedings of the Thirteenth International Symposium on Fault-Tolerant Computing Systems, FTCS-13, pp.48—55,1983.
    [27]Cao G, Singhal M. Checkpointing with mutable checkpoints. Theoretical Computer Science,2003,290:1127-1148
    [28]L. Alvisi, S. Rao, and H.M. Vin. "Low-overhead protocols for fault-tolerant file sharing." In Proceedings of the IEEE 18th International Conference on Distributed Computing Systems, pp.452—461, May 1998.
    [29]B. Bieker, G. Deoninck, E. Maehle and J. Vounckx. "Reconfiguration and checkpointing in massively parallel systems. "In Proceedings of the 1st European Dependant Computing Conference, EDCC-1, pp.353—370, Oct.1994.
    [30]O.P. Damani and V. K. Garg. "How to recover efficiently and asynchronously when optimism fails." In Proceedings of the 16th International Conference on Distributed Computing, pp.108—115, May 1996.
    [31]A. Sistla and J. Welch. "Efficient distributed recovery using message logging." In Proceedings of the 8th Annual ACMSymposium on Principles of Distributed Computing (PODC), pp.223—238, Aug.1989.
    [32]R. E. Strom, S.A. Yemini and D. F. Bacon. "A recoverable object store." In Proceedings of the Hawaii Internationa Conference on System Sciences, pp. H-215—Ⅱ221, Jan.1988.
    [33]A. Acharya and B. R. Badrinath. "Recording distributed snapshots based on causal order of message delivery." In Information Processing Letters, vol.44, no. 6, Dec.1992.
    [34]D.F. Bacon. "File system measurements and their application to the design of efficient operation logging algorithms." In Proceedings of the 10th Symposium on Reliable Distributed Systems, pp.21—30, Oct.1991.
    [35]Lamport L. Time, Clocks and the Ordering of Events in a Distributed System. Communications of the ACM,1978,21 (7):558-565.
    [36]Abdel-Shafi H, Speight E, Bennett J K. Efficient User-level Thread Migration and Checkpointing on Windows NT Cluster. In 3nd USENIX Windows NT Symposium, USENIX Press, Seattle,1999, pp.1-10
    [37]Holzmann G. J.. The model checker SPIN. IEEE Transactions on Software Engineering,1997,23 (5):279-295
    [38]Douglas F, Outerhout J. Process Migration in the Sprite Operating System. In 7th international conference on Distributed Computing Systems,1987, pp.18-25.
    [39]Dieter W R, James E L, Jr. A User-level Checkpointing Library for POSIX hreads Programs. In 29th Annual International Symposium on Fault-tolerant Computing Systems (FTCS'99), Madison,1999,pp.224-227.
    [40]李凯原,杨孝宗,减少检查点开销的一种方法,计算机工程与应用,2000.2.
    [41]Nasika R, Dasgupta P. Transparent Migration of Distributed Communicating Processes. In 13th ISCA international conference on parallel and distributed computing systems (PDCS-2000), August 2000.
    [42]Ziv A, Bruce J. Performance Optimization of Checkpointing Schemes with Task Duplication.IEEE Transactions on Computers,1997,46(12) 1381-1386
    [43]L. Alvisi and K. Marzullo. "Message logging:Pessimistic, optimistic and causal.' In Proceedings of the IEEE International Conference on Distributed Computing Systems, May 1995.
    [44]S. Rao, L. Alvisi, H.M. Vin, "The Cost of Recovery in Message Logging Protocols," srds, pp.10, The 17th IEEE Symposium on Reliable Distributed Systems,1998.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700