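Research on Fault-Tolerance Techniques for Transactional Memory Systems

Original title: 面向事务存储系统的容错技术研究
English title: Research on Fault Tolerance for Transactional Memory System
Author: Song Wei (宋伟)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: Fault Tolerance ; Transactional Memory ; Fault Detection ; Fault Recovery ; Fault Masking
Degree year: 2011
Supervisor: Yang Xuejun (杨学军)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2011-09-01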

Abstract

As multi-core processors have developed, transactional memory has attracted growing attention as a promising concurrency control mechanism. At the same time, as large-scale integrated circuits move to deep-submicron and even nanometer feature sizes, processors become more susceptible to electromagnetic radiation, cosmic rays, and other sources of interference, which makes processor reliability an increasingly prominent problem. Fault tolerance in transactional memory systems is therefore becoming an issue that deserves attention.
     This dissertation studies fault tolerance in transactional memory systems. Taking error propagation behavior in transactional memory systems as its theoretical foundation, it proposes theoretical methods, technical solutions, and implementation frameworks for three key problems: fault detection, fault recovery, and fault masking. The main contributions are as follows:
     1. Starting from error propagation between statements in a program's statement sequence, we analyze step by step how errors propagate in a transactional memory system. By analyzing the properties and characteristics of transactions, and focusing on the two pieces of information that fault-tolerance techniques care about most (the fault-tolerance position and the set of fault-tolerance objects), we identify two kinds of natural fault-tolerance positions in transactional memory systems together with their corresponding object sets. We prove that the two have different fault-tolerance capabilities, which reveals in theory the inherent fault-tolerance characteristics of transactional memory systems.
     2. We propose EDRT, an error detection method based on redundant transactions. EDRT creates a redundant copy of each transaction, executes the transaction and its copy concurrently, and compares the write sets of the two before commit, which yields redundancy-based error detection with low overhead. In addition, based on the characteristics of the data version management mechanism in use, we give system constraints and design guidelines for applying EDRT to transactional memory systems with eager and with lazy data versioning. The guidelines cover two aspects: how the comparison data set for error detection is obtained and compared, and how conflict detection is performed. A set of experiments shows that, compared with conventional dual modular redundancy error detection, EDRT achieves good error detection capability at lower detection overhead.
     3. We propose FRTR, a fault recovery method based on transaction rollback. FRTR uses the data version management mechanism of the transactional memory system as the "checkpoint" for recovery and completes recovery by rolling back only the single faulty transaction. By discussing the isolation property of a fault-tolerant transactional memory system that supports FRTR, we prove that single-transaction rollback is sufficient for fault recovery in transactional memory systems. A set of experiments confirms the low recovery overhead of FRTR. We further introduce the idea of parallel recomputing into FRTR to reduce its recovery overhead, and give programming guidelines for parallel recomputing on transactional memory systems for OpenTM programs. Experiments also show that this optimization is effective for transactional memory systems with relatively coarse-grained transactions.
     4. We propose TriTM, a fault-tolerance method based on triple modular redundancy. TriTM brings the idea of triple modular redundancy into the transactional memory system and uses the write sets of transactions as the data comparison set, which yields a fault masking method with low overhead. Exploiting the self-correcting capability of TriTM, we propose Opti_TriTM, a performance optimization based on optimized placement of comparison points. In addition, based on the characteristics of transactional memory systems with closed nesting, we propose an implementation of TriTM for closed-nested transactions. A set of experiments shows that TriTM has lower overhead than conventional triple modular redundancy and that Opti_TriTM is effective as a performance optimization.
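The write-set comparison that underlies the detection, recovery, and masking methods described in the abstract can be illustrated with a small sketch. It is not the dissertation's implementation. It assumes a lazy-versioning model in which a transaction body returns its buffered write set as a dictionary, and it runs the redundant copies one after another rather than concurrently. The names `run_edrt_frtr`, `run_tritm`, `make_transient_fault`, and `transfer` are hypothetical and were chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Memory = Dict[str, int]
WriteSet = Dict[str, int]
# A transaction body reads shared memory and returns its buffered writes.
# Nothing reaches shared memory until commit (lazy versioning is assumed).
TxBody = Callable[[Memory], WriteSet]


def run_edrt_frtr(body: TxBody, copy: TxBody, memory: Memory,
                  max_retries: int = 3) -> bool:
    """Detect by comparing two write sets; recover by re-running one transaction."""
    for _ in range(max_retries + 1):
        ws_original = body(memory)   # original transaction
        ws_copy = copy(memory)       # redundant copy of the same transaction
        if ws_original == ws_copy:   # comparison happens before commit
            memory.update(ws_original)  # commit the agreed write set
            return True
        # Mismatch: both buffered write sets are dropped, which acts as the
        # rollback of this single transaction, and the loop re-executes it.
    return False


def run_tritm(replicas: Sequence[TxBody], memory: Memory) -> bool:
    """Mask a fault by majority vote over three write sets."""
    write_sets = [replica(memory) for replica in replicas]
    votes = Counter(tuple(sorted(ws.items())) for ws in write_sets)
    winner, count = votes.most_common(1)[0]
    if count >= 2:                   # at least two replicas agree
        memory.update(dict(winner))
        return True
    return False                     # no majority, nothing is committed


def make_transient_fault(body: TxBody, faulty_runs: int = 1) -> TxBody:
    """Hypothetical fault injector: flips one bit in the first few executions."""
    remaining = [faulty_runs]

    def wrapped(memory: Memory) -> WriteSet:
        ws = dict(body(memory))
        if remaining[0] > 0:
            remaining[0] -= 1
            key = sorted(ws)[0]
            ws[key] ^= 1             # corrupt the lowest bit of one written value
        return ws

    return wrapped


def transfer(memory: Memory) -> WriteSet:
    """Example transaction: move 10 units from account a to account b."""
    return {"a": memory["a"] - 10, "b": memory["b"] + 10}
```

Calling `run_edrt_frtr(transfer, make_transient_fault(transfer), {"a": 100, "b": 0})` detects the corrupted write set on the first attempt, discards it, and commits on the second. `run_tritm` commits directly whenever two of the three replicas agree, so a single transient fault needs no re-execution. In both cases the unit of comparison is the write set rather than every intermediate value, which is the property the abstract credits for the low overhead.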

