Research and Implementation of a High-Availability MPI Parallel Programming Environment and Parallel Program Development Methods
Abstract
With the advance of science and technology, more and more disciplines have adopted scientific computing and numerical simulation to solve problems arising in research and engineering practice. These applications typically demand large amounts of computation, data storage, and data exchange, and large-scale parallel computer systems are currently the mainstream architecture for meeting such high-performance computing requirements.
     As parallel computer systems grow in scale, application scalability becomes harder to achieve and system reliability declines; the mean time between failures of some very large parallel computing systems is only a few hours. Under such conditions, without a high-performance, fault-tolerant environment for developing and running parallel software, many large-scale parallel applications cannot run efficiently or complete successfully, which severely limits the availability of both the system and its applications.
     Message passing is the dominant programming model for developing parallel applications, and MPI is the de facto standard message-passing interface, offering flexible support for parallel algorithms, high performance, and good portability. Centered on improving the availability of large-scale parallel computer systems and their applications, this thesis studies the problems involved in implementing a high-availability MPI parallel programming environment, including performance, scalability, and fault tolerance. In addition, since the scale of future parallel computing systems will continue to grow, the thesis also investigates, from the perspective of MPI program development, methods for designing efficient fault-tolerant parallel algorithms. The main contributions and innovations of the thesis are as follows:
     1) A new communication hardware interface, CNI, oriented to large-scale parallel computer systems is proposed, and the Communication Express (CMEX) software interface is implemented on top of it. CMEX provides protected, concurrent, fully user-level communication operations, supports zero-copy data transfer between processes, and implements both packet transfer and RDMA with connectionless semantics. It delivers high data-transfer performance and provides important support for building highly scalable software systems. We also propose basic methods for verifying the CMEX communication software interface with static program analysis and model checking, so as to ensure the correctness and reliability of the interface itself.
     2) Based on the CMEX communication software interface and the MPICH2 system, the thesis studies how to implement MPICH2-CMEX, a high-performance, scalable MPI parallel programming environment, over the RDMA communication mechanism. For performance, efficient message data transfer based on RDMA read and write operations is designed and implemented. For scalability, a dynamic feedback credit flow-control algorithm is proposed, which allows communication resources to be expanded dynamically between frequently communicating tasks so that resources are allocated more effectively; in addition, a hybrid channel data-transfer method is implemented that exploits the nearest-neighbor communication pattern of parallel applications, keeping the communication and memory resources consumed inside the MPI system under control as applications scale up, without sacrificing computational performance. Tests of real MPI applications on a large-scale parallel computer system show good speedups.
     3) To improve the fault tolerance of MPI applications, a fully user-transparent, system-level parallel checkpointing facility using a blocking coordinated checkpoint protocol is designed and implemented in the MPICH2-CMEX environment. To address the main sources of checkpointing overhead, a low-latency coordination protocol is implemented that exploits the nearest-neighbor communication pattern of parallel applications and the virtual-connection technique, and a checkpoint storage architecture based on a global parallel file system is designed, which uses a globally shared directory and parallel I/O to simplify the management of checkpoint image files and reduce the cost of storing checkpoint data. Tests with several parallel applications show that the checkpointing system has low runtime overhead, the coordination protocol scales well, and checkpoint images are stored quickly, providing effective support for the long-term reliable execution of parallel applications.
     4) To meet the fault-tolerance needs of parallel applications on future very large scale parallel computer systems, a design method for a new kind of fault-tolerant parallel algorithm (FTPA) for MPI programs is proposed. Its core idea is that the surviving tasks of a parallel application recompute, in parallel, the work of a failed task when a failure occurs. The thesis discusses the design rationale of FTPA and the key problems in implementing it, proposes an inter-process definition-use analysis method and related rules to guide FTPA design, and illustrates the implementation details of FTPA in different parallel applications through concrete examples. Experiments on a large-scale parallel computing system show that FTPA has low runtime overhead and scales well. Combining FTPA with checkpointing is an effective technical approach to fault tolerance for large-scale parallel applications.
With the progress of science and technology, more and more disciplines are adopting scientific computing and numerical simulation for problem solving. These applications often demand large amounts of computation, storage, and communication; large-scale parallel computing systems have therefore become the mainstream architecture of current high performance computing systems.
     As parallel computing systems expand in scale, application scalability becomes harder to achieve and system reliability declines; the mean time between failures of some very large parallel computing systems can be as short as a few hours. Under such conditions, many large-scale parallel applications cannot run efficiently and complete successfully without a high-performance, fault-tolerant parallel software development and runtime environment, which severely limits the availability of systems and applications.
     Message passing is the mainstream programming model for developing parallel applications, and MPI, with its flexible support for parallel algorithm implementation, high performance, and good portability, is the de facto standard message-passing API. Focusing on improving the availability of large-scale parallel computing systems and applications, this thesis studies key problems in implementing a high-availability MPI parallel programming environment, including performance, scalability, and fault tolerance. In addition, to enable more effective fault tolerance on future very large scale parallel computing systems, we also study methods and rules for designing efficient fault-tolerant parallel algorithms in MPI programs. The main contributions of this thesis are summarized as follows:
     1) Targeting large-scale parallel computing systems, we propose the architecture of a new communication hardware interface, CNI, and implement the Communication Express (CMEX) software interface on top of it. CMEX provides protected, concurrent, fully user-level communication operations, supports zero-copy data transfer between processes, and achieves good scalability with its connectionless packet-transfer and RDMA communication mechanisms. We also put forward basic methods for verifying the CMEX communication software interface by means of static program analysis and model checking, to ensure the correctness and reliability of the CMEX implementation.
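     To make the flavor of such an interface concrete, the following C declarations sketch a hypothetical user-level API with the properties described above (connectionless addressing, registered memory for zero-copy transfer, packet and RDMA operations). All names and signatures here are illustrative assumptions and not the actual CMEX interface.

        /* Hypothetical sketch only -- not the real CMEX API. */
        #include <stddef.h>
        #include <stdint.h>

        typedef struct cmex_endpoint cmex_endpoint_t;              /* user-level, per-process endpoint */
        typedef struct { uint64_t addr; uint32_t key; } cmex_mr_t; /* registered (pinned) memory region */

        /* Open a protected, concurrent endpoint; no connection setup is required. */
        int cmex_open(int local_rank, cmex_endpoint_t **ep);

        /* Register a buffer so the interface can move it without intermediate copies. */
        int cmex_reg_mem(cmex_endpoint_t *ep, void *buf, size_t len, cmex_mr_t *mr);

        /* Connectionless short-packet transfer, addressed by destination rank. */
        int cmex_send_packet(cmex_endpoint_t *ep, int dst_rank, const void *buf, size_t len);

        /* One-sided RDMA write from a local registered region into a remote one. */
        int cmex_rdma_write(cmex_endpoint_t *ep, int dst_rank,
                            const cmex_mr_t *local, const cmex_mr_t *remote, size_t len);

        /* Poll for completions entirely at user level, without kernel involvement. */
        int cmex_poll(cmex_endpoint_t *ep);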
     2) Based on the CMEX software interface and the MPICH2 system, we study technical approaches to implementing MPICH2-CMEX, a high-performance and scalable MPI parallel programming environment, over the RDMA communication mechanism. For performance, we design and implement efficient message data transfer using RDMA read and write operations. For scalability, we first propose a dynamic feedback credit flow-control algorithm, which uses communication resources more effectively by dynamically enlarging the resources between tasks that communicate frequently; second, we propose a hybrid channel data-transfer method that exploits the nearest-neighbor communication pattern of many parallel applications, so that as applications scale up, the internal communication and memory resource usage of the MPI system stays under control while runtime performance is preserved. With MPICH2-CMEX we obtain good speedups in tests of many parallel applications.
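     As a rough, self-contained C illustration of credit-based flow control with dynamic feedback, the sketch below simulates both ends of one task pair: the sender consumes a credit per eager message, and the receiver, when returning credits, widens the credit window for a peer that has been communicating frequently. The initial window, upper bound, and "frequent communication" threshold are assumed values for illustration, not the parameters of the actual algorithm.

        #include <stdio.h>

        #define INIT_CREDITS    8   /* initial per-peer window (assumed) */
        #define MAX_CREDITS    64   /* cap on a dynamically widened window (assumed) */
        #define BUSY_THRESHOLD  4   /* heuristic for "frequent communication" (assumed) */

        typedef struct {
            int credits;    /* eager sends the sender may still issue */
            int limit;      /* current credit window for this peer */
            int msgs_seen;  /* messages absorbed since the last credit return */
        } peer_t;

        /* Sender side: one credit per eager message; stop (or switch to a
         * rendezvous path) when the window is exhausted. */
        static int can_send(peer_t *p) {
            if (p->credits == 0) return 0;
            p->credits--;
            return 1;
        }

        /* Receiver side: return credits for recycled buffers and, if this peer
         * has been busy, feed back a larger window (the dynamic feedback step). */
        static int credits_to_return(peer_t *p) {
            int grant = p->msgs_seen;
            if (p->msgs_seen >= BUSY_THRESHOLD && p->limit < MAX_CREDITS)
                p->limit *= 2;
            if (grant > p->limit) grant = p->limit;
            p->msgs_seen = 0;
            return grant;
        }

        int main(void) {
            peer_t p = { INIT_CREDITS, INIT_CREDITS, 0 };
            for (int i = 0; i < 40; i++) {
                if (can_send(&p))
                    p.msgs_seen++;                      /* message delivered and buffered */
                else
                    p.credits += credits_to_return(&p); /* credits (and maybe a wider window) return */
            }
            printf("final credit window: %d\n", p.limit);
            return 0;
        }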
     3) To improve the fault tolerance of MPI applications, we design and implement a fully user-transparent, system-level parallel checkpointing system in MPICH2-CMEX, which uses a blocking coordinated checkpointing protocol. The coordination protocol and checkpoint image storage are the two major sources of checkpointing overhead. Our system exploits the nearest-neighbor communication pattern of many parallel applications and uses a virtual-connection technique to reduce the number of internal messages exchanged in the coordination stage, thereby reducing protocol-processing latency. A global parallel file system is used to store checkpoint images, which simplifies the management of image files and enables parallel I/O in the storage stage. Experiments with several parallel applications show that the checkpointing system has low runtime overhead and scales well, providing good support for the fault-tolerant execution of long-running parallel applications.
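     A minimal sketch of the blocking coordinated pattern, written against the standard MPI C API, is given below. The shared image path (/global/ckpt, assumed) and the toy application state stand in for the transparent process-image capture and parallel I/O of the real system; only the coordinate-then-store-then-commit structure is the point here.

        #include <mpi.h>
        #include <stdio.h>
        #include <string.h>

        /* Toy application state standing in for the real process image. */
        typedef struct { double data[1024]; int iter; } app_state_t;

        /* Blocking coordinated checkpoint: all ranks stop sending, drain in-flight
         * traffic, write their state side by side into a shared directory, and only
         * then resume. */
        static void coordinated_checkpoint(const app_state_t *st, int epoch) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* 1. Coordination: every rank reaches the checkpoint and no
             *    application messages remain in flight (blocking protocol). */
            MPI_Barrier(MPI_COMM_WORLD);

            /* 2. Storage: each rank writes its image into a globally shared
             *    directory on a parallel file system (path is an assumption). */
            char path[256];
            snprintf(path, sizeof(path), "/global/ckpt/epoch%d.rank%d.img", epoch, rank);
            FILE *f = fopen(path, "wb");
            if (f) { fwrite(st, sizeof(*st), 1, f); fclose(f); }

            /* 3. Commit: the checkpoint is usable only after every rank has finished. */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            app_state_t st; memset(&st, 0, sizeof(st));
            for (st.iter = 0; st.iter < 100; st.iter++) {
                /* ... computation and nearest-neighbor communication ... */
                if (st.iter % 25 == 0) coordinated_checkpoint(&st, st.iter / 25);
            }
            MPI_Finalize();
            return 0;
        }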
     4) Targeting the fault tolerance of parallel applications on future very large scale parallel computing systems, we propose a methodology for designing a new kind of fault-tolerant parallel algorithm (FTPA) for MPI programs. FTPA achieves fault tolerance by having the surviving tasks recompute, in parallel, the work of a failed task. We discuss the core idea and the key problems in implementing FTPA, propose methods and rules for inter-process definition-use analysis to guide the design, and describe implementation details of FTPA in two parallel programs. Experiments on a parallel computing system show that FTPA has low runtime overhead and scales well. Used in combination with a checkpointing system, FTPA will be an effective technical approach to fault tolerance for parallel applications.
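     The hedged MPI sketch below illustrates only the core idea of parallel recomputation: the surviving ranks split a failed rank's block of loop iterations among themselves, redo that work, and fold the recovered contribution back into the global result. The simulated failure, the block decomposition, and the reductions are simplifying assumptions for illustration, not the FTPA design or its definition-use analysis.

        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000L  /* total loop iterations, block-distributed over ranks (assumed) */

        /* Work done for one iteration of the original data-parallel loop. */
        static double work(long i) { return (double)i * 0.5; }

        /* Each surviving rank recomputes an equal share of the failed rank's block,
         * so the lost contribution is reconstructed without rolling everyone back. */
        static double recompute_failed_block(int failed, int me, int survivors, int nranks) {
            long block = N / nranks;                 /* remainder ignored for brevity */
            long lo = failed * block, hi = lo + block;
            long share = (hi - lo) / survivors;
            long my_lo = lo + me * share;
            long my_hi = (me == survivors - 1) ? hi : my_lo + share;
            double partial = 0.0;
            for (long i = my_lo; i < my_hi; i++) partial += work(i);
            return partial;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nranks;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            long block = N / nranks;
            double local = 0.0;
            for (long i = rank * block; i < (rank + 1) * block; i++) local += work(i);

            int failed = nranks - 1;            /* simulate a failure of the last rank */
            double recovered = 0.0;
            if (rank != failed)                 /* survivors recompute the lost block in parallel */
                recovered = recompute_failed_block(failed, rank, nranks - 1, nranks);
            else
                local = 0.0;                    /* its original contribution is lost */

            double sum = 0.0, rec_sum = 0.0;
            MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            MPI_Allreduce(&recovered, &rec_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            if (rank == 0) printf("result = %f\n", sum + rec_sum);

            MPI_Finalize();
            return 0;
        }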
