Research on Key Techniques for System-Level MPI Communication Optimization on Multicore Systems
Abstract
The Message Passing Interface (MPI) has been the de facto standard for developing parallel programs in High Performance Computing (HPC) since the 1990s. In MPI-based parallel programs, MPI communication performance usually plays a key role in overall program performance, so optimizing MPI communication is of great importance.
     In recent years, against the background of rapidly developing multicore technology, MPI communication urgently needs to be optimized for the characteristics of multicore systems. However, existing optimization work has largely remained at the level of process-based MPI communication techniques, which generally suffer from high processing overhead and heavy memory-access demands, limiting further improvement of communication performance. Addressing the characteristics of multicore systems and the shortcomings of existing methods, this thesis starts from thread-based MPI communication techniques, systematically studies the key techniques for MPI communication optimization on multicore systems, and explores a more efficient message-passing interface for shared-memory systems. The main contributions are as follows:
     1. For multicore systems, an efficient threaded-MPI runtime technique, the MPI Communication Accelerator (MPIActor), is proposed. Through its specially designed interface-aggregation technique, MPIActor builds a threaded-MPI runtime environment on top of a traditional process-based MPI runtime. Compared with developing a conventional MPI implementation, building threaded-MPI support with MPIActor requires far less development effort; moreover, MPIActor is more flexible and can work horizontally with any traditional process-based MPI implementation conforming to the MPI-2 standard. Experiments with the OSU_LATENCY benchmark on a dual-socket Nehalem-EP system show that, for message sizes from 8 KB to 2 MB, MVAPICH2 1.4 with MPIActor improves intra-socket communication performance by more than 37% (up to 114%) and inter-socket communication performance by more than 30% (up to 144%); for Open MPI 1.5 with MPIActor, intra-socket performance improves by more than 48% (up to 106%) and inter-socket performance by more than 46% (up to 98%).
     2. For collective communication optimization on multicore systems, a new hierarchical collective communication algorithm framework (MPIActor Hierarchical Collective Algorithm Framework, MAHCAF) and a group of efficient threaded-MPI-based intra-node collective algorithms are proposed on top of MPIActor. MAHCAF designs hierarchical collective algorithms with the template method pattern, treating the intra-node and inter-node collective phases as extensible steps of the template and organizing them in a pipelined fashion, which fully exploits the concurrency between sub-collectives. The intra-node collective algorithms built on threaded MPI take full advantage of shared memory and incur lower processing cost and memory-access demand than traditional process-based MPI collective algorithms. IMB experiments on a Nehalem cluster show that, compared with MVAPICH2 1.6, MAHCAF with the general intra-node collective algorithms delivers significant performance gains for broadcast, allgather, reduce, and allreduce under most conditions; furthermore, adding the hierarchical segmented reduce algorithm (HSRA), designed specifically for the Nehalem architecture, to MAHCAF further improves reduce and allreduce performance.
     3. To address the problem that unbalanced process arrival degrades broadcast performance, a Competitive and Pipelined (CP) method is proposed based on the specific structure of MPIActor to improve broadcast performance under unbalanced process-arrival patterns. Exploiting the fact that multiple processes run within each node of a multicore/multiprocessor system, the method lets the earliest-arriving process in a node act as the leader that performs the inter-node communication, so the inter-node collective phase can start as early as possible and the average waiting time of the broadcast is reduced. Micro-benchmark experiments show that broadcasts optimized with the CP method significantly outperform traditional algorithms, and performance tests on two real applications also show that the CP method noticeably improves broadcast performance.
     4. For intra-node MPI communication optimization on multicore/multiprocessor systems, an efficient Shared-Memory Message Passing Interface (SMPI) is proposed on top of MPIActor. Unlike traditional MPI, this interface lets MPI processes running on the same node read each other's message data directly by passing message addresses, instead of copying the message data into the receiving process, which greatly reduces memory-access overhead. Experiments show that, for a 4000-order matrix multiplication with 64 MPI processes on 8 nodes, a Cannon matrix-multiplication algorithm built on this interface achieves a speedup of about 1.14 over the MPI-based version.
Since the 1990s, MPI (Message Passing Interface) has been the de facto standard programming model in the High Performance Computing (HPC) domain. The performance of MPI communication usually plays a key role in the overall performance of MPI-based programs, so optimizing MPI communication is extremely important.
     Recently, with the rapid development of multicore technology, MPI communication on multicore systems is expected to be optimized by exploiting the characteristics of multicore architectures. However, existing optimization techniques remain at the level of process-based MPI communication, which often incurs large processing overhead and heavy memory accesses, so current methods are limited in further improving communication performance. To address these issues, this thesis concentrates on key optimization techniques for MPI communication on multicore systems from the perspective of threaded-MPI-based communication. The main contributions are as follows:
     (1) An efficient threaded-MPI runtime technique, the MPI Communication Accelerator (MPIActor), is proposed for multicore systems. Compared with developing a full threaded-MPI implementation in the traditional way, building threaded-MPI support with MPIActor requires much less development effort and is more flexible in use. Moreover, MPIActor can work with any traditional process-based MPI implementation that conforms to the MPI-2 standard, and it inherits the inter-node communication performance of the underlying MPI implementation. Experimental results of the OSU_LATENCY benchmark on a dual-socket Nehalem-EP system show that, for 8 KB to 2 MB messages, intra-socket communication performance is improved by 37% to 114% and inter-socket performance by 30% to 144% for MVAPICH2 1.4 with MPIActor compared with pure MVAPICH2 1.4. The experiments also show that intra-socket performance is improved by 48% to 106% and inter-socket performance by 46% to 98% for Open MPI 1.5 with MPIActor compared with pure Open MPI 1.5.
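     As an illustration only (the abstract does not give MPIActor's internals), the following minimal C sketch shows one plausible way such an accelerator could be layered over an existing MPI library: intercepting point-to-point calls through the standard MPI profiling interface (PMPI) and diverting intra-node messages to a thread-level shared-memory path. The helpers on_same_node() and shm_channel_send() are hypothetical placeholders, and the MPI-2 (non-const) send-buffer prototype of the libraries named above is assumed.

    /* Hedged sketch, not MPIActor's actual code. */
    #include <mpi.h>

    /* Placeholder: a real accelerator would build a rank-to-node map at
     * initialization time; this stub always answers "different node", so the
     * wrapper below simply delegates. */
    static int on_same_node(int dest, MPI_Comm comm) {
        (void)dest; (void)comm;
        return 0;
    }

    /* Placeholder for the thread-level shared-memory send path. */
    static int shm_channel_send(void *buf, int count, MPI_Datatype type,
                                int dest, int tag) {
        (void)buf; (void)count; (void)type; (void)dest; (void)tag;
        return MPI_SUCCESS;
    }

    /* Interposed MPI_Send: intra-node messages would take the shared-memory
     * path; everything else falls through to the underlying process-based MPI. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        if (on_same_node(dest, comm))
            return shm_channel_send(buf, count, datatype, dest, tag);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

     Interposition of this kind is also what would let an accelerator reuse the underlying library's inter-node path unchanged, consistent with the inherited inter-node performance described above.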
     (2) A novel hierarchical collective communication algorithm framework (MAHCAF) and a group of efficient threaded-MPI-based intra-node collective algorithms are proposed. MAHCAF constructs hierarchical collective algorithms with the template method design pattern: the intra-node and inter-node collective sub-phases (IntraCP and InterCP) are its extensible steps, organized in a pipelined fashion. IntraCP can be implemented either by general algorithms that are independent of the multicore architecture or by architecture-specific algorithms provided through multicore architecture drivers. Intel MPI Benchmarks results show that, compared with MVAPICH2 1.6, MAHCAF with the general intra-node algorithms remarkably improves the performance of MPI_Bcast, MPI_Allgather, MPI_Reduce, and MPI_Allreduce. In addition, an intra-node reduce algorithm designed for the Nehalem architecture, the hierarchical segmented reduce algorithm (HSRA), further improves the performance of MPI_Reduce and MPI_Allreduce.
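     The hierarchical shape that MAHCAF's template composes can be pictured with a broadcast. The sketch below is the unpipelined textbook form, assuming the broadcast root is global rank 0 and an MPI-3 library for MPI_Comm_split_type (used only to keep the example short); it is not MAHCAF's implementation, which additionally pipelines the two phases and plugs in threaded intra-node algorithms.

    /* Hedged sketch of a two-level (inter-node + intra-node) broadcast. */
    #include <mpi.h>

    void hier_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* ranks sharing a node form node_comm; original rank order is kept */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                            &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* one leader per node joins leader_comm; others get MPI_COMM_NULL */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0,
                       &leader_comm);

        if (node_rank == 0)                          /* inter-node step (InterCP) */
            MPI_Bcast(buf, count, type, 0, leader_comm);
        MPI_Bcast(buf, count, type, 0, node_comm);   /* intra-node step (IntraCP) */

        if (leader_comm != MPI_COMM_NULL)
            MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }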
     (3) To reduce the negative performance impact of unbalanced process arrival (UPA) patterns on MPI broadcast, a Competitive and Pipelined (CP) method based on MPIActor is proposed. The CP method takes the first process to arrive within a node as the leader that performs the inter-node collective communication, exploiting the fact that multiple processes run within each node of a multicore system. In this way the inter-node collective phase can start as early as possible and the waiting cost is reduced. Micro-benchmark results show that broadcast algorithms enhanced by the CP method significantly outperform traditional algorithms, and experiments with two real-world applications also demonstrate that the CP method greatly improves broadcast performance in realistic scenarios.
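     The competitive step can be pictured as a race decided by an atomic compare-and-swap on a flag visible to every rank of a node (under MPIActor's threaded MPI, those ranks share one address space). The C11 sketch below is illustrative only; the function names and the leader/follower bodies are placeholders, not the thesis's implementation.

    /* Hedged sketch of electing the first-arriving rank as node leader. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int node_leader = -1;   /* reset to -1 before each collective */

    static bool try_become_leader(int my_rank)
    {
        int expected = -1;
        /* exactly one arriving rank wins the exchange and becomes the leader */
        return atomic_compare_exchange_strong(&node_leader, &expected, my_rank);
    }

    void cp_bcast_enter(int my_rank)
    {
        if (try_become_leader(my_rank)) {
            /* leader: start the inter-node pipeline immediately, instead of
               waiting for the node's statically designated rank 0 to arrive */
        } else {
            /* follower: wait for the leader to publish the data intra-node */
        }
    }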
     (4) An efficient Shared-Memory Message Passing Interface (SMPI) built on threaded MPI is proposed for optimizing intra-node communication on multicore systems. Instead of copying the message from the source process to the destination process, SMPI lets MPI processes on the same node communicate by directly accessing the buffer of a posted message. In particular, SMPI can be implemented efficiently by reusing the existing infrastructure of MPIActor. For a 4000-order square matrix multiplication computed by 64 processes on 8 nodes, the SMPI-based Cannon matrix-multiplication algorithm achieves a speedup of about 1.14 over the MPI-based algorithm.
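     The address-passing idea behind SMPI can be sketched with a single-slot mailbox: a send publishes the buffer address, the receiver reads the data in place, and a release hands the buffer back to the sender. All names below are hypothetical, and a real interface would also handle message matching, datatypes, and multiple outstanding messages.

    /* Hedged sketch of zero-copy, address-passing message exchange between
     * ranks that share one address space. */
    #include <stddef.h>
    #include <stdatomic.h>

    typedef struct {                  /* one slot per (sender, receiver) pair */
        const void *_Atomic addr;     /* published message address, NULL = empty */
        size_t      len;
    } smpi_slot;

    void smpi_send(smpi_slot *slot, const void *buf, size_t len)
    {
        slot->len = len;
        atomic_store(&slot->addr, buf);  /* publish the address, no data copy */
        while (atomic_load(&slot->addr) != NULL) {
            /* rendezvous: spin until the receiver releases the buffer */
        }
    }

    const void *smpi_recv(smpi_slot *slot, size_t *len)
    {
        const void *p;
        while ((p = atomic_load(&slot->addr)) == NULL) {
            /* spin until a message address is published */
        }
        *len = slot->len;
        return p;                     /* caller reads the sender's buffer in place */
    }

    void smpi_release(smpi_slot *slot)
    {
        atomic_store(&slot->addr, NULL);  /* tell the sender the buffer is free */
    }

     The rendezvous in smpi_send keeps the example simple; it is one possible way to guarantee that the sender's buffer stays valid while the receiver reads it directly.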