Research and Application of the MPI+OpenMP Hybrid Parallel Programming Model for SMP Clusters
Abstract
As advances in integration technology allow more and more CPUs to access a single memory space, shared-memory (SMP) architectures have come to play a dominant role in parallel computing. In addition, manufacturers are increasingly clustering SMP systems to build parallel machines that go far beyond the capabilities of a single node; most of the top ten supercomputers on the Top500 list are SMP clusters. As clustered SMPs spread, it becomes ever more important for applications to run portably and efficiently on them, yet research on parallel programming models suited to SMP clusters has lagged behind: finding a model that fits a given platform and application usually requires extensive experimentation, repeated comparison, analysis, and revision. To address this pressing problem, this thesis carries out the following work:
     (1) The thesis gives a detailed description of the hybrid MPI+OpenMP programming model for SMP clusters. The hybrid model closely matches the architecture of an SMP cluster and combines the strengths of the message-passing and shared-memory models, so it can achieve better performance. On this basis, the thesis focuses on the implementation mechanism of the hybrid model, the choice of parallelization granularity, loop selection, control of the number of threads, optimization measures, and the model's advantages over the pure MPI model. It concludes that, under certain conditions, the hybrid programming model is the best choice for programming SMP clusters.
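     As an illustration of the loop-level hybrid pattern discussed in (1), the sketch below distributes a dot product across MPI processes (one per SMP node) and parallelizes each process's local loop with OpenMP threads. It is a minimal, assumed example rather than code from the thesis: the kernel, the local problem size n_local, and the choice of the MPI_THREAD_FUNNELED threading level are all illustrative.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        /* FUNNELED: only the master thread of each process makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_local = 1000000;             /* assumed local block size */
        double *a = malloc(n_local * sizeof *a);
        double *b = malloc(n_local * sizeof *b);
        double local = 0.0, global = 0.0;

        for (int i = 0; i < n_local; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Fine-grained, loop-level OpenMP parallelism inside each MPI process. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n_local; i++)
            local += a[i] * b[i];

        /* Coarse-grained MPI communication between the SMP nodes. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global dot product = %f (%d processes)\n", global, size);

        free(a);
        free(b);
        MPI_Finalize();
        return 0;
    }

     Such a program is typically built with the MPI compiler wrapper plus the OpenMP flag (for example, mpicc -fopenmp) and launched with one MPI process per node, with OMP_NUM_THREADS set to the number of cores available in that node.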
     (2) Performance evaluation and modeling are crucial steps in enabling the optimization of parallel programs. A hybrid programming model such as MPI+OpenMP requires analysis to determine both its performance efficiency and the most suitable numbers of processes and threads for execution on a given hardware platform. To study these problems, we propose a performance model that is based on a small number of parameters yet still captures much of the complexity of the runtime system. We combine two techniques: static analysis, driven by the OpenUH compiler, to obtain application signatures, and parallel-overhead measurement benchmarks, realized with Sphinx and PerfSuite, to collect system profiles. Finally, we propose a performance evaluation measure for identifying communication and computation efficiency, describe the underlying framework and performance model, and show how the tool can be applied to a sample code.
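     To make the distinction between communication efficiency and computation efficiency concrete, the fragment below times an OpenMP computation phase and an MPI communication phase separately and reports each rank's computation fraction. This is only a hand-written timing sketch with assumed names (the kernel, the workload size N, and the "computation fraction" metric are illustrative); the model proposed in the thesis is instead built from OpenUH application signatures and Sphinx/PerfSuite overhead benchmarks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1 << 22 };                    /* assumed workload size */
        static double x[N];
        double sum = 0.0, total = 0.0;

        /* Computation phase: an OpenMP-parallel kernel, timed on its own. */
        double t0 = MPI_Wtime();
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            x[i] = (double)i / N;
            sum += x[i] * x[i];
        }
        double t_comp = MPI_Wtime() - t0;

        /* Communication phase: a global MPI reduction, timed separately. */
        t0 = MPI_Wtime();
        MPI_Allreduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t_comm = MPI_Wtime() - t0;

        /* Here "computation efficiency" is simply the computing share of time. */
        printf("rank %d: comp %.6f s, comm %.6f s, comp fraction %.3f\n",
               rank, t_comp, t_comm, t_comp / (t_comp + t_comm));

        MPI_Finalize();
        return 0;
    }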
     (3) Through a concrete application case, the thesis describes in detail the design process of a hybrid parallel program and the method used for experimental analysis. The analysis of the results demonstrates the effectiveness of the work described above, and the results are satisfactory.
     The emergence of multi-core architectures has led directly to multi-core SMP clusters: a multi-core CPU can itself be regarded as a small SMP, so replacing single-core CPUs with multi-core CPUs in SMP nodes produces the more complex multi-core SMP cluster. This emerging architecture has already become the most cost-effective solution both for enterprise servers and for large-scale scientific applications. The thesis closes with an outlook on programming models and optimization for (multi-core) SMP clusters.
