Parallel Optimization of Numerical Simulation Based on the Bond-Fluctuation Model
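Abstract
As the demand for large-scale computing keeps growing, parallel computing technology has developed continuously; every year TOP500 publishes its ranking of the world's 500 fastest high-performance computers. The current trend in mainstream high-performance computer architecture is to build cluster systems from shared-memory, multi-core blade processors, and the high-performance cluster at Shandong University uses this architecture. While parallel computers advance rapidly, practical parallel applications lag far behind the hardware, and the measured performance of applications is far below the computational peak. Making full use of the characteristics of parallel computers to improve the parallel performance of applications has therefore become a research topic in parallel computing. One example is optimizing the parallel application for polymer surface adsorption so that it runs in parallel on a cluster, with better parallel performance and a shorter computation time.
     The bond-fluctuation model is the classical model of motion for numerical simulation of polymer surface adsorption. Its computational cost is so large that simulation on a single PC takes unacceptably long. A parallel Monte Carlo sampling method implemented with MPI can compute different samples by increasing the number of PCs (or cores) and then reduce the data at the end; using 480 cores to compute 960 samples, a serial computation that would take 9 years is accelerated to 5 days. The finest granularity of this decomposition is one sample per task, but when the simulated polymer chain has a large molecular weight, even a single independent sample takes a long time. On top of the MPI parallelization, the parallel granularity can therefore be refined further by domain decomposition, and the loop iterations that result are easy to parallelize with OpenMP compiler directives. Whereas MPI relies on inter-process communication, OpenMP is based on multithreading and makes better use of the shared memory of a blade node. Applying OpenMP directly gets the application running in parallel on a blade node, but achieving high parallel efficiency requires further testing, analysis, and tuning to arrive at the most reasonable use of hardware resources.
     Building on the MPI parallel programming framework for polymer surface adsorption on a high-performance cluster, this work has two parts. First, it studies OpenMP programming and parallelizes the hotspot module of the application. Second, it studies OpenMP optimization techniques and designs a parallel optimization scheme for the polymer surface adsorption application. All OpenMP tuning of the application was carried out on a four-socket, eight-core blade, and the test results show that the optimization scheme effectively improves actual parallel performance. The software-engineering optimization approach used here can provide methods, experience, and a reference for future OpenMP tuning of cluster applications on a single node. Specifically, the main work is as follows: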
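     1. Presents performance tests and analysis of the MPI-parallelized polymer surface adsorption application on a high-performance cluster, and verifies that the results satisfy Gustafson's law;
     2. Where the parallel performance gain for long chains reaches a bottleneck, this work splits the long chain into segments and uses multiple threads to simulate the bond-fluctuation motion within each segment, in place of moving the long chain as a whole. The parallel interface MC_Bond_Fluc is implemented with OpenMP, and its performance is tested on a four-socket, eight-core blade;
     3. Designs an OpenMP optimization scheme for the bond-fluctuation-model simulation of polymer surface adsorption. Following a software optimization methodology, it optimizes incrementally by balancing load, reducing parallel overhead, using memory sensibly, and improving cache hit rates. Testing identifies the best-performing OpenMP variant, which is given as the optimal OpenMP parallel scheme for polymer surface adsorption.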
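Item 1 refers to Gustafson's law. As a reminder of what such a check involves, the sketch below states the law in its usual form, S(N) = N - s(N - 1), where N is the number of processors and s is the serial fraction of the parallel run time. The function names and the example value of s are illustrative assumptions and do not come from the thesis.

```python
def gustafson_speedup(n_procs: int, serial_fraction: float) -> float:
    """Scaled speedup predicted by Gustafson's law: S = N - s * (N - 1)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return n_procs - serial_fraction * (n_procs - 1)


def implied_serial_fraction(n_procs: int, measured_speedup: float) -> float:
    """Invert the law: the serial fraction implied by a measured scaled speedup."""
    if n_procs < 2:
        raise ValueError("need at least 2 processors to estimate s")
    return (n_procs - measured_speedup) / (n_procs - 1)


if __name__ == "__main__":
    # Hypothetical serial fraction, chosen only to show the shape of the curve.
    s = 0.05
    for n in (1, 60, 240, 480):
        print(n, gustafson_speedup(n, s))
```

One way to use this is to compute `implied_serial_fraction` from each measured run: if the values stay roughly constant as the core count grows, the measurements are consistent with the law.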
With the growing demand for large-scale computing, parallel computing has developed rapidly. Every year the TOP500 website publishes a ranking of the 500 fastest high-performance computers, ordered by Rmax. Although high-performance computer architectures have varied widely over the past decades, the trend is toward cluster systems made up of hundreds of nodes, each based on a multi-core shared-memory model. The HPC cluster at Shandong University has this kind of architecture. Parallel applications, however, have lagged significantly behind the pace of hardware development, and the measured performance of an application on an HPC system is commonly far below the theoretical peak. Taking full advantage of the computing characteristics of HPC systems to improve parallel application performance has therefore become a significant issue in parallel computing. Optimizing the parallel simulation of polymer adsorption on a surface, so that it runs efficiently on a cluster, is one instance of this problem.
     The bond-fluctuation model is the classical model for numerical simulation of polymer adsorption on a surface. Its huge computational cost makes it impossible for a stand-alone PC to complete a simulation in acceptable time. A parallel Monte Carlo sampling method implemented with MPI, however, can compute different samples on additional processors (cores) and then reduce the results. Using 480 cores to compute 960 samples, the parallel computation takes 5 days, whereas the sequential computation would take 9 years. The finest granularity this decomposition can reach is one sample per task, yet a single independent sample still takes a long time when the molecular weight of the polymer chain is large. The degree of parallelism can therefore be increased by dividing each Monte Carlo step (MCS) into smaller pieces. This domain decomposition is easy to implement with OpenMP by adding compiler directives before the loop structure. Whereas MPI uses processes as its communicating units, OpenMP is based on multithreading and makes better use of the shared-memory model on each blade node. An application programmed with OpenMP achieves basic parallel computing on a blade node, but further testing, analysis, and tuning are needed to reach high parallel efficiency and to find the right way to use the hardware resources.
     The parallel programming framework for simulating polymer surface adsorption on the cluster is based on MPI. The work in this article builds on that framework and has two main parts. First, it studies programming with OpenMP to parallelize the hotspot module of the application. Second, it studies OpenMP optimization methods, designs a performance optimization scheme for the polymer surface adsorption simulation programs, and implements it. All tests of the OpenMP programs were run on a blade node with four CPUs, each with eight cores that support Hyper-Threading Technology. The test results show that the optimization scheme effectively improves practical parallel performance. The software-engineering optimization method used in this article could provide a tuning method, experience, and a reference for optimizing applications on a single cluster node in the future. Specifically, the tasks of this article are as follows:
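The sample-level decomposition described above (independent samples spread over workers, followed by a reduction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: a thread pool stands in for MPI ranks, the final sum plays the role of a reduction such as `MPI_Reduce` with `MPI_SUM`, and `simulate_sample` is a toy random walk rather than the bond-fluctuation model. All function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

# Unit steps on a simple cubic lattice (toy model only).
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def simulate_sample(seed: int, n_steps: int = 200) -> float:
    """Toy stand-in for one independent sample: squared end-to-end
    distance of a random walk. Each sample has its own seeded generator."""
    rng = random.Random(seed)
    x = y = z = 0
    for _ in range(n_steps):
        dx, dy, dz = rng.choice(MOVES)
        x, y, z = x + dx, y + dy, z + dz
    return float(x * x + y * y + z * z)


def worker(rank: int, n_workers: int, n_samples: int) -> Tuple[float, int]:
    """Compute the samples assigned round-robin to one worker and
    return a partial (sum, count) pair ready for reduction."""
    total, count = 0.0, 0
    for sample_id in range(rank, n_samples, n_workers):
        total += simulate_sample(seed=sample_id)
        count += 1
    return total, count


def parallel_mean(n_samples: int, n_workers: int) -> float:
    """Run all workers, then reduce their partial results to one mean."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(
            pool.map(lambda r: worker(r, n_workers, n_samples), range(n_workers))
        )
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_mean(n_samples=96, n_workers=8))
```

Because every sample is seeded by its own index, the result does not depend on how many workers are used, which makes it easy to check a parallel run against a serial one. CPython threads will not speed up this CPU-bound loop; the sketch shows only the structure of the decomposition.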
     1. Present performance tests and analysis of the MPI programs for polymer surface adsorption on the cluster, and verify that the measured speedup follows Gustafson's law.
     2. The performance of the MPI programs reaches a bottleneck when simulating much larger polymers. This article divides the long chain into segments and uses multithreading to simulate the bond-fluctuation motion within each segment, instead of moving the long chain as a whole. The parallel interface MC_Bond_Fluc is implemented with OpenMP, and its performance is reported on the 4-CPU, 32-core blade node.
     3. Following a software optimization methodology, this article designs a performance optimization scheme for the simulation programs of polymer adsorption on a surface. The scheme balances load, reduces parallel overhead, uses memory sensibly, and improves cache hit rates, and the effect of each step is measured in a new version of the program. Finally, the work identifies the best-performing of these OpenMP optimizations and presents it as the recommended scheme for the OpenMP program that simulates polymer chain adsorption on a surface.
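Items 2 and 3 rely on dividing a long chain into segments and keeping the load on each thread balanced. The sketch below shows one simple way to do such a split, giving each thread a contiguous block of monomer indices whose sizes differ by at most one. It is an assumption made for illustration: the abstract does not say how the thesis chooses segment boundaries or how it treats bonds that cross them, and `split_chain` is a hypothetical name.

```python
from typing import List


def split_chain(n_monomers: int, n_threads: int) -> List[range]:
    """Split monomer indices 0..n_monomers-1 into n_threads contiguous
    segments whose lengths differ by at most one."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if n_monomers < 0:
        raise ValueError("n_monomers must be non-negative")
    base, extra = divmod(n_monomers, n_threads)
    segments = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional monomer each.
        length = base + (1 if t < extra else 0)
        segments.append(range(start, start + length))
        start += length
    return segments


if __name__ == "__main__":
    for thread_id, segment in enumerate(split_chain(n_monomers=10, n_threads=4)):
        print(thread_id, list(segment))
```

With equal-cost moves per monomer, segments of near-equal length give each thread a near-equal share of the work; if the cost per monomer varies, a different split would be needed.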
