Research on Efficient Parallel Algorithms for the GRAPES Limited-Area Tangent Linear/Adjoint Model
Abstract
Four-dimensional variational data assimilation (4D-Var) is one of the key technologies of numerical weather prediction. It incorporates the time evolution of observations from different regions and of different types into the initial field, which improves forecast quality, and it is currently regarded internationally as the most effective data assimilation scheme. However, its computation is very complex, the program uses a very large amount of memory, and run times are long. The 4D-Var system (GRAPES-4DVAR) of GRAPES (Global/Regional Assimilation and Prediction System), the new-generation numerical weather prediction system developed independently in China, shares these characteristics: heavy computation, high memory usage, and long run times. The focus of this work is how to improve the algorithms and code of the GRAPES limited-area model so as to raise its run-time efficiency and parallel scalability. The work improves efficiency and scalability by optimizing program code, improving the adjoint algorithm, and introducing hybrid parallelism, and it studies and implements effective methods for reducing run time. The main contributions are summarized as follows:
     (1) Tuned and optimized the code of the GRAPES limited-area model. Methods for improving utilization of the memory system and the efficiency of the processor's functional units were studied, the causes of pipeline stalls were analyzed, and the bottlenecks with a significant impact on performance were removed from the code. With these changes, the run-time efficiency of the nonlinear model improved by about 25%.
     (2) Proposed a new computation method for the adjoint model, the limit checkpoint storage technique, which is a limiting case between the checkpointing strategy and the store-all strategy. An increase of about 30% in memory buys a 100% improvement in run-time performance.
     (3) Proposed an in-memory data management technique that supports both first-in-first-out and first-in-last-out access to data blocks, and implemented it as a nested multi-chained stack, which meets the needs of the improved adjoint algorithm.
     (4) Addressed the limited scalability of parallel reads and writes to external storage in the GRAPES adjoint model. A method for managing a large volume of intermediate data in a limited amount of memory replaces the external-storage I/O that was limiting performance. By comparing the maximum number of iterations the adjoint model can run with the number actually required, the method that gives the best performance while meeting actual needs is determined for a fixed problem size and a fixed number of processors. When the number of processors exceeds 128, wall-clock time is reduced by up to 70%.
     (5) Implemented hybrid parallel computation for GRAPES. Targeting the cluster architectures in common use today, OpenMP thread-level parallelism within a node and MPI process-level parallelism between nodes replace the pure MPI parallelization of GRAPES. When the parallel efficiency of pure MPI drops below 90%, the hybrid approach raises parallel efficiency by about 5% to 10%. The advantages and disadvantages of statically partitioning data among threads are also analyzed.
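Item (2) sits between two standard ways of supplying forward states to an adjoint sweep. The sketch below illustrates only those two endpoints on a toy scalar model, not the limit checkpoint storage technique itself, whose details the abstract does not give. The model `step`, its adjoint, and the `interval` parameter are all assumptions made for illustration.

```python
import math

def step(x):
    """One step of a toy nonlinear model (hypothetical stand-in for a model time step)."""
    return x + 0.1 * math.sin(x)

def step_adjoint(x, adj):
    """Adjoint of `step`: scale by d(step)/dx evaluated at the forward state x."""
    return adj * (1.0 + 0.1 * math.cos(x))

def gradient_store_all(x0, n_steps):
    """Store-all: keep every forward state; no recomputation, memory grows with n_steps."""
    states = [x0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    adj = 1.0  # seed: d(x_final)/d(x_final)
    for k in reversed(range(n_steps)):
        adj = step_adjoint(states[k], adj)
    return adj  # d(x_final)/d(x0)

def gradient_checkpointed(x0, n_steps, interval):
    """Checkpointing: keep one state per `interval` steps, recompute the rest per segment."""
    checkpoints = {}
    x = x0
    for k in range(n_steps):
        if k % interval == 0:
            checkpoints[k] = x
        x = step(x)
    adj = 1.0
    # Walk the segments from last to first, as the adjoint runs backwards in time.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + interval, n_steps)
        segment = [checkpoints[start]]          # state at step `start`
        for _ in range(start + 1, end):
            segment.append(step(segment[-1]))   # recomputed states start+1 .. end-1
        for x_k in reversed(segment):
            adj = step_adjoint(x_k, adj)
    return adj
```

Both functions return the same gradient. Store-all holds `n_steps + 1` states, while the checkpointed version holds roughly `n_steps / interval` checkpoints plus one segment, at the cost of one extra forward pass over each segment.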
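Items (3) and (4) describe keeping intermediate data blocks in a limited amount of memory with both first-in-first-out and first-in-last-out retrieval. The abstract does not describe the internal layout of the nested multi-chained stack, so the following is only a minimal stand-in built on Python's `collections.deque`. The class name `BlockStore` and its methods are hypothetical.

```python
from collections import deque

class BlockStore:
    """Bounded in-memory store of data blocks (hypothetical illustration only)."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks   # capacity limit, standing in for limited memory
        self._blocks = deque()

    def push(self, block):
        """Append a block; refuse once the capacity limit is reached."""
        if len(self._blocks) >= self.max_blocks:
            raise MemoryError("block store is full")
        self._blocks.append(block)

    def pop_first_in(self):
        """First-in-first-out retrieval: return the oldest block."""
        return self._blocks.popleft()

    def pop_last_in(self):
        """First-in-last-out retrieval: return the newest block."""
        return self._blocks.pop()

    def __len__(self):
        return len(self._blocks)
```

For example, after pushing blocks `"a"`, `"b"`, `"c"`, `pop_first_in()` returns `"a"` and `pop_last_in()` returns `"c"`. A deque gives constant-time access at both ends, which is the property both retrieval orders rely on.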
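The percentages quoted in items (2), (4), and (5) can be read with the usual definitions of speedup and parallel efficiency. The helper below is a small sketch of that arithmetic. The function names are made up for illustration, and it assumes that "performance" means the reciprocal of run time, which the abstract does not state explicitly.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / p, where p is the number of processors."""
    return speedup(t_serial, t_parallel) / n_procs

def wall_clock_reduction(t_before, t_after):
    """Fractional reduction in wall-clock time, e.g. 0.7 for a 70% reduction."""
    return (t_before - t_after) / t_before

def run_time_ratio(performance_gain):
    """New run time over old run time, assuming performance = 1 / run time."""
    return 1.0 / (1.0 + performance_gain)
```

Under that assumption, a 100% performance improvement (`performance_gain = 1.0`) corresponds to halving the run time. A run that drops from 100 s to 30 s is a 70% wall-clock reduction, and a job taking 1280 s serially and 12.5 s on 128 processors has a parallel efficiency of 0.8. These timings are made-up numbers, not measurements from the work.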
