Parallel Optimization of Tend_lin on "Sunway TaihuLight"
  • English title: Parallelization and optimization of Tend_lin on Sunway TaihuLight system
  • Authors: FU You; WANG Tan; GUO Qiang; GAO Xiran
  • Affiliations: College of Computer Science and Engineering, Shandong University of Science and Technology; Shandong Computer Science Center (National Supercomputer Center in Jinan); State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
  • Keywords: Sunway TaihuLight; Tend_lin; Sunway OpenACC; many-core parallelism; optimization
  • Journal code: SDKY
  • Journal: Journal of Shandong University of Science and Technology (Natural Science)
  • Publication date: 2019-04-03 10:17
  • Publisher: Journal of Shandong University of Science and Technology (Natural Science)
  • Year: 2019
  • Issue: v.38; No.181
  • Language: Chinese
  • Pages: SDKY201902011
  • Page count: 10
  • CN: 02
  • ISSN: 37-1357/N
  • Classification number: 95-104
Abstract
The atmospheric general circulation model (AGCM) is the most complex component of the Chinese Academy of Sciences Earth System Model (CAS-ESM), and parallelizing it on today's mainstream many-core heterogeneous platforms is an active research topic in high-performance computing (HPC). This paper targets Tend_lin, the adaptation process of the dynamical core and a hotspot routine of AGCM 4.0, and parallelizes it on the Sunway TaihuLight HPC platform using the Sunway OpenACC programming model. Performance is improved through loop distribution, loop tiling, the expression of data transfers, and the offloading of function calls to the slave cores. The data-transfer expressions suited to different scenarios are discussed in detail, and the effect of different tile sizes on program performance is measured. Compared with serial execution on the master core, multithreaded execution of Tend_lin on a single core group achieves a speedup of more than 6x at both test scales. As the application resolution increases, the many-core processor is used more effectively: at scale C, the multi-process version achieves a 69x whole-application speedup.
