Optimizing and Scaling HPCG on Tianhe-2: Early Experience
  • Authors: Xianyi Zhang (24) (26)
    Chao Yang (24) (25)
    Fangfang Liu (24)
    Yiqun Liu (24) (26)
    Yutong Lu (27)
  • Publication title: Lecture Notes in Computer Science
  • Publication year: 2014
  • Volume: 8630
  • Issue: 1
  • Pages: 28-41
  • Full-text size: 587 KB
  • Affiliations:
    24. Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
    25. State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing, 100190, China
    26. University of Chinese Academy of Sciences, Beijing, 100049, China
    27. National University of Defense Technology, Changsha, Hunan, 410073, China
  • ISSN: 1611-3349
Abstract
This paper presents a first attempt at optimizing and scaling HPCG on the world’s largest supercomputer, Tianhe-2. This early work focuses on optimizing the CPU code without using the Intel Xeon Phi coprocessors. We reformulate the basic CG algorithm to minimize the cost of collective communication, and we employ several optimization techniques, such as SIMDization, loop unrolling, forward and backward sweep fusion, and OpenMP parallelization, to further enhance the performance of kernels such as the sparse matrix-vector multiplication, the symmetric Gauss–Seidel relaxation, and the geometric multigrid V-cycle. We successfully scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2, with nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, which is 6.38x higher than the reference implementation.
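
For readers unfamiliar with the symmetric Gauss–Seidel relaxation named in the abstract, the following is a minimal illustrative sketch of one sweep (a forward pass followed by a backward pass) over a matrix stored in CSR form. It is not the authors' optimized kernel and does not show their sweep fusion, SIMDization, or OpenMP work; the function names `symgs_sweep` and `laplacian_1d_csr`, the CSR array names, and the small 1D test matrix are all assumptions chosen for illustration.

```python
def laplacian_1d_csr(n):
    """Build a 1D Laplacian (2 on the diagonal, -1 off it) in CSR form.

    Hypothetical test matrix for illustration only; HPCG itself uses a
    3D 27-point stencil.
    """
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def symgs_sweep(row_ptr, col_idx, vals, b, x):
    """Apply one symmetric Gauss-Seidel sweep to x in place.

    The forward pass visits rows 0..n-1 and the backward pass visits
    rows n-1..0; each row update uses the newest available values of x.
    Assumes every row has a nonzero diagonal entry.
    """
    n = len(b)
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            diag = 0.0
            acc = b[i]
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


if __name__ == "__main__":
    n = 4
    row_ptr, col_idx, vals = laplacian_1d_csr(n)
    b = [1.0, 0.0, 0.0, 1.0]  # equals A times the all-ones vector
    x = [0.0] * n
    for _ in range(200):
        symgs_sweep(row_ptr, col_idx, vals, b, x)
    print(x)  # approaches the all-ones vector
```

Because each row update depends on values just written by earlier rows, the two passes are inherently sequential in this form, which is why this kernel is a common target for the kind of restructuring the abstract describes.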
