Optimizing and Scaling HPCG on Tianhe-2: Early Experience

Authors: Xianyi Zhang (24) (26)
Chao Yang (24) (25)
Fangfang Liu (24)
Yiqun Liu (24) (26)
Yutong Lu (27)
Journal: Lecture Notes in Computer Science
Publication year: 2014
Volume: 8630
Issue: 1
Pages: 28-41
Full-text size: 587 KB
Author affiliations:

24. Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
25. State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing, 100190, China
26. University of Chinese Academy of Sciences, Beijing, 100049, China
27. National University of Defense Technology, Changsha, Hunan, 410073, China
ISSN: 1611-3349

Abstract

This paper reports a first attempt at optimizing and scaling HPCG on the world’s largest supercomputer, Tianhe-2. This early work focuses on optimizing the CPU code, without using the Intel Xeon Phi coprocessors. We reformulate the basic CG algorithm to minimize the cost of collective communication, and we employ several optimization techniques (SIMDization, loop unrolling, forward and backward sweep fusion, and OpenMP parallelization) to further improve the performance of kernels such as sparse matrix-vector multiplication, symmetric Gauss–Seidel relaxation, and the geometric multigrid V-cycle. We scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2 with nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, which is 6.38 times higher than the reference implementation.
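
For readers unfamiliar with the symmetric Gauss–Seidel kernel named in the abstract, the following is a minimal, unoptimized sketch of one sweep (a forward pass followed by a backward pass) over a matrix in CSR storage. It is not the authors' implementation and applies none of the paper's optimizations (no sweep fusion, SIMDization, unrolling, or OpenMP). The function names `symgs_sweep`, `spmv`, and `laplacian_1d_csr`, and the small 1-D Laplacian test matrix, are assumptions made purely for illustration.

```python
def symgs_sweep(row_ptr, col_idx, vals, b, x):
    """One symmetric Gauss-Seidel sweep on a CSR matrix, updating x in place.

    Illustrative only: a forward pass over rows 0..n-1, then a backward
    pass over rows n-1..0, each row using the most recent values of x.
    """
    n = len(b)

    def relax(i):
        diag = 0.0
        acc = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]          # remember the diagonal entry
            else:
                acc -= vals[k] * x[j]   # subtract off-diagonal contributions
        x[i] = acc / diag

    for i in range(n):                  # forward sweep
        relax(i)
    for i in range(n - 1, -1, -1):      # backward sweep
        relax(i)
    return x


def spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y


def laplacian_1d_csr(n):
    """Hypothetical test matrix: 1-D Laplacian (2 on the diagonal, -1 beside it)."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals
```

In HPCG this kind of sweep serves as the smoother inside the multigrid preconditioner. The sketch above applies it repeatedly as a stand-alone solver only to make its behaviour easy to check on a tiny problem.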
