Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters


Authors: David Clarke (19)
Aleksandar Ilic (20)
Alexey Lastovetsky (19)
Leonel Sousa (20)
Keywords: parallel applications; heterogeneous platforms; GPU; data partitioning algorithms; functional performance models; matrix multiplication
Journal: Lecture Notes in Computer Science
Publication year: 2012
Volume: 7484
Issue: 1
Pages: 502-513
Affiliations:

19. School of Computer Science and Informatics, University College Dublin, Belfield, Dublin, 4, Ireland
20. INESC-ID, IST/Technical University of Lisbon, Rua Alves Redol, 9, 1000-029, Lisbon, Portugal

Abstract

Many modern high-performance clusters are hierarchically heterogeneous: computing nodes differ from one another, and individual nodes are themselves heterogeneous because they include specialized accelerators such as GPUs. To achieve high performance for scientific applications on these platforms, it is necessary to perform load balancing. In this paper we present a hierarchical matrix partitioning algorithm based on realistic performance models at each level of the hierarchy. To minimise the total execution time of the application, it iteratively partitions a matrix between nodes and partitions the resulting sub-matrices between the devices in each node. This is a self-adaptive algorithm that dynamically builds the performance models at run-time, and it employs an algorithm to minimise the total volume of communication. The algorithm allows scientific applications to perform load-balanced matrix operations with nested parallelism on hierarchical heterogeneous platforms. To show the effectiveness of the algorithm, we applied it to a fundamental operation in scientific parallel computing, matrix multiplication. Large-scale experiments on a heterogeneous multi-cluster site incorporating multicore CPUs and GPU nodes show that the presented algorithm outperforms current state-of-the-art approaches and successfully load balances very large problems.
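
The two-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes each device is described by a known speed function (workload to work units per second), whereas the paper builds its models dynamically at run-time, and it ignores communication volume entirely. The names `balance_real`, `partition`, `node_speed` and `hierarchical_partition`, the damped fixed-point iteration, and the example speeds are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence

# Assumed model: a speed function maps a workload size to a processing
# speed in work units per second (a stand-in for a functional performance model).
Speed = Callable[[float], float]

EPS = 1e-9  # avoids evaluating speed functions at a zero workload


def balance_real(speeds: Sequence[Speed], total: float, iters: int = 50) -> List[float]:
    """Split `total` so that every part takes roughly the same time.

    Uses a damped fixed-point iteration: each share is pulled towards
    total * speed_i / sum(speeds), with speeds evaluated at the current shares.
    """
    x = [total / len(speeds)] * len(speeds)
    for _ in range(iters):
        s = [f(max(xi, EPS)) for f, xi in zip(speeds, x)]
        target = [total * si / sum(s) for si in s]
        x = [0.5 * xi + 0.5 * ti for xi, ti in zip(x, target)]
    return x


def round_to_total(shares: Sequence[float], total: int) -> List[int]:
    """Round real shares to integers that sum exactly to `total`."""
    parts = [int(v) for v in shares]
    leftover = total - sum(parts)
    # Give the leftover units to the shares with the largest fractional parts.
    order = sorted(range(len(shares)), key=lambda i: shares[i] - parts[i], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts


def partition(speeds: Sequence[Speed], total: int) -> List[int]:
    """Integer load-balanced partition of `total` work units."""
    return round_to_total(balance_real(speeds, float(total)), total)


def node_speed(devices: Sequence[Speed]) -> Speed:
    """Aggregate speed of a node: balance its devices, then use the slowest finish time."""
    def speed(x: float) -> float:
        parts = balance_real(devices, x)
        t = max(p / f(max(p, EPS)) for p, f in zip(parts, devices))
        return x / t
    return speed


def hierarchical_partition(nodes: Sequence[Sequence[Speed]], total: int) -> List[List[int]]:
    """Partition between nodes first, then between the devices inside each node."""
    per_node = partition([node_speed(devs) for devs in nodes], total)
    return [partition(devs, n) for devs, n in zip(nodes, per_node)]


if __name__ == "__main__":
    # Hypothetical platform: node A has a CPU (speed 1) and a GPU (speed 3);
    # node B has a single CPU (speed 2). Constant speeds keep the example simple.
    node_a = [lambda x: 1.0, lambda x: 3.0]
    node_b = [lambda x: 2.0]
    print(hierarchical_partition([node_a, node_b], 600))
```

With constant speeds the result is simply proportional to the speeds at each level. The sketch becomes more interesting with speed functions that vary with workload, for example a GPU whose speed drops once the data no longer fits in device memory, although the simple iteration used here is not guaranteed to converge for arbitrary speed functions.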
