Improved MPI collectives for MPI processes in shared address spaces
  • Authors: Shigang Li (1)
    Torsten Hoefler (2)
    Chungjin Hu (1)
    Marc Snir (3)
  • Keywords: MPI; Multithreading; MPI_Allreduce; Collective communication; NUMA
  • Journal: Cluster Computing
  • Year: 2014
  • Publication date: December 2014
  • Volume: 17
  • Issue: 4
  • Pages: 1139-1155
  • Full-text size: 2,570 KB
  • Author affiliations:
    1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
    2. Department of Computer Science, ETH Zurich, Zurich, Switzerland
    3. Department of Computer Science, University of Illinois at Urbana-Champaign and Argonne National Laboratory, Champaign, IL, USA
  • ISSN: 1573-7543
Abstract
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short and for long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
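
The paper's approach places MPI ranks in a shared address space; as a rough, self-contained illustration of the general idea of an intranode shared-memory reduction, the sketch below instead uses MPI-3 shared-memory windows (MPI_Win_allocate_shared). This mechanism, the function name shm_allreduce_sum, and the flat (non-NUMA-aware) reduction loop are our own illustrative assumptions, not the implementation described in the article.

/* Illustrative sketch only: an intranode sum-allreduce over an MPI-3
 * shared-memory window. NOT the paper's implementation; names are
 * hypothetical. */
#include <mpi.h>
#include <string.h>

static void shm_allreduce_sum(const double *sendbuf, double *recvbuf,
                              int n, MPI_Comm nodecomm)
{
    int nranks;
    MPI_Comm_size(nodecomm, &nranks);

    /* Every rank contributes an n-element slice to one shared window. */
    MPI_Win win;
    double *mine;
    MPI_Win_allocate_shared((MPI_Aint)(n * sizeof(double)), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);      /* passive-target epoch */
    memcpy(mine, sendbuf, (size_t)n * sizeof(double));
    MPI_Win_sync(win);                            /* publish local slice   */
    MPI_Barrier(nodecomm);                        /* wait for all writers  */
    MPI_Win_sync(win);                            /* then read peer slices */

    /* Naive reduction: each rank sums all contributions itself. A
     * NUMA-aware variant, as studied in the paper, would partition
     * this work across ranks and respect memory locality. */
    for (int i = 0; i < n; i++) recvbuf[i] = 0.0;
    for (int r = 0; r < nranks; r++) {
        MPI_Aint seg_size;
        int disp_unit;
        double *peer;
        MPI_Win_shared_query(win, r, &seg_size, &disp_unit, &peer);
        for (int i = 0; i < n; i++) recvbuf[i] += peer[i];
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);   /* collective; synchronizes before freeing */
}

A node-local communicator for nodecomm can be obtained with MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm); for a cluster-wide allreduce, the intranode result would then be combined across nodes, for example by an internode allreduce among one leader rank per node.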
