CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters
Abstract
GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs. It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions along with one-sided communication semantics. However, current approaches and designs for OpenSHMEM on GPU clusters do not take advantage of GDR features, leaving potential performance improvements untapped. In this paper, we introduce “CUDA-Aware” concepts for OpenSHMEM that enable operations to be performed directly from/on buffers residing in GPU memory. We propose novel and efficient designs that ensure “truly one-sided” communication for different intra-/inter-node configurations while working around the hardware limitations. We achieve 2.5× and 7× improvement in point-to-point communication for intra-node and inter-node configurations, respectively. Our proposed framework achieves 2.2 μs for an intra-node 8-byte put operation from CPU to local GPU and 3.13 μs for an inter-node 8-byte put operation from GPU to remote GPU. The proposed designs lead to a 19% reduction in the execution time of the Stencil2D application kernel from the SHOC benchmark suite on the Wilkes system, which is composed of 64 dual-GPU nodes. Similarly, the evolution time of the GPULBM application is reduced by 45% on 64 GPUs. On a CS-Storm-based system with 8 GPUs per node, we show 50% and 23% improvement on 32 and 64 GPUs, respectively.
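To make the “CUDA-Aware” idea concrete, the following is a minimal sketch of an 8-byte OpenSHMEM put issued on a buffer resident in GPU memory. It assumes a CUDA-aware OpenSHMEM runtime that accepts CUDA device pointers as the local source of a put (the actual extension API and symmetric-heap handling proposed in the paper may differ); only standard OpenSHMEM 1.x calls and CUDA runtime calls are used.

/* Illustrative sketch: 8-byte put from a local GPU buffer to a remote PE.
 * Assumption: the OpenSHMEM library is CUDA-aware and can read gpu_src
 * directly (e.g., via GPUDirect RDMA) -- not the paper's exact interface. */
#include <shmem.h>
#include <cuda_runtime.h>

int main(void) {
    shmem_init();
    int mype = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric destination buffer on the standard OpenSHMEM heap. */
    long *dst = (long *) shmem_malloc(sizeof(long));

    /* Source buffer resident in GPU memory. */
    long *gpu_src;
    cudaMalloc((void **) &gpu_src, sizeof(long));
    long val = (long) mype;
    cudaMemcpy(gpu_src, &val, sizeof(long), cudaMemcpyHostToDevice);

    /* 8-byte put: local GPU buffer -> next PE's symmetric buffer. */
    shmem_put64(dst, gpu_src, 1, (mype + 1) % npes);
    shmem_quiet();        /* wait for remote completion */
    shmem_barrier_all();

    cudaFree(gpu_src);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}

In a conventional (non-CUDA-aware) OpenSHMEM stack, the GPU source would first have to be staged through a host buffer with cudaMemcpy before the put; eliminating that staging step is what the intra-/inter-node latency numbers above measure.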
