CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications
  • Authors: Yi Yang (1)
    Chao Li (2)
    Huiyang Zhou (2)

    1. Department of Computing Systems Architecture, NEC Laboratories America, Princeton, NJ 08540, U.S.A.
    2. Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27606, U.S.A.
  • Keywords: GPGPU; nested parallelism; compiler; local memory
  • Journal: Journal of Computer Science and Technology
  • Publication date: January 2015
  • Volume: 30
  • Issue: 1
  • Pages: 3-19
  • Full-text size: 1,172 KB
  • Journal category: Computer Science
  • Journal subjects: Computer Science, general
    Software Engineering
    Theory of Computation
    Data Structures, Cryptology and Information Theory
    Artificial Intelligence and Robotics
    Information Systems Applications and The Internet
    Chinese Library of Science
  • Publisher: Springer Boston
  • ISSN: 1860-4749
Abstract
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
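The abstract describes the CUDA-NP workflow only at a high level: the developer marks a parallelizable loop inside a kernel with an OpenMP-like pragma, and the compiler rewrites the kernel so that extra threads, launched up front, are activated by control flow only for that loop. The sketch below is a hand-written illustration of that idea, not the paper's actual pragma syntax or compiler output; the kernel, the pragma spelling, NP_FACTOR, and BLOCK_X are all assumptions introduced for the example.

    // Hypothetical input kernel (illustrative, not from the paper): each thread
    // computes one row sum, and the inner loop over columns is the parallel loop
    // that a CUDA-NP-style pragma would mark. The pragma spelling is assumed.
    __global__ void row_sum(const float* in, float* out, int rows, int n)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        float sum = 0.0f;
        // #pragma np parallel for reduction(+:sum)   // developer annotation (assumed syntax)
        for (int j = 0; j < n; ++j)
            sum += in[row * n + j];
        out[row] = sum;
    }

    // Hand-written sketch of a transformed kernel: NP_FACTOR "slave" threads per
    // task are launched up front (blockDim.y == NP_FACTOR) and activated only for
    // the annotated loop; the sequential part runs on the master thread
    // (threadIdx.y == 0). NP_FACTOR and BLOCK_X are assumptions for the example.
    #define NP_FACTOR 8
    #define BLOCK_X   64   // assumed launch: dim3 block(BLOCK_X, NP_FACTOR)

    __global__ void row_sum_np(const float* in, float* out, int rows, int n)
    {
        __shared__ float partial[BLOCK_X][NP_FACTOR];

        int row  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.y;                       // 0 .. NP_FACTOR-1

        // Parallel section: every slave thread covers a strided share of the loop.
        float sum = 0.0f;
        if (row < rows)
            for (int j = lane; j < n; j += NP_FACTOR)
                sum += in[row * n + j];
        partial[threadIdx.x][lane] = sum;
        __syncthreads();

        // Sequential section: only the master thread combines the partial sums
        // (the reduction primitive mentioned in the abstract).
        if (lane == 0 && row < rows) {
            float total = 0.0f;
            for (int k = 0; k < NP_FACTOR; ++k)
                total += partial[threadIdx.x][k];
            out[row] = total;
        }
    }

Launched as row_sum_np<<<(rows + BLOCK_X - 1) / BLOCK_X, dim3(BLOCK_X, NP_FACTOR)>>>(in, out, rows, n), this version keeps master-slave communication in on-chip shared memory rather than global memory, which is the advantage the abstract claims for CUDA-NP over dynamic parallelism.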
