Multiple Target Task Sharing Support for the OpenMP Accelerator Model
  • Journal title: Lecture Notes in Computer Science
  • Publication year: 2016
  • Volume: 9903
  • Issue: 1
  • Pages: 268-280
  • Full-text size: 1,307 KB
  • Author affiliations: Guray Ozen (16) (17)
    Sergi Mateo (16) (17)
    Eduard Ayguadé (16) (17)
    Jesús Labarta (16) (17)
    James Beyer (18)

    16. Universitat Politècnica de Catalunya (UPC–BarcelonaTECH), Barcelona, Spain
    17. Barcelona Supercomputing Center (BSC-CNS), Barcelona, Spain
    18. Nvidia Corporation, Santa Clara, USA
  • Book series title: OpenMP: Memory, Devices, and Tasks
  • ISBN: 978-3-319-45550-1
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
  • Volume sequence: 9903
Abstract
The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to use them concurrently, which usually results in accelerated programs that leave the potential computational performance of the host unexploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e., accelerators in conjunction with the vector and heavily multithreaded capabilities of multicore processors). The compiler is responsible for generating device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of the tasks onto the available devices. The newly proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8–20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.
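
The abstract describes a runtime that dynamically distributes instances of the same task region across several device kinds. The sketch below is only a toy model of that idea, not the OmpSs scheduler or the proposed clause: it assumes each task instance has a fixed, known cost per device and that an idle device simply pulls the next pending task. The function name `schedule`, the device names and the cost values are all hypothetical and chosen for illustration.

```python
import heapq


def schedule(task_count, device_costs):
    """Toy dynamic scheduler (illustrative assumption, not the OmpSs runtime).

    task_count   -- number of identical task instances to run
    device_costs -- maps a hypothetical device name to the time one
                    task instance takes on that device
    Returns (tasks per device, total completion time).
    """
    # Min-heap of (time at which the device becomes idle, device name).
    idle_queue = [(0.0, name) for name in sorted(device_costs)]
    heapq.heapify(idle_queue)

    assignment = {name: 0 for name in device_costs}
    makespan = 0.0

    for _ in range(task_count):
        # The device that becomes idle first takes the next task.
        free_at, name = heapq.heappop(idle_queue)
        finish = free_at + device_costs[name]
        assignment[name] += 1
        makespan = max(makespan, finish)
        heapq.heappush(idle_queue, (finish, name))

    return assignment, makespan


if __name__ == "__main__":
    # Made-up costs: the host is assumed to be 4x slower per task.
    gpu_only = schedule(10, {"gpu": 1.0})
    hybrid = schedule(10, {"gpu": 1.0, "cpu": 4.0})
    print("GPU only:", gpu_only)
    print("GPU + CPU:", hybrid)
```

In this model the slower host still shortens the total completion time because it absorbs a few task instances while the GPU processes the rest. Data-transfer costs and NUMA placement, which the abstract credits for part of the measured gains, are deliberately left out.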
