GPU-accelerated adjoint algorithmic differentiation
Abstract
Many scientific problems such as classifier training or medical image reconstruction can be expressed as the minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives in a dedicated data structure referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size, both in general and per thread. We show how these limitations can be mitigated if the cost function is expressed using GPU-accelerated vector and matrix operations that are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed the use of optimized parallel libraries during the forward and reverse passes, which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
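To make the memory issue concrete, the following minimal sketch (written for this summary; it is not part of AD-GPU and all names are illustrative) shows a scalar tape: every elementary operation appends an entry holding its operand indices and local partial derivatives, and the reverse pass traverses these entries backwards to accumulate adjoints. Because one entry is stored per scalar operation, the tape grows with the operation count, which is exactly the behavior the vectorized intrinsics are designed to avoid.

#include <cstdio>
#include <vector>

// One tape entry per elementary operation: indices of up to two inputs and
// the local partial derivatives of the result with respect to them.
struct TapeEntry {
    int lhs, rhs;         // input indices, -1 if unused
    double d_lhs, d_rhs;  // local partial derivatives
};

struct Tape {
    std::vector<TapeEntry> entries;
    std::vector<double> values;

    int variable(double v) { return push(v, -1, -1, 0.0, 0.0); }
    int add(int a, int b)  { return push(values[a] + values[b], a, b, 1.0, 1.0); }
    int mul(int a, int b)  { return push(values[a] * values[b], a, b, values[b], values[a]); }

    int push(double v, int a, int b, double da, double db) {
        values.push_back(v);
        entries.push_back({a, b, da, db});
        return (int)values.size() - 1;
    }

    // Reverse pass: seed the result with 1 and propagate adjoints backwards.
    std::vector<double> adjoints(int result) const {
        std::vector<double> adj(values.size(), 0.0);
        adj[result] = 1.0;
        for (int i = (int)entries.size() - 1; i >= 0; --i) {
            if (entries[i].lhs >= 0) adj[entries[i].lhs] += adj[i] * entries[i].d_lhs;
            if (entries[i].rhs >= 0) adj[entries[i].rhs] += adj[i] * entries[i].d_rhs;
        }
        return adj;
    }
};

int main() {
    Tape t;
    int x = t.variable(3.0), y = t.variable(2.0);
    int f = t.add(t.mul(x, x), t.mul(x, y));                  // f = x*x + x*y
    std::vector<double> adj = t.adjoints(f);
    std::printf("df/dx = %g, df/dy = %g\n", adj[x], adj[y]);  // 8 and 3
    return 0;
}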

Program summary

Program title: AD-GPU

Catalogue identifier: AEYX_v1_0

Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEYX_v1_0.html

Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland

Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html

No. of lines in distributed program, including test data, etc.: 16715

No. of bytes in distributed program, including test data, etc.: 143683

Distribution format: tar.gz

Programming language: C++ and CUDA.

Computer: Any computer with a compatible C++ compiler and a GPU with CUDA capability 3.0 or higher.

Operating system: Windows 7 or Linux.

RAM: 16 Gbyte

Classification: 4.9, 4.12, 6.1, 6.5.

External routines: CUDA 6.5, Intel MKL (optional) and routines from BLAS, LAPACK and CUBLAS

Nature of problem: Gradients are required for many optimization problems, e.g. classifier training or nonlinear image reconstruction. Often, the function whose gradient is required can be implemented as a computer program; algorithmic differentiation methods can then be used to compute the gradient. Depending on the approach, this may result in excessive requirements on computational resources, i.e. memory and arithmetic operations. GPUs provide massive computational resources but require special considerations to distribute the workload onto many light-weight threads.
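As a sketch of how such a gradient is consumed (independent of AD-GPU; the quadratic cost below is merely a placeholder for a user-defined cost function whose gradient would in practice be obtained by algorithmic differentiation), a basic steepest-descent loop repeatedly evaluates the gradient and steps against it:

#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder cost f(x) = sum_i (x_i - 1)^2 with analytic gradient 2*(x_i - 1).
// In the setting described above, the gradient would instead be computed by
// adjoint algorithmic differentiation of the cost function's implementation.
static double cost(const std::vector<double>& x) {
    double f = 0.0;
    for (double xi : x) f += (xi - 1.0) * (xi - 1.0);
    return f;
}

static std::vector<double> gradient(const std::vector<double>& x) {
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) g[i] = 2.0 * (x[i] - 1.0);
    return g;
}

int main() {
    std::vector<double> x(4, 0.0);                // start at the origin
    for (int k = 0; k < 100; ++k) {               // steepest descent, fixed step size
        std::vector<double> g = gradient(x);
        for (std::size_t i = 0; i < x.size(); ++i) x[i] -= 0.1 * g[i];
    }
    std::printf("final cost: %g\n", cost(x));     // converges towards 0
    return 0;
}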

Solution method: Adjoint algorithmic differentiation allows efficient computation of gradients of cost functions given as computer programs. In theory, the gradient can be computed using a number of arithmetic operations similar to that of a single function evaluation. Optimal usage of parallel processors and limited memory is a major challenge, which can be mitigated by the use of vectorization.
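The following sketch (again illustrative, not the AD-GPU data structures) indicates why vectorization helps: a matrix-vector product y = A*x is treated as a single intrinsic, so only one tape record referencing A and x is needed, and both the forward pass and the adjoint rule x_adj += A^T * y_adj can each be executed as one optimized library call (e.g. BLAS or CUBLAS) instead of backpropagating through rows*cols scalar multiplications.

#include <cstddef>
#include <vector>

// Dense matrix in row-major storage (illustrative; a real implementation would
// dispatch these operations to optimized BLAS or CUBLAS kernels).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> a;   // rows*cols entries
};

// Forward pass of the vectorized intrinsic y = A*x. One tape record suffices
// for this whole operation, instead of rows*cols scalar records.
std::vector<double> matvec(const Matrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < A.cols; ++j)
            y[i] += A.a[i * A.cols + j] * x[j];
    return y;
}

// Reverse pass of the same intrinsic: given the adjoint of y, accumulate the
// adjoint of x via x_adj += A^T * y_adj, again as a single vectorized call.
void matvec_adjoint(const Matrix& A, const std::vector<double>& y_adj,
                    std::vector<double>& x_adj) {
    for (std::size_t j = 0; j < A.cols; ++j)
        for (std::size_t i = 0; i < A.rows; ++i)
            x_adj[j] += A.a[i * A.cols + j] * y_adj[i];
}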

Restrictions: To use the GPU-accelerated adjoint algorithmic differentiation method, the cost function must be implemented using the provided AD-GPU intrinsics for matrix and vector operations.
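To indicate what it means to express a cost function in terms of such intrinsics, here is a minimal, self-contained sketch; the function names are hypothetical CPU placeholders and do not correspond to the actual AD-GPU API, where the analogous operations would be recorded on the tape and executed on the GPU.

#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder "intrinsic" operations on plain vectors (hypothetical names).
static std::vector<double> axpby(double a, const std::vector<double>& x,
                                 double b, const std::vector<double>& y) {
    std::vector<double> r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = a * x[i] + b * y[i];
    return r;
}
static double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Cost f(x) = ||x - b||^2, written purely in terms of the intrinsics above,
// so that no scalar loop over the data appears in user code.
static double cost(const std::vector<double>& x, const std::vector<double>& b) {
    std::vector<double> r = axpby(1.0, x, -1.0, b);
    return dot(r, r);
}

int main() {
    std::vector<double> x = {1.0, 2.0, 3.0}, b = {0.0, 0.0, 0.0};
    std::printf("cost = %g\n", cost(x, b));   // prints 14
    return 0;
}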

Unusual features: GPU acceleration.

Additional comments: The code uses some features of C++11, e.g. std::shared_ptr. Alternatively, the Boost library can be used.

Running time: A few minutes for the example program, or up to a few hours to reproduce the performance measurements.
