Research on GPU Acceleration of Krylov Subspace Methods
Abstract
In recent years, general-purpose graphics processing units (GPGPUs) have developed rapidly in architecture and programming models, with steadily improving programmability, while the single-precision floating-point peak performance of GPUs has grown from a few Gflop/s to several Tflop/s. GPUs are increasingly used for general-purpose computation, and in particular for accelerating scientific computing applications. Within scientific computing, iterative methods for solving large sparse linear systems are of great practical importance in computational fluid dynamics, numerical weather prediction, seismic data processing for the oil industry, materials simulation and design, and other applications. Krylov subspace methods are effective algorithms for solving large sparse linear systems, with low storage requirements and fast convergence. However, solving sparse linear systems with these methods on a CPU is often very time-consuming, so accelerating them has both theoretical significance and practical value. Using a GPU to accelerate Krylov subspace methods for large sparse linear systems shortens the computation cycle, raises computational efficiency, and helps speed up the solution of many practical problems.
Based on an analysis of the GPU architecture and the CUDA programming model, this thesis examines the technical difficulties of porting Krylov subspace methods to NVIDIA GPUs: the storage wasted on zero elements of sparse matrices, the implementation of sparse matrix-vector multiplication on the GPU, the implementation of the inner product on the GPU, the performance lost to kernel-launch overhead when the algorithm's formulas are mapped to the GPU, the absence of a built-in reduction operation on NVIDIA GPUs, and the need for GPU-specific optimization techniques. The algorithms were then accelerated on NVIDIA GPUs with the CUDA programming model in a cooperative CPU-GPU scheme, with the following results:
1. Algorithm optimization for the CUDA programming model. The architectural features of NVIDIA GPUs are analyzed in detail. A key feature is the complex memory hierarchy: registers, shared memory, local memory, global memory, and two read-only memories, texture memory and constant memory. These memories differ considerably in capacity and speed, so an optimized algorithm must exploit the fast on-chip memories such as shared memory, texture memory, and even registers. A CUDA concept closely tied to the hardware is the warp: when an NVIDIA GPU actually runs a program, the warp is the true unit of execution. Our optimizations therefore start from memory access, arranging thread accesses to satisfy the coalescing requirements of a warp or half-warp, as the sketch below illustrates.
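The following minimal sketch is our illustration, not code from the thesis: it shows the access pattern this optimization aims for, in which consecutive threads touch consecutive, aligned addresses so that each half-warp's loads and stores coalesce into a single global-memory transaction on this hardware generation.

```cuda
// Illustrative only: thread i handles element i, so the 16 threads of a
// half-warp read 16 consecutive, aligned floats, which the hardware
// coalesces into one global-memory transaction.
__global__ void scale_coalesced(const float *in, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];   // coalesced read and coalesced write
}
```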
2. The CDIA compressed storage format for sparse diagonal matrices. Building on the DIA storage format, a new compressed format, CDIA, is designed that achieves high data-transfer efficiency and satisfies CUDA's requirements for aligned data access. To meet CUDA's coalesced-access requirements, the CDIA format applies a transposition to the compressed matrix, which greatly improves the GPU's memory-access performance and thus plays an important role in the core operations of the Krylov subspace methods. For sparse diagonal matrix-vector multiplication, the CDIA format outperforms the plain DIA format by 50.3%.
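The abstract does not spell out the exact data layout, so the following is a hypothetical reconstruction from the description above; the name CdiaMatrix, the pitch field, and the padding to a multiple of 16 are our assumptions. DIA stores each diagonal contiguously; CDIA transposes the diagonal array so that the row index varies fastest, which is what lets consecutive threads read consecutive addresses.

```cuda
/* Hypothetical CDIA layout (names and padding are our assumptions).
 * The element of diagonal d in row r is stored at data[d * pitch + r],
 * so consecutive rows (consecutive threads) read consecutive,
 * aligned addresses. */
typedef struct {
    int    rows;     /* matrix dimension                          */
    int    ndiags;   /* number of stored diagonals                */
    int    pitch;    /* rows rounded up to a multiple of 16       */
    int   *offsets;  /* offset of each diagonal from the main one */
    float *data;     /* ndiags * pitch floats, transposed layout  */
} CdiaMatrix;
```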
3. GPU acceleration of sparse diagonal matrix-vector multiplication. The matrix is compressed with the CDIA format and the computation is divided among threads as fine-grained tasks. Experiments on an NVIDIA C1060 platform with the CUDA programming model show a clear performance gain for matrix-vector multiplication: with optimized memory access and data transfer, the test data reach up to 39.6 Gflop/s in single precision and 19.6 Gflop/s in double precision, improving on the results of Nathan Bell and Michael Garland by 7.6% and 17.4%, respectively.
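A minimal SpMV kernel over such a layout might look as follows; this is again our sketch, assuming the hypothetical CdiaMatrix layout above and a square matrix. One thread computes one row (the fine-grained task distribution), and for each diagonal the threads of a half-warp load consecutive elements of data, so the accesses coalesce.

```cuda
// Sketch of diagonal SpMV, one thread per row, assuming the hypothetical
// transposed layout data[d * pitch + row] and a square rows-by-rows matrix.
__global__ void spmv_cdia(int rows, int ndiags, int pitch,
                          const int *offsets, const float *data,
                          const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float sum = 0.0f;
    for (int d = 0; d < ndiags; ++d) {
        int col = row + offsets[d];          // column hit by diagonal d
        if (col >= 0 && col < rows)          // diagonal defined at this row?
            sum += data[d * pitch + row] * x[col];
    }
    y[row] = sum;
}
```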
4. Implementation and optimization of the inner product on the GPU. The inner product is built on a reduction: the elementwise multiplications are performed first and the partial results are then reduced in shared memory, which performs well on the GPU. In single precision the implementation reaches a speedup of up to 36.2x over the CPU.
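A minimal sketch of this multiply-then-reduce scheme follows; it is our illustration of the standard shared-memory reduction pattern, and the thesis's actual kernel may differ. Each thread accumulates a strided partial sum of x[i]*y[i], the block reduces those partials in shared memory with a tree reduction, and one value per block remains to be summed on the host or by a second kernel.

```cuda
// Multiply-then-reduce inner product (illustrative). Launch with a
// power-of-two block size and blockDim.x * sizeof(float) bytes of
// dynamic shared memory; block_sums then holds one partial per block.
__global__ void dot_partial(const float *x, const float *y,
                            float *block_sums, int n)
{
    extern __shared__ float cache[];
    int tid    = threadIdx.x;
    int i      = blockIdx.x * blockDim.x + tid;
    int stride = gridDim.x * blockDim.x;

    float t = 0.0f;
    for (; i < n; i += stride)          // multiply phase
        t += x[i] * y[i];
    cache[tid] = t;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = cache[0];
}
```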
5. Optimization strategies for Krylov subspace methods on the GPU. The optimizations address the program structure, host-device communication, and the memory hierarchy of the CUDA programming model: the fast on-chip memory of the NVIDIA GPU, chiefly shared memory, is used to improve memory-access efficiency, and pinned memory is used to raise the communication bandwidth. Together these measures substantially improve the overall performance of the algorithm. Tested on an NVIDIA C1060 with CUDA 3.0, the algorithm reaches 11.3 Gflop/s in double precision and 16.7 Gflop/s in single precision, for speedups of 9.2x and 13.57x over an Intel quad-core Q6600 processor with level-3 optimizing compilation.
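For the pinned-memory part of this strategy, a minimal host-side sketch follows. cudaMallocHost, cudaFreeHost, and cudaMemcpy are the standard CUDA runtime calls; the surrounding function is illustrative. Page-locked host buffers let host-device copies run as DMA transfers, which is what raises the effective bandwidth.

```cuda
#include <cuda_runtime.h>

// Illustrative skeleton: a pinned (page-locked) host buffer lets
// cudaMemcpy DMA directly to and from the device at full bus bandwidth.
void transfer_with_pinned_memory(int n)
{
    float *h_x, *d_x;
    size_t bytes = n * sizeof(float);

    cudaMallocHost((void **)&h_x, bytes);  // pinned host buffer
    cudaMalloc((void **)&d_x, bytes);      // device buffer

    // ... fill h_x, copy to the device, launch kernels, copy back ...
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFreeHost(h_x);
}
```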
The tests show that the algorithm's performance varies considerably with the matrix size, a consequence of the GPU's architectural characteristics. The measured data indicate that the GPU offers a substantial advantage in accelerating Krylov subspace methods for solving linear systems.
References
[1] S. Barrachina, M. Castillo, F. Igual, et al. Solving dense linear systems on graphics processors. In: Proceedings of Euro-Par 2008. Springer-Verlag Berlin Heidelberg, 2008: 739-748.
[2] Maribel Castillo, Ernie Chan, Francisco D. Igual, et al. Making parallel programming synonymous with programming for linear algebra libraries. The University of Texas at Austin, 2009: 241-1512.
[3] Nico Galoppo, Naga K. Govindaraju, Michael Henson. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2005: 3.
[4] Fermi. www.nvidia.com/object/Fermi_architecture.html, 2009.10.1/2009.11.18.
[5] Vasily Volkov, James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. University of California: EECS Department, 2008: 1-11.
[6] A Zonenberg. Distributed Hash Cracker: A Cross-Platform GPU-Accelerated Password Recovery System. Rensselaer Polytechnic Institute, 2009: 1-6.
[7] JP Walters, V Balu, V Chaudhary, et al. Accelerating Molecular Dynamics Simulations with GPUs. Journal of Computational Physics, 2007, 221(2): 799-804.
[8] T Scheuermann, J Hensley. Efficient histogram generation using scattering on GPUs. In: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007: 33-37.
[9] X Liu, T Gu, X Hang, et al. A parallel version of QMRCGSTAB method for large linear systems in distributed parallel environments. Applied Mathematics and Computation, 2006, 172(2): 744-752.
[10] Li Xiaomei, Wu Jianping. Krylov subspace methods and their parallel computation. Computer Science, 2005.
[11] Brown P N. A Theoretical Comparison of the Arnoldi and GMRES Algorithms. SIAM J Sci Stat Comput, 1991, 12(1): 58-78.
[12] Chan T F, et al. A Quasi-minimal Residual Variant of the Bi-CGSTAB Algorithm for Nonsymmetric Systems. SIAM J Sci Comput, 1994, 15(2): 338-347.
[13] Freund R W. A Transpose-free Quasi-minimal Residual Algorithm for Non-Hermitian Linear Systems. SIAM J Sci Comput, 1993, 14(2): 470-482.
[14] Freund R W, Nachtigal N M. QMR: A Quasi-minimal Residual Method for Non-Hermitian Linear Systems. Numer Math, 1991: 315-339.
[15] Kasenally E M. GMBACK: A Generalised Minimum Backward Error Algorithm for Nonsymmetric Linear Systems. SIAM J Sci Comput, 1995, 16(3): 698-719.
[16] Saad Y. Krylov Subspace Methods for Solving Large Unsymmetric Linear Systems. Mathematics of Computation, 1981, 37(155): 105-126.
[17] Saad Y. The Lanczos Biorthogonalization Algorithm and Other Oblique Projection Methods for Solving Large Unsymmetric Systems. SIAM J Numer Anal, 1982, 19(3): 485-506.
[18] Saad Y. Practical Use of Some Krylov Subspace Methods for Solving Indefinite and Nonsymmetric Linear Systems. SIAM J Sci Stat Comput, 1984, 5(1): 203-227.
[19] Saunders M A, et al. Two Conjugate-gradient-type Methods for Unsymmetric Linear Equations. SIAM J Numer Anal, 1988, 25(4).
[20] Sonneveld P. CGS, A Fast Lanczos-type Solver for Nonsymmetric Linear Systems. SIAM J Sci Stat Comput, 1989: 36-52.
[21] Van der Vorst H A. Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM J Sci Stat Comput, 1992, 13(2): 631-644.
[22] Walker H F. Implementation of the GMRES Method Using Householder Transformations. SIAM J Sci Stat Comput, 1988, 9(1): 152-163.
[23] L. F. Romero, E. L. Zapata. Data Distributions for Sparse Matrix Vector Multiplication. April 1995.
[24] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. November 25, 1996.
[25] Samuel Williams, Leonid Oliker. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. 2007.
[26] Glenn Lupton, Don Thulin. Accelerating HPC Using GPU's. High Performance Computing Division, Hewlett-Packard Company, June 13, 2008.
[27] Genna Cummins, Rob Adams. Scientific Computation Through a GPU. IEEE, 2008.
[28] Michael Garland. Sparse Matrix Computations on Manycore GPU's. NVIDIA Corporation, 2008.
[29] Rudiger Weiss. A theoretical overview of Krylov subspace methods. Rechenzentrum der Universität Karlsruhe, 1995: 1-10.
[30] Sanjay Velamparambil, Sarah MacKinnon-Cormier. GPU Accelerated Krylov Subspace Methods for Computational Electromagnetics. Acceleware Corporation, 2008.
[31] Manfred Liebmann. Algebraic Multigrid Methods on Graphics Processing Units. Institute for Mathematics and Scientific Computing, University of Graz, September 2008.
[32] Matthias Christen, Olaf Schenk. General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform. June 2008.
[33] Burkhard Zink. A General Relativistic Evolution Code on CUDA Architectures. Center for Computation & Technology, 2008.
[34] S Tomov, J Dongarra, M Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. The University of Manchester, 2008: 1-13.
[35] Vasily Volkov, James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In: SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008: 1-11.
[36] L Buatois, G Caumon, B Lévy. Concurrent number cruncher: An efficient sparse linear solver on the GPU. In: Proceedings of the High Performance Computation Conference 2007 (HPCC'07), Lecture Notes in Computer Science (LNCS), 2007: 358-371.
[37] Y Saad. SPARSKIT: A basic tool kit for sparse matrix computation, 1994.
[38] N Bell, M Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, 2008.
[39] Dong Yanxing, Wang Long, Chi Xuebin. GPU acceleration of two-dimensional diffusion equations. Computer Engineering & Science, 2009, 31(11): 121-123.
[40] Development trends and outlook of high-performance computers in China, 2009. In: 2009 National Annual Conference on High Performance Computing, 2009.
[41] Zhang Shu, Chu Yanli, Zhao Kaiyong, Zhang Yubo. GPU High-Performance Computing with CUDA. Beijing: China Water Power Press, 2009: 1-8.
[42] D Kanter. NVIDIA's GT200: Inside a Parallel Processor. NVIDIA, 2008: 1-26.
[43] http://news.mydrivers.com/1/145/145916_2.htm, 2009.10.1.
[44] NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 3.0, 2010.
[45] A. Basermann, B. Reichel, C. Schelthoff. Preconditioned CG methods for sparse matrices on massively parallel machines. Parallel Computing, 1997, 23: 381-398.
[46] Roland W. Freund, Gene H. Golub, Noel M. Nachtigal. Iterative Solution of Linear Systems. Numerical Analysis Project manuscript NA-91-05, November 1991.
[47] Ge Zhen. Research on GPU acceleration of the PQMRCGSTAB algorithm: dissertation. Changsha: College of Computer, National University of Defense Technology, 2009.
[48] Liu Jie, Chi Lihua, Hu Qingfeng, Li Xiaomei. A load-balancing algorithm for parallel sparse matrix-vector multiplication. Computer Engineering & Science, 2006.
