Matrix-Vector Multiplication Parallel Algorithm and Implementation for OpenCL Architecture
  • Chinese title: 一种面向OpenCL架构的矩阵-向量乘并行算法与实现
  • Authors (Chinese): 肖汉; 周清雷; 姚鹏姿
  • Authors: XIAO Han; ZHOU Qing-lei; YAO Peng-zi
  • Keywords (Chinese): 矩阵-向量乘; 图形处理器; 开放式计算语言; 并行算法
  • Keywords: matrix-vector multiplication; GPU; OpenCL; parallel algorithm
  • Journal: Journal of Chinese Computer Systems (小型微型计算机系统)
  • Journal code: XXWX
  • Affiliations: School of Information Science and Technology, Zhengzhou Normal University; School of Information Engineering, Zhengzhou University
  • Publication date: 2019-01-15
  • Year: 2019
  • Volume: 40
  • Issue: 01
  • Pages: 28-32 (5 pages)
  • Funding: National Natural Science Foundation of China (61572444, 61250007)
  • Language: Chinese
  • CNKI record number: XXWX201901006
  • CN: 21-1106/TP
Abstract
The time complexity of matrix-vector multiplication is high, and traditional computing methods have difficulty guaranteeing real-time and cross-platform performance. This paper presents a matrix-vector multiplication parallel algorithm based on Open Computing Language (OpenCL), in which the matrix-vector multiplication process is decomposed into several subtasks of different granularity. According to the corresponding degree of parallelism, each work-group computes the product of a row block of the matrix with the column vector, and each work-item computes the product of one row vector within that row block with the column vector, so that the computation tasks are assigned to compute units and processing elements respectively. Experimental results show that, compared with a CPU-based serial algorithm, an OpenMP-based parallel algorithm, and a parallel algorithm based on Compute Unified Device Architecture (CUDA), the matrix-vector multiplication parallel algorithm achieves speedups of 20.86x, 6.39x, and 1.49x respectively on an NVIDIA Graphics Processing Unit (GPU) computing platform under the OpenCL architecture. This verifies the effectiveness and performance portability of the proposed parallel optimization method.
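The decomposition described in the abstract maps naturally onto an OpenCL kernel. The sketch below is a minimal illustration of that mapping, not the paper's actual implementation: each work-group covers a contiguous block of matrix rows and each work-item computes the dot product of one row with the column vector. The kernel name matvec_rowblock and the row-major storage layout are illustrative assumptions, not taken from the paper.

    /* Minimal sketch (not the paper's code): one work-item per matrix row.
       A work-group's work-items together cover one row block, so the group
       computes (row block) x vector while each work-item computes one
       (row) . (vector) dot product. Row-major storage is assumed. */
    __kernel void matvec_rowblock(__global const float *A,  /* matrix, rows x cols */
                                  __global const float *x,  /* column vector, length cols */
                                  __global float       *y,  /* result vector, length rows */
                                  const int rows,
                                  const int cols)
    {
        int row = (int)get_global_id(0);
        if (row >= rows)
            return;

        float sum = 0.0f;
        for (int j = 0; j < cols; ++j)
            sum += A[row * cols + j] * x[j];
        y[row] = sum;
    }

With this mapping, the host would enqueue the kernel with a global work size equal to the number of rows (rounded up to a multiple of the work-group size) and a local work size that fixes the row-block granularity, so that work-groups are scheduled onto compute units and work-items onto processing elements, as the abstract describes.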
