A Coprocessor for Double-Precision Floating-Point Matrix Multiplication
  • Chinese title: 双精度浮点矩阵乘协处理器研究
  • Authors: Jia Xun (贾迅); Wu Guiming (邬贵明); Xie Xianghui (谢向辉); Wu Dong (吴东)
  • Affiliation: State Key Laboratory of Mathematical Engineering and Advanced Computing (数学工程与先进计算国家重点实验室)
  • Keywords: matrix multiplication; coprocessor; acceleration; floating-point; hardware customization
  • Journal: Journal of Computer Research and Development (计算机研究与发展), CNKI journal code JFYZ
  • Publication date: 2019-02-15
  • Year: 2019; Volume: 56; Issue: 02
  • Pages: 186-196 (11 pages)
  • CN: 11-1777/TP
  • Funding: National Natural Science Foundation of China (91430214, 61732018)
  • Language: Chinese
  • CNKI record number: JFYZ201902016
Abstract
Matrix multiplication is widely used across application domains, especially numerical computation, yet double-precision floating-point matrix multiplication achieves limited performance and efficiency on contemporary computing platforms such as CPUs, GPGPUs, and FPGAs, and it often becomes the performance bottleneck of large-scale numerical applications. To address this problem, this paper studies customized acceleration of double-precision floating-point matrix multiplication, taking a linear-array compute structure as the basic building block. First, the linear array is optimized with double buffering, and a memory-access schedule tailored to the double buffers is designed to improve computational efficiency. Second, the architectures of the matrix multiplication coprocessor and of a coprocessor-based accelerated computing system are proposed; a performance model of the coprocessor is built, and its architectural design space is explored. Finally, the functional correctness of the coprocessor is verified, and its hardware cost is evaluated under a mainstream technology node. Experimental results show that the proposed coprocessor achieves 3 TFLOPS of compute performance at 99% computational efficiency. Compared with an NVIDIA K40 GPGPU executing double-precision matrix multiplication, the coprocessor delivers 1.95× the performance at only 21.05% of the area. This work explores the application of customized accelerator structures in high-performance computing and offers a useful reference for improving the performance of existing computing systems.
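The key optimization named in the abstract, double buffering on a linear array, works by prefetching the next pair of matrix tiles into one buffer while the processing elements consume the other, so memory latency hides behind the multiply-accumulate pipeline. Below is a minimal Python sketch of that schedule, not the paper's hardware design: the tile size, matrix size, and helper names (fetch, dgemm_double_buffered) are illustrative assumptions. It also checks the arithmetic implied by the abstract's headline numbers.

import numpy as np

TILE = 4   # tile edge length (assumed)
N = 8      # matrix dimension, a multiple of TILE (assumed)

def fetch(A, B, ijk):
    """Model the memory stage: read one tile pair for step (i, j, k)."""
    i, j, k = ijk
    return A[i:i+TILE, k:k+TILE], B[k:k+TILE, j:j+TILE], ijk

def dgemm_double_buffered(A, B):
    """C = A @ B computed tile by tile with two ping-pong buffers.

    While the compute stage consumes buf[cur], the memory stage
    prefetches the next tile pair into buf[1 - cur]; in hardware this
    overlap is what keeps the linear array's MAC pipeline busy."""
    C = np.zeros((N, N))
    sched = [(i, j, k)
             for i in range(0, N, TILE)
             for j in range(0, N, TILE)
             for k in range(0, N, TILE)]
    buf = [None, None]
    cur = 0
    buf[cur] = fetch(A, B, sched[0])            # warm-up fill
    for t in range(len(sched)):
        if t + 1 < len(sched):                  # prefetch overlaps compute
            buf[1 - cur] = fetch(A, B, sched[t + 1])
        a, b, (i, j, _) = buf[cur]
        C[i:i+TILE, j:j+TILE] += a @ b          # the linear array's work
        cur = 1 - cur                           # swap ping-pong buffers
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.random((N, N)), rng.random((N, N))
    assert np.allclose(dgemm_double_buffered(A, B), A @ B)

    # Arithmetic implied by the abstract: 3 TFLOPS at 99% efficiency
    # gives a peak of about 3.03 TFLOPS, and 1.95x a K40 on DGEMM
    # implies the K40 sustains about 1.54 TFLOPS in double precision.
    achieved, eff, speedup = 3.0, 0.99, 1.95
    print(f"implied peak performance: {achieved / eff:.2f} TFLOPS")
    print(f"implied K40 DGEMM rate:   {achieved / speedup:.2f} TFLOPS")

In the real coprocessor the two stages run concurrently; the sequential loop above only models the schedule's data flow and verifies that the ping-pong ordering still produces the correct product.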
