An FPGA Implementation Method for a Configurable CNN Co-Accelerator (一种可配置的CNN协加速器的FPGA实现方法)
  • English title: An FPGA Implementation Method for Configurable CNN Co-Accelerator
  • Authors: 蹇强 (JIAN Qiang); 张培勇 (ZHANG Pei-yong); 王雪洁 (WANG Xue-jie)
  • Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; Zhejiang University City College
  • Keywords: convolutional neural network; FPGA; embedded system; convolution; parallel algorithm
  • Journal: 电子学报 (Acta Electronica Sinica); journal abbreviation: DZXU
  • Publication date: 2019-07-15
  • Year: 2019
  • Volume/Issue: v.47; No.437
  • Funding: Research on key technologies for on-chip sub-picosecond-precision signal measurement for 14 nm and below process nodes (No.61474098); Rapid defect detection for integrated-circuit wafers at 10 nm and below process nodes (No.61674129)
  • Language: Chinese
  • Article ID: DZXU201907017
  • CN: 11-2087/TN
  • Pages: 135-141 (7 pages)
Abstract
To address the long computation time of convolutional neural networks, caused chiefly by the high complexity of the convolution operation, an FPGA implementation method for a configurable CNN co-accelerator with an eight-stage pipeline structure is proposed. By embedding the pooling controller in the convolution controller, the computation module obtains more resources. A mirror-tree structure is designed to increase parallelism, and the Map algorithm is applied to raise computational density and speed up calculation at the same time. Experimental results show that this implementation reaches 22.74 GOPS at 32-bit fixed/floating point. Compared with the MAPLE accelerator, computational density is increased by 283.3% and calculation speed by 224.9%; compared with the MCA (Memory-Centric Accelerator), computational density is increased by 14.47% and calculation speed by 33.76%. At precisions between 8-bit and 16-bit fixed point, performance reaches 58.3 GOPS, and computational density is 8.5% higher than that of the LBA (Layer-Based Accelerator).
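The resource-reuse idea in the abstract — embedding the pooling controller inside the convolution controller so that convolution results are pooled as soon as they are produced, rather than buffered as a full intermediate feature map — can be illustrated in software. The sketch below is a minimal Python analogy under stated assumptions, not the paper's hardware design; all function names are illustrative.

```python
# Illustrative sketch (not the paper's RTL): fusing 2x2 max-pooling into the
# convolution loop. Each pooled output cell is reduced as soon as one of its
# convolution results is available, so the full convolution output is never
# stored -- analogous in spirit to embedding the pooling controller in the
# convolution controller.

def conv2d(feature, kernel):
    """Plain valid convolution (cross-correlation) over a 2-D feature map."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(feature) - kh + 1
    ow = len(feature[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for r in range(kh):
                for c in range(kw):
                    acc += feature[i + r][j + c] * kernel[r][c]
            out[i][j] = acc
    return out

def conv2d_fused_maxpool(feature, kernel, pool=2):
    """Compute the max-pooled convolution output directly: each convolution
    result is folded into its pooled cell immediately, so only the small
    pooled buffer is kept instead of the whole convolution output."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(feature) - kh + 1
    ow = len(feature[0]) - kw + 1
    ph, pw = oh // pool, ow // pool
    out = [[float("-inf")] * pw for _ in range(ph)]
    for i in range(ph * pool):
        for j in range(pw * pool):
            acc = 0.0
            for r in range(kh):
                for c in range(kw):
                    acc += feature[i + r][j + c] * kernel[r][c]
            # Fused pooling step: reduce into the pooled cell right away.
            pi, pj = i // pool, j // pool
            if acc > out[pi][pj]:
                out[pi][pj] = acc
    return out
```

The fused version produces the same values as convolution followed by separate 2x2 max-pooling, while never materializing the intermediate map; in hardware this frees buffer resources for the computation module, which is the motivation the abstract gives for the controller reuse.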
References
[1] Farabet C, Poulet C, Han J Y, LeCun Y. CNP: An FPGA-based processor for convolutional networks[A]. 2009 International Conference on Field Programmable Logic and Applications[C]. Prague: IEEE, 2009. 32-37.
[2] Ji S, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
[3] Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation[A]. Proceedings of the 24th International Conference on Machine Learning[C]. New York: ACM, 2007. 473-480.
[4] Sankaradas M, et al. A massively parallel coprocessor for convolutional neural networks[A]. The 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors[C]. Boston: IEEE, 2009. 53-60.
[5] Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798-1828.
[6] Zhou Y, Jiang J. An FPGA-based accelerator implementation for deep convolutional neural networks[A]. The 4th International Conference on Computer Science and Network Technology (ICCSNT)[C]. China: IEEE, 2015. 829-832.
[7] Chen Y, et al. DaDianNao: a machine-learning supercomputer[A]. The 47th Annual IEEE/ACM International Symposium on Microarchitecture[C]. Cambridge: IEEE, 2014. 609-622.
[8] Roux S, Mamalet F, Garcia C. Embedded convolutional face finder[A]. 2006 IEEE International Conference on Multimedia and Expo[C]. Canada: IEEE, 2006. 285-288.
[9] Kamijo S, Matsushita Y, Ikeuchi K, et al. Traffic monitoring and accident detection at intersections[J]. IEEE Transactions on Intelligent Transportation Systems, 2000, 1(2): 108-118.
[10] Li N, Takaki S, Tomioka Y, Kitazawa H. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition[A]. 2016 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)[C]. USA: IEEE, 2016. 165-168.
[11] Cadambi S, Majumdar A, Becchi M, Chakradhar S, Graf H P. A programmable parallel accelerator for learning and classification[A]. Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques[C]. Austria: ACM, 2010. 273-284.
[12] Chakradhar S, Sankaradas M, Jakkula V, Cadambi S. A dynamically configurable coprocessor for convolutional neural networks[J]. ACM SIGARCH Computer Architecture News, 2010, 38(3): 247-257.
[13] Peemen M, Setio A A, Mesman B, Corporaal H. Memory-centric accelerator design for convolutional neural networks[A]. IEEE 31st International Conference on Computer Design (ICCD)[C]. USA: IEEE, 2013. 13-19.
[14] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[15] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[A]. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays[C]. USA: ACM, 2015. 161-170.
[16] Huang C, Ni S, Chen G. A layer-based structured design of CNN on FPGA[A]. IEEE 12th International Conference on ASIC (ASICON)[C]. Guiyang: IEEE, 2017. 1037-1040.
