Survey on Accelerating Neural Network with Hardware (硬件加速神经网络综述)
  • English title: Survey on Accelerating Neural Network with Hardware
  • Authors: Chen Guilin (陈桂林); Ma Sheng (马胜); Guo Yang (郭阳)
  • Affiliation: College of Computer, National University of Defense Technology (国防科技大学计算机学院)
  • Keywords: machine learning; neural network; general-purpose processor; special-purpose accelerator; architecture
  • Journal: Journal of Computer Research and Development (计算机研究与发展); journal code: JFYZ
  • Publication date: 2019-01-29
  • Year: 2019; Volume: 56; Issue: 02
  • Pages: 16-29 (14 pages)
  • Record ID: JFYZ201902002
  • CN: 11-1777/TP
  • Funding: National Natural Science Foundation of China (61672526); National University of Defense Technology Research Program (ZK17-03-06)
  • Language: Chinese
Abstract
Artificial neural networks are now widely used in artificial intelligence applications such as voice assistants, image recognition, and natural language processing. As these networks grow more complex, their computational cost rises sharply, and traditional general-purpose processors are limited by memory bandwidth and energy consumption when processing them. One response has been to modify the architecture of general-purpose processors to support efficient neural network processing; another is to develop special-purpose accelerators, which offer lower energy consumption and higher performance than general-purpose processors. This article introduces the support that current general-purpose processors and special-purpose accelerators provide for neural networks, and summarizes the design innovations and breakthroughs of the latest neural network acceleration platforms. Specifically, it first reviews the development of neural networks and discusses the improvements various general-purpose chips have made to support them, including support for low-precision operations and the addition of dedicated compute modules that speed up neural network processing. It then summarizes, from the perspectives of computational structure and storage structure, the customized architectural designs of special-purpose accelerators, and describes the dataflows adopted by individual accelerators according to how they reuse the different kinds of data in a neural network. Finally, by analyzing the advantages and disadvantages of existing accelerator chips, the article presents future design trends and challenges for neural network accelerators.
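The abstract repeatedly points to support for low-precision arithmetic as one of the main adaptations of general-purpose chips. The short Python sketch below is not from the paper; it illustrates, under illustrative scaling assumptions, how 8-bit quantized inference arithmetic works: weights and activations are mapped to int8 with per-tensor scales, the multiply-accumulate runs in integers with a wide (int32) accumulator, much as a hardware MAC array would, and a single rescale recovers an approximate floating-point result. The helper names (quantize, quantized_matvec) are hypothetical.

import numpy as np

# Minimal sketch (not from the paper): per-tensor symmetric int8 quantization.
def quantize(x, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                          # 127 for int8
    scale = max(float(np.max(np.abs(x))) / qmax, 1e-12)   # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Integer multiply-accumulate with a wide accumulator, then one rescale,
# mimicking the low-precision MAC units the abstract refers to.
def quantized_matvec(w, x):
    qw, sw = quantize(w)
    qx, sx = quantize(x)
    acc = qw.astype(np.int32) @ qx.astype(np.int32)       # int8 x int8 -> int32 sums
    return acc.astype(np.float32) * (sw * sx)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
# Quantization error stays small relative to the float32 reference result.
print(np.max(np.abs(quantized_matvec(w, x) - w @ x)))

On hardware, the same structure lets narrow integer multipliers replace 32-bit floating-point units, which is where the bandwidth and energy savings discussed in the survey come from.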
