3D Memristor Array Based Neural Network Processing in Memory Architecture
  • Chinese title: 基于3D忆阻器阵列的神经网络内存计算架构
  • Authors: Mao Haiyu; Shu Jiwu
  • Affiliation: Department of Computer Science and Technology, Tsinghua University
  • Keywords: 3D memristor array; processing in memory (PIM); neural network; peripheral circuit; wire interconnection
  • Journal: Journal of Computer Research and Development (计算机研究与发展)
  • Journal code: JFYZ
  • Publication date: 2019-06-15
  • Year: 2019
  • Volume: 56
  • Issue: 06
  • Pages: 19-30 (12 pages)
  • Article ID: JFYZ201906003
  • CN: 11-1777/TP
  • Funding: National Key Research and Development Program of China (2018YFB1003301); National Natural Science Foundation of China (61832011)
  • Language: Chinese
Abstract
Nowadays, due to the rapid development of artificial intelligence, memristor-based processing-in-memory (PIM) architectures for neural networks (NNs) have attracted wide research interest, since they perform far better than the traditional von Neumann architecture. Equipped with peripheral circuits that implement function units, a memristor array can process a forward propagation with high parallelism and far less data movement than a CPU or GPU. However, memristor-based PIM hardware suffers from the large area overhead of the peripheral circuits outside the memristor arrays and from non-trivial under-utilization of the function units. This paper proposes FMC (function-pool based memristor cube), a 3D memristor array based PIM architecture for NNs that gathers the peripheral circuits implementing function units into a function pool, which is shared by the memristor arrays stacked on top of it. We also propose a data mapping scheme for this 3D PIM architecture that further increases the utilization of the function units and reduces data transmission between memristor cubes. This software-hardware co-design not only makes full use of the function units but also shortens the wire interconnections, enabling high-performance and energy-efficient data transmission. Experiments show that FMC improves the utilization of the function units by up to 43.33 times when training a single neural network, and by up to 58.51 times when training multiple neural networks. Compared with a 2D-PIM design with the same number of compute arrays and storage arrays, FMC occupies only 42.89% of the area, while achieving an average 1.5x speedup and 1.7x energy saving over 2D-PIM.
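The forward-propagation claim rests on the standard crossbar observation: once weights are programmed as conductances, an entire matrix-vector product happens in one analog step (Ohm's law does the multiplies, Kirchhoff's current law does the sums), and only the activation is left to the peripheral function units. Below is a minimal NumPy sketch of that behavioral model; the differential two-array encoding and all names and constants are illustrative assumptions, not the paper's circuit.

```python
import numpy as np

def weights_to_conductances(w, g_min=1e-6, g_max=1e-4):
    """Map a signed weight matrix onto two non-negative conductance
    arrays (a positive and a negative crossbar), a common encoding."""
    w_max = float(np.abs(w).max()) or 1.0
    scale = (g_max - g_min) / w_max
    g_pos = g_min + scale * np.clip(w, 0, None)
    g_neg = g_min + scale * np.clip(-w, 0, None)
    return g_pos, g_neg, scale

def crossbar_mvm(g_pos, g_neg, v_in, scale):
    """One analog step: input voltages drive the rows, each column current
    is a dot product (Ohm's law multiply, Kirchhoff current-law sum)."""
    i_pos = v_in @ g_pos                 # column currents, positive array
    i_neg = v_in @ g_neg                 # column currents, negative array
    return (i_pos - i_neg) / scale       # differential pair restores sign

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64))           # one layer: 128 inputs -> 64 outputs
x = rng.normal(size=128)
g_pos, g_neg, scale = weights_to_conductances(w)
y = np.maximum(crossbar_mvm(g_pos, g_neg, x, scale), 0.0)   # ReLU done by a peripheral function unit
print(np.allclose(y, np.maximum(x @ w, 0)))                 # True: matches the digital MVM
```

Because the whole array switches at once, the crossbar pass costs one step regardless of matrix size, which is where the parallelism and data-movement advantage over CPUs and GPUs comes from.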
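The under-utilization argument can be made concrete with a toy occupancy model: a function unit dedicated to one crossbar is idle whenever that crossbar is not producing outputs, while a pool under the stack is sized to the aggregate demand of all crossbars above it. A back-of-the-envelope sketch, with duty cycles and array counts invented purely for illustration:

```python
import math

def dedicated_utilization(f_busy):
    # 2D-PIM: every crossbar owns its function unit, busy only f_busy of the time
    return f_busy

def pooled_utilization(n_arrays, f_busy):
    # FMC: one pool under the stack, sized to the aggregate demand of n_arrays
    pool_size = max(1, math.ceil(n_arrays * f_busy))
    return n_arrays * f_busy / pool_size

n, f = 64, 0.02                      # 64 stacked arrays, each 2% duty cycle (invented numbers)
u_2d, u_fmc = dedicated_utilization(f), pooled_utilization(n, f)
print(f"2D-PIM: {u_2d:.1%}  FMC pool: {u_fmc:.1%}  gain: {u_fmc / u_2d:.0f}x")
# -> 2D-PIM: 2.0%  FMC pool: 64.0%  gain: 32x -- the same order of improvement
#    as the paper's reported 43.33x (single NN) and 58.51x (multiple NNs)
```

The same pooling is what shrinks area: the peripheral circuits, which dominate a 2D tile, are instantiated once per cube instead of once per array.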
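The paper's data mapping scheme is only summarized above; as a flavor of why mapping matters, consecutive layers placed in the same cube exchange activations over short 3D interconnects and pay inter-cube transmission only at the cut points. A hypothetical greedy placement (layer sizes and cube capacity are made up, and this is not the paper's algorithm):

```python
def map_layers_to_cubes(layer_sizes, cube_capacity):
    """Greedy placement: pack consecutive layers into the same cube until it
    fills, so activations cross a cube boundary as rarely as possible."""
    mapping, cube, used = [], 0, 0
    for size in layer_sizes:
        if used and used + size > cube_capacity:   # spill to the next cube
            cube, used = cube + 1, 0
        mapping.append(cube)
        used += size
    return mapping

layers = [784 * 128, 128 * 64, 64 * 64, 64 * 10]   # weights per layer of a toy MLP
mapping = map_layers_to_cubes(layers, cube_capacity=110_000)
hops = sum(a != b for a, b in zip(mapping, mapping[1:]))
print(mapping, "inter-cube transfers:", hops)       # [0, 0, 1, 1] inter-cube transfers: 1
```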
