Abstract
Long short-term memory (LSTM) networks are a kind of recurrent neural network that excels at processing and predicting events separated by long intervals and delays in time series, and they are widely used in speech recognition, machine translation, and related fields. However, limited by memory bandwidth, the computation model of most existing neural network accelerators cannot process LSTM computation efficiently. The ReRAM crossbar structure, by contrast, can perform efficient, high-density matrix-vector multiplication in a processing-in-memory fashion, making it a highly promising accelerator design paradigm for LSTM networks. This paper studies a simulation tool for ReRAM-oriented LSTM neural network accelerators together with a corresponding neural network training algorithm. The tool simulates, in a clock-driven fashion, a designer-specified LSTM accelerator microarchitecture whose core acceleration component is the ReRAM crossbar, thereby enabling design-space exploration; the training algorithm is adapted to the characteristics of ReRAM devices. The tool is implemented in SystemC, and its core computation is GPU-accelerated, which speeds up the simulation of ReRAM devices and facilitates exploration of the design space.
Long short-term memory (LSTM) is widely used in fields such as speech recognition and machine translation, owing to its strength in processing and predicting events with long intervals and long delays in time series. However, most existing neural network acceleration chips cannot perform LSTM computation efficiently, because they are limited by memory bandwidth. ReRAM-based crossbars, on the other hand, can process matrix-vector multiplication efficiently due to their processing-in-memory (PIM) characteristic. However, a software tool for broad architectural exploration and end-to-end evaluation of ReRAM-based LSTM acceleration is still missing. This paper proposes a simulator for ReRAM-based LSTM neural network acceleration and a corresponding training algorithm. The main features (including imperfections) of ReRAM devices and circuits are reflected by the highly configurable tool, and the core computation of the simulation can be accelerated by general-purpose graphics processing units (GPGPUs). Moreover, the core component of the simulator has been verified against the corresponding circuit simulation of a real chip design. Within this framework, architectural exploration and comprehensive end-to-end evaluation can be achieved.
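As a rough illustration of the crossbar operation the abstracts describe, the sketch below models an ideal ReRAM crossbar performing matrix-vector multiplication: weights are mapped onto device conductances (here with a hypothetical linear mapping and a differential positive/negative pair), input activations are applied as read voltages, and the column currents give the dot products by Ohm's and Kirchhoff's laws. All names and constants are illustrative, and device non-idealities are ignored.

```python
import numpy as np

def crossbar_mvm(conductance, voltages):
    """Ideal crossbar: each output-column current is the input voltage
    vector dotted with that column's conductances (Ohm's law, summed
    along the column by Kirchhoff's current law)."""
    return voltages @ conductance

# Hypothetical 3x2 weight matrix mapped linearly onto conductances.
weights = np.array([[0.2, -0.5],
                    [0.8,  0.1],
                    [-0.3, 0.4]])
g_max = 1e-4                       # max device conductance (S), illustrative
w_max = np.abs(weights).max()
# Differential pair: positive weights on one array, negative on another.
g_pos = np.where(weights > 0,  weights, 0.0) / w_max * g_max
g_neg = np.where(weights < 0, -weights, 0.0) / w_max * g_max

v = np.array([0.1, 0.2, 0.05])     # read voltages (V), illustrative
i_out = crossbar_mvm(g_pos, v) - crossbar_mvm(g_neg, v)

# The differential column currents are proportional to weights.T @ v.
assert np.allclose(i_out, v @ weights / w_max * g_max)
```

In a real design this analog result would pass through ADCs, and the simulator described here would additionally model device variation and circuit imperfections rather than the ideal linear behavior assumed above.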
① This conclusion was drawn by comparison with the circuit design of a real chip.