Design and Implementation of an Efficient On-Chip Level-2 Cache Controller for a DSP (DSP高效片内二级Cache控制器的设计与实现)

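English title: The Design and Implementation of High Performance Level Two Cache Controller on DSP Chip
Author: Liu Sheng (刘胜)
Thesis level: Master's
Discipline: Electronic Science and Technology
Chinese keywords: “Cache+RAM”结构 ; 缺失流水线 ; 写缓冲 ; 写合并 ; 跨边界访问 ; EDMA访问 ; 数据一致性
English keywords: "Cache and RAM" structure ; miss pipeline ; write buffer ; write merge ; nonaligned access ; EDMA service ; memory consistency
Degree year: 2008
Advisor: Chen Shuming (陈书明)
Discipline code: 080903
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2007-11-01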

Abstract

Digital signal processors (DSPs) have developed rapidly and come into wide use in recent years. The "Cache+RAM" on-chip memory structure has become an indispensable technique in high-performance DSP design, and the level-2 (L2) cache controller is a key component of the "two-level Cache+RAM" memory structure. How to design and implement an L2 cache controller that is correct, efficient, and able to meet high-frequency requirements is therefore a problem worth studying.
     FT-CXX is a high-performance 32-bit fixed-point DSP under in-house development. It uses a very long instruction word (VLIW) architecture and can issue up to 8 instructions per cycle. The target CPU frequency is 600 MHz, the peripheral frequency is 300 MHz, and the total L2 capacity is 1 MB. This thesis studies the design and implementation of its L2 controller. The main work and contributions are as follows.
     First, general cache design methods are analyzed, and the performance requirements and implementation techniques of caches in mainstream DSP chips are surveyed. On this basis, the Cache/SRAM structure of the FT-CXX L2 is designed and implemented: the organization of the L2 data banks and tag banks and the address access rules are defined, and the mapping rule, replacement algorithm, and write policy of the L2 cache are designed and implemented.
     Second, because the L2 is large and its memory banks can run at only half the CPU frequency, several measures are taken to optimize the handling of level-1 cache (L1D and L1P) misses: 1) a miss pipeline is designed, so that in the ideal case, once the pipeline is full, each additional L1 miss costs only two cycles on average; 2) an L1D write-miss buffer queue, 64 bits wide and 4 entries deep with support for write merging, is placed between L1D and L2, which effectively reduces the waiting time of L1D write misses; 3) a solution to the boundary-crossing (nonaligned) access problem is proposed that is efficient, has low hardware overhead, and adds no extra burden on the compiler.
     Third, an efficient mechanism for handling EDMA (enhanced direct memory access) accesses to the L2 SRAM is designed and implemented. It exploits the potential parallelism among EDMA accesses by combining several techniques: bursts of EDMA requests (up to 8 consecutive read requests or 4 consecutive write requests), pipelining of snooping and data sending, reducing the number of snoops based on snoop history, and reducing the number of L2 data-bank accesses through bypassing and merging. As a result, EDMA transfer efficiency is greatly improved: one data access takes only 2 to 3 cycles on average, a speedup of more than 2.0 over a conventional serial path.
     Finally, an efficient data-consistency maintenance mechanism is designed and implemented. On the one hand, a rich set of cache-control register operations is provided; on the other hand, snoops and data write-backs are classified and each class is handled separately. Experimental results show that this mechanism reduces the overhead of typical system requests by more than 10%.
     In addition, the design is verified fairly systematically, and logic synthesis and optimization are carried out. In the SMIC 0.13 μm process, the design meets the required operating frequencies of 600 MHz for the interface to the level-1 caches and 300 MHz for the internal logic.
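The write-merging behavior of the L1D write-miss buffer mentioned in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the 64-bit entry width and the depth of 4 come from the abstract, while the class name `WriteMergeBuffer`, the per-byte enable mask, merging with any pending entry that covers the same 64-bit block, the stall-when-full behavior, and oldest-first draining are hypothetical choices made for illustration and are not taken from the thesis.

```python
from collections import OrderedDict

ENTRY_BYTES = 8   # 64-bit entry width (from the abstract)
DEPTH = 4         # queue depth (from the abstract)


class WriteMergeBuffer:
    """Toy model of a write-miss buffer with write merging (policy details assumed)."""

    def __init__(self):
        # Maps 8-byte-aligned base address -> (data bytearray, byte-enable mask).
        # OrderedDict keeps allocation order so entries drain oldest first.
        self.entries = OrderedDict()

    def write(self, addr, data):
        """Buffer a write; return False if the caller would have to stall."""
        base = addr - addr % ENTRY_BYTES
        offset = addr - base
        if offset + len(data) > ENTRY_BYTES:
            raise ValueError("write crosses a 64-bit entry boundary")
        if base not in self.entries:
            if len(self.entries) == DEPTH:
                return False  # buffer full and no pending entry to merge with
            self.entries[base] = (bytearray(ENTRY_BYTES), 0)
        buf, mask = self.entries[base]
        buf[offset:offset + len(data)] = data      # newer bytes overwrite older ones
        for i in range(len(data)):
            mask |= 1 << (offset + i)              # mark the bytes that are valid
        self.entries[base] = (buf, mask)           # updating a key keeps its queue position
        return True

    def drain(self):
        """Pop the oldest entry as (base, data, mask) for L2, or None if empty."""
        if not self.entries:
            return None
        base, (buf, mask) = self.entries.popitem(last=False)
        return base, bytes(buf), mask
```

In this model, two 4-byte writes to addresses 0x1000 and 0x1004 occupy a single entry and leave the buffer as one 64-bit transfer, which is the kind of saving write merging is meant to provide.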

