Research on the Microarchitecture of the Media Digital Signal Processor MediaDSP6410
Abstract
RISC/DSP is a highly cost-effective programmable solution for embedded media processing. The author took part in the development of MediaDSP6410 (MD6410), a media digital signal processor based on the RISC/DSP architecture, at the MediaProcessor Lab of the Department of Information Science and Electronic Engineering, Zhejiang University. As part of that work, this thesis focuses on the design of a 2-issue out-of-order superscalar microarchitecture and a dual-threaded extension.
     Benchmarking provides useful guidance for processor design: the requirements on the processor are derived from application needs, and parallelism is exploited at three levels. An 8-way SIMD extension maximizes the data-level parallelism in the kernels of video compression algorithms. Compound media processing instructions exploit instruction-level parallelism while keeping good code efficiency. Thread-level parallelism is exploited further by running scalar program sections and vectorizable sections in parallel as threads.
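     To make the data-level parallelism concrete, the sketch below emulates one 8-lane SIMD step in Python. The choice of kernel (sum of absolute differences, common in motion estimation) and the names `simd_abs_diff` and `sad` are assumptions made for illustration; the abstract does not name a specific kernel or instruction.

```python
LANES = 8  # 8-way SIMD width, as stated in the abstract


def simd_abs_diff(a, b):
    """Emulate one SIMD step: lane-wise |a - b| over LANES pixels."""
    assert len(a) == LANES and len(b) == LANES
    return [abs(x - y) for x, y in zip(a, b)]


def sad(block_a, block_b):
    """Sum of absolute differences, consuming LANES pixels per step.

    Returns (total, steps) so the step count can be compared with the
    one-pixel-per-step cost of a purely scalar loop.
    """
    assert len(block_a) == len(block_b) and len(block_a) % LANES == 0
    total = 0
    steps = 0
    for i in range(0, len(block_a), LANES):
        total += sum(simd_abs_diff(block_a[i:i + LANES], block_b[i:i + LANES]))
        steps += 1
    return total, steps


# Hypothetical 16-pixel rows from a current and a reference frame.
cur = [10, 20, 30, 40, 50, 60, 70, 80] * 2
ref = [12, 18, 30, 44, 45, 60, 75, 80] * 2
print(sad(cur, ref))  # (36, 2): 16 pixels handled in 2 emulated SIMD steps
```

     A scalar loop over the same rows would take 16 steps, which is the gap the SIMD extension is meant to close for such kernels.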
     Embedded processor design is constrained by area, power budget, and design and verification complexity, so an out-of-order superscalar processor of minimal complexity is designed to improve the performance of scalar code. A register renaming mechanism is proposed that combines a rename map table with an issue buffer that holds no operands. To simplify the design without sacrificing much performance, media registers and store operands are not renamed, so compound media instructions execute serially with the RISC (MIPS) instructions. The data hazard detection logic for compound media instructions is reworked to avoid the critical path caused by a global stall. Experiments show that the MD6410 pipeline reaches 300 MHz in TSMC 130 nm technology under worst-case conditions, and that scalar performance is 1.6 to 2 times that of the original design at an area cost of 3.3%.
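     The general idea of map-table renaming with an operand-free issue buffer can be sketched as follows. This is a generic textbook-style illustration, not the MD6410 design: the register counts, the free-list policy, and the names `RenameMap` and `IssueEntry` are all assumptions, and reclaiming physical registers at commit is omitted.

```python
from collections import deque, namedtuple

# An issue-buffer entry stores register tags only, never operand values.
IssueEntry = namedtuple("IssueEntry", "op dst_tag src_tags")


class RenameMap:
    """Map table from architectural to physical registers (hypothetical sizes)."""

    def __init__(self, num_arch=4, num_phys=8):
        # Initially architectural register i lives in physical register i.
        self.table = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, op, dst, srcs):
        """Rename one instruction; return None if no physical register is free."""
        if not self.free:
            return None  # a real pipeline would stall dispatch here
        # Sources are looked up before the destination is remapped.
        src_tags = tuple(self.table[s] for s in srcs)
        dst_tag = self.free.popleft()
        self.table[dst] = dst_tag
        return IssueEntry(op, dst_tag, src_tags)


rmap = RenameMap()
program = [
    ("add", 1, (2, 3)),  # r1 = r2 + r3
    ("add", 2, (1, 1)),  # r2 = r1 + r1  (reads the new r1, overwrites r2)
    ("add", 1, (3, 3)),  # r1 = r3 + r3  (writes r1 again)
]
issue_buffer = [rmap.rename(op, dst, srcs) for op, dst, srcs in program]
for entry in issue_buffer:
    print(entry)
```

     In this sketch the two writes to `r1` receive different physical tags (4 and 6), so the write-after-write dependence disappears, while the second instruction still reads tag 4 and keeps its true dependence on the first.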
     The multithreaded extension aims to exploit parallel algorithms and to improve processor resource utilization and instruction throughput. To maximize hardware utilization, a mapping between parallel algorithms and the multicore, multithreaded hardware architecture is proposed, and the microarchitectural design tradeoffs are discussed in detail. The decode stage is designed to support priority-based thread scheduling, the instruction issue logic takes the utilization of shared pipeline resources into account, and the interface between direct memory access (DMA) and the scratch-pad memory is improved. A non-blocking message-passing mechanism is proposed for thread synchronization, enabling flexible switching between multi-issue (superscalar) and multithreaded modes. Experiments show that the dual-threaded MD6410 design improves throughput by 26% to 35% at an area cost of 5.9%.
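     The abstract does not detail the synchronization mechanism, so the following is only a software analogue of non-blocking message passing: send and receive return immediately with a status flag instead of stalling the caller. The class name `Mailbox`, the queue depth, and the message strings are hypothetical.

```python
from collections import deque


class Mailbox:
    """Bounded message queue whose operations never block (hypothetical depth)."""

    def __init__(self, depth=2):
        self.depth = depth
        self.queue = deque()

    def try_send(self, msg):
        """Enqueue msg and return True, or return False at once if full."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append(msg)
        return True

    def try_recv(self):
        """Return (True, msg) if a message is waiting, else (False, None)."""
        if not self.queue:
            return False, None
        return True, self.queue.popleft()


mbox = Mailbox(depth=2)
# A producer thread posts three notifications; the third finds the queue full.
sent = [mbox.try_send(m) for m in ("blk0", "blk1", "blk2")]
# A consumer thread polls three times; the third poll finds the queue empty.
received = [mbox.try_recv() for _ in range(3)]
print(sent)      # [True, True, False]
print(received)  # [(True, 'blk0'), (True, 'blk1'), (False, None)]
```

     Because a failed send or receive returns control immediately, the calling thread can keep doing other work and retry later, which is the property that distinguishes this style from blocking synchronization.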
