Design of a Configurable and Extensible Media Processor
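Abstract
This dissertation studies processor architecture and SoC design for multi-standard video, and proposes a configurable and extensible media processor design. The design is centered on an embedded RISC processor and a video processing engine, complemented by major IP blocks such as DMA, and interconnected through the AMBA bus and dedicated channels to form a tightly coupled, heterogeneous multi-core, programmable video processing SoC platform.
     The dissertation carries out valuable exploratory research in the following three areas: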
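     1) Design of a configurable and extensible embedded RISC processor, in particular the cache and MMU of the memory subsystem.
     Based on the instruction set architecture of the 32-bit embedded processor CK-Core, the configurable and extensible embedded RISC processor CK520 was designed and implemented. For the memory subsystem, an online-configurable cache based on way combining and a fully synthesizable MMU with a two-level TLB structure are proposed. The micro-architecture is parameterized and offers a set of configurable options, including instruction and data cache size, associativity and replacement policy, the number of entries in each TLB level of the memory management unit (MMU), and branch prediction. The data path can also be extended with custom acceleration instructions for a specific application, or a programmable accelerator can be attached through the coprocessor interface. CK520 thus offers the good programmability of a RISC processor while gaining adaptability and efficiency for different applications through configuration and extension.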
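     2) EDO-SIMD, a single-instruction multiple-data instruction set architecture with embedded data organization for video applications, and the design of a video processing engine based on it.
     Analysis of video applications and algorithms yields a set of characteristics and algorithm kernels of video applications. Closer analysis of these kernels reveals regular data organization patterns such as sub-matrix, linear sequential, butterfly interleave, broadcast and delay skew. Traditional SIMD instruction sets incur high overhead for these data organizations, which has become a bottleneck for video processing performance, so the EDO-SIMD instruction set architecture for video processing was designed. Unlike the media extensions of general-purpose processors, EDO-SIMD is not an extension of a RISC instruction set but an independent SIMD processor design optimized for video applications, combining the strengths of media extensions with many new features for efficient video processing. Its advantages include: programmability and flexibility to support continually emerging video standards; embedded data organization instructions that achieve efficient video processing by fusing data organization with computation; a compact instruction set suited to low-cost, low-power embedded systems; and the ability to be extended for specific applications and configured according to performance, cost and other targets.
     In the design of the video processing engine based on EDO-SIMD, parameterized and modular design principles are applied to provide a scalable vector length. A 32-bit data ALU/MAC unit serves as the 1-way module from which the data path is built, and 1-, 2- or 4-way implementations can be chosen according to the computing performance an application requires. Vector permutation networks at operand read and at write-back to on-chip data storage implement the embedded data organization instructions. In the ALU, multiplier and related units, strategies such as gated bits and splitting implement SIMD instructions for Byte, Half Word and Word precision. The on-chip data storage uses byte-addressable double data-buffers, which supports unaligned memory accesses and lets DMA move data concurrently.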
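     3) Design of the video processing SoC platform.
     With an SoC interconnect based on the AMBA bus and dedicated channels between the main processing engines, the embedded RISC processor and the EDO-SIMD video processing engine form the core and, together with peripherals such as DMA and a memory controller, make up a basic prototype of a heterogeneous multi-core video processing SoC platform. The platform can effectively exploit data-level, instruction-level and task-level parallelism and delivers relatively high video processing performance. The RISC processor can use the dedicated channel to schedule tasks on the video processing engine and to configure and start DMA efficiently in a remote function call mode. The platform is configurable and extensible: its parameters can be tuned to application requirements to obtain a high-performance, low-power application solution.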
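     The research approach, in which application, algorithm, architecture and VLSI implementation drive and validate one another, and the configurable, extensible, parameterized and modular design method run through the entire research content and process, and are of some reference value for embedded processor design and SoC development.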
Video applications are computationally intensive and stretch the capabilities of current embedded processors. This dissertation presents the architecture design and VLSI implementation of a configurable and extensible media processor that supports multi-standard video applications. In the design, an embedded RISC core and an EDO-SIMD video processing engine are integrated with DMA and other IP blocks via the AMBA bus and a dedicated communication channel, forming a high-performance, low-power heterogeneous multi-core platform. The research explores the following three aspects:
     First, the design of a configurable and extensible embedded RISC processor, especially the cache and memory management unit (MMU) of the memory subsystem.
     The RISC processor CK520 is designed on the 32-bit CK-CORE instruction set architecture. For the memory subsystem, an online-configurable cache based on way combining and an MMU micro-architecture with a 2-level TLB are proposed. The micro-architecture and memory subsystem of CK520 are parameterized: instruction and data cache size, way associativity, replacement scheme, MMU TLB size, branch prediction and other parameters are configurable. Application-specific instructions such as MAC can be implemented by extending the data path of the basic core, while a programmable accelerator can be attached through the coprocessor interface. CK520 therefore provides traditional programmability as well as adaptability and efficiency through configuration and extension.
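As a rough illustration of what configurable cache size and associativity mean for address decoding, the following Python sketch splits a 32-bit address into tag, set index and byte offset for a given configuration. The `CacheConfig` class, its field names and the 32-byte line size are assumptions made for this example; the abstract does not give CK520's actual parameters or the details of its way-combining logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical cache parameters; all values must be powers of two."""
    size_bytes: int        # total capacity
    ways: int              # set associativity
    line_bytes: int = 32   # assumed line size, not taken from the abstract

    def __post_init__(self):
        for name in ("size_bytes", "ways", "line_bytes"):
            value = getattr(self, name)
            if value <= 0 or value & (value - 1):
                raise ValueError(f"{name} must be a positive power of two")
        if self.size_bytes < self.ways * self.line_bytes:
            raise ValueError("cache too small for this associativity")

    @property
    def sets(self) -> int:
        return self.size_bytes // (self.ways * self.line_bytes)

    def split(self, addr: int):
        """Return (tag, set_index, byte_offset) for an address."""
        offset_bits = self.line_bytes.bit_length() - 1
        index_bits = self.sets.bit_length() - 1
        offset = addr & (self.line_bytes - 1)
        index = (addr >> offset_bits) & (self.sets - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset


four_way = CacheConfig(size_bytes=8192, ways=4)
two_way = CacheConfig(size_bytes=8192, ways=2)
```

At a fixed capacity, halving the number of ways doubles the number of sets, so one more address bit moves from the tag into the set index. A cache that is reconfigurable at run time has to cope with that shift in hardware.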
     Second, the EDO-SIMD (embedded data organization SIMD) instruction set architecture and the design of a video processing engine for accelerating multi-standard video processing.
     The features of video applications and algorithms were summarized by analyzing benchmarking results of typical video applications. The operands were found to follow sub-matrix, sequential, butterfly, broadcast and delay-line skew addressing patterns. In traditional SIMD media instruction set architectures, the overhead of organizing these operands is an obstacle to improving performance and hardware efficiency. A SIMD instruction set architecture with embedded data organization (EDO-SIMD) is therefore proposed.
     EDO-SIMD is designed not as an extension to a RISC CPU but as a standalone processor architecture optimized for video processing. The features of the EDO-SIMD ISA are: programmability and flexibility to support multi-standard video codecs; embedded data organization instructions and video-specific instructions to boost video processing performance; simplicity for low-cost, low-power embedded systems; and scalability and configurability to adapt to application requirements.
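To make the idea of fusing data organization with computation concrete, here is a small Python sketch that models two of the patterns named above, broadcast and butterfly, as index vectors applied to an operand before an element-wise add. The function names and the exact index patterns are illustrative assumptions; the abstract does not define the EDO-SIMD instructions themselves.

```python
def permute(vector, index):
    """Gather: output lane i takes vector[index[i]]."""
    return [vector[i] for i in index]


def broadcast_index(lanes, source_lane):
    """Every output lane reads the same source lane."""
    return [source_lane] * lanes


def butterfly_index(lanes, stride):
    """Each lane reads its partner at distance `stride` (a power of two)."""
    return [i ^ stride for i in range(lanes)]


def fused_add(a, b, index_b):
    """Add a to a reorganized b in one step, with no separate shuffle pass."""
    return [x + y for x, y in zip(a, permute(b, index_b))]


data = [1, 2, 3, 4]
butterfly_sum = fused_add(data, data, butterfly_index(4, 2))
scaled_by_lane0 = fused_add(data, data, broadcast_index(4, 0))
```

In a conventional SIMD extension the reorganization of `b` would typically cost one or more extra shuffle instructions before the add; the sketch only shows the effect of folding that step into the operand read.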
     In the micro-architecture design of the video processing engine, a parametric and modular design methodology is applied to support a configurable vector length: a 32-bit data path is designed as one way, and the full data path can be tiled up to LMAX/4 ways according to application performance requirements. Vector permutation networks are inserted into the operand read ports and the result store path to support embedded data organization instructions. In the ALU and multiplier design, gated bits and a split-module scheme are used to support operations at byte, half-word and word precision. The on-chip data memory is a byte-addressable double data buffer structure, so that unaligned loads and stores are supported and the DMA engine can prepare data concurrently with computation.
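One common way to let a single 32-bit adder serve byte, half-word and word lanes is to stop carries at lane boundaries. The Python sketch below is a software model of that idea, offered as an assumption about what the gated bits achieve rather than as the thesis's circuit; the function name `packed_add` is hypothetical.

```python
def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words as independent lanes, wrapping within each lane."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    # Mask with only the most significant bit of every lane set.
    msb = 0
    for position in range(lane_bits - 1, 32, lane_bits):
        msb |= 1 << position
    low = ~msb & 0xFFFFFFFF
    # With the lane MSBs cleared, a carry can never cross a lane boundary.
    partial = (a & low) + (b & low)
    # Restore each lane's MSB as a sum bit; its carry-out is discarded.
    return (partial ^ ((a ^ b) & msb)) & 0xFFFFFFFF


a, b = 0x01FF7F80, 0x01010180
as_bytes = packed_add(a, b, 8)        # four 8-bit lanes
as_halfwords = packed_add(a, b, 16)   # two 16-bit lanes
as_word = packed_add(a, b, 32)        # one 32-bit lane
```

The same inputs give different results per precision because the byte lanes `0xFF + 0x01` and `0x80 + 0x80` wrap to zero instead of carrying into their neighbours.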
     Finally, the design and optimization of a video-specific SoC platform.
     In the video SoC platform, the embedded RISC processor and the EDO-SIMD video processing engine are integrated with DMA, LCDC and other IP blocks over the AMBA SoC interconnect and a dedicated communication channel. The platform can exploit data-level, instruction-level and task-level parallelism. The RISC processor can schedule tasks on the video engine and configure and kick off DMA transactions efficiently in a remote procedure call mode. The platform can be configured and extended to meet application requirements and yields a high-performance, low-power video system solution.
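The remote procedure call mode can be pictured as the RISC core posting a small call message on the dedicated channel and the engine dispatching it to a registered kernel. The Python sketch below models that flow with plain FIFOs. `Channel`, `EngineModel`, the message layout and the `sad` kernel are all hypothetical names chosen for illustration; the abstract does not describe the real channel protocol.

```python
from collections import deque


class Channel:
    """FIFO standing in for one direction of the dedicated channel."""

    def __init__(self):
        self._fifo = deque()

    def send(self, message):
        self._fifo.append(message)

    def receive(self):
        # Returns None when the channel is empty.
        return self._fifo.popleft() if self._fifo else None


class EngineModel:
    """Hypothetical video engine: runs one queued call per step()."""

    def __init__(self, requests, replies):
        self.requests = requests
        self.replies = replies
        self.kernels = {}

    def register(self, name, function):
        self.kernels[name] = function

    def step(self):
        message = self.requests.receive()
        if message is None:
            return False
        call_id, name, args = message
        self.replies.send((call_id, self.kernels[name](*args)))
        return True


def sad(block_a, block_b):
    """Sum of absolute differences, used here only as an example kernel."""
    return sum(abs(x - y) for x, y in zip(block_a, block_b))


requests, replies = Channel(), Channel()
engine = EngineModel(requests, replies)
engine.register("sad", sad)

# Host (RISC) side: post the call, let the engine run, collect the reply.
requests.send((0, "sad", ([1, 5, 3], [2, 2, 2])))
engine.step()
result = replies.receive()
```

Tagging each message with a call identifier lets the host post several calls before reading any reply, which is one way task-level parallelism between the two cores could be expressed.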
     The research methodology, which treats application, algorithm, architecture and VLSI implementation as a seamless whole, together with the configurable, extensible, parametric and modular design methodology, is valuable for embedded processor and SoC design.
