Research on Key Technologies for the Design and Implementation of Embedded Heterogeneous Multi-core Processors

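English title: Research on the Design and Implementation Techniques of Embedded Heterogeneous Multiprocessor
Author: Yue Hong (岳虹)
Thesis level: Doctoral
Discipline: Computer Science and Technology
Keywords (translated from Chinese): heterogeneous multi-core processor ; customized processor ; multimedia processing ; subword parallelism ; DCT/IDCT transform ; instruction set customization ; VLSI design ; embedded processor
English keywords: heterogeneous multiprocessor ; application-specific instruction set processor ; multimedia processing ; subword parallelism ; DCT/IDCT transform ; instruction customization ; VLSI design ; embedded processor
Degree year: 2006
Advisors: Wang Zhiying (王志英) ; Dai Kui (戴葵)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2006-10-01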

Abstract

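The growth of embedded applications demands embedded microprocessors that combine high performance, low power consumption, a scalable architecture, low cost, and a short design cycle. Embedded microprocessor architectures and design methods therefore face great challenges. Under current integrated-circuit process technology, research on the design and key technologies of embedded heterogeneous multi-core processors, built on application-specific customized processor design techniques, is an important direction in this field, and its in-depth study has significant theoretical and practical value.
     In studying embedded heterogeneous multi-core processor architecture, this thesis draws on application-specific customized processor design techniques and proposes an embedded heterogeneous multi-core processor architecture built on customized processor cores, aiming for the best design trade-off among real-time performance, design flexibility, cost, and power consumption. Taking multimedia applications as an example, the thesis also examines the core technologies for designing and implementing this architecture, chiefly construction of the design and development environment, characterization of application programs, instruction set customization, and design of customized functional units. On this basis, a high-performance embedded heterogeneous multi-core processor chip for multimedia applications was designed and implemented, validating the research presented here.
     The main research results of this thesis are as follows:
     1. A scalable embedded heterogeneous multi-core processor architecture based on customized processor cores is proposed. The architecture combines a high-performance general-purpose embedded processor core with multiple customized processor cores that can be tailored to specific applications. The customized cores, based on the transport triggered architecture (TTA), are highly scalable, regular, and modular, and their hardware can be designed and implemented automatically in a hierarchical manner.
     2. Based on the proposed architecture, a retargetable architecture simulation technique, an instruction set customization algorithm, and an automatic hardware generation technique are proposed for its design and implementation, and a corresponding design and development environment is built on them. The environment effectively shortens the design cycle and gives strong support to the design, implementation, testing, and verification of such processors. Using this environment, the thesis quantitatively analyzes the characteristics and workloads of multimedia applications and obtains statistical conclusions that guide the design of embedded heterogeneous multi-core processors for multimedia applications.
     3. A distributed DCT/IDCT customized functional unit architecture based on a parallel adder array is proposed, designed, and implemented. The architecture uses dynamic ranging and data partitioning to convert multiplications into table lookups and additions, which are combined with simple shifts to produce the final result. It therefore needs only a small number of narrow adders, shifters, and a small ROM to perform the DCT/IDCT while still keeping the results highly accurate. Its structure is also regular, which favors efficient hardware implementation.
     4. To address the computational characteristics and special computing needs of multimedia applications, subword-parallel instructions and elementary-function instructions are proposed and customized, and the functional units supporting them are designed and implemented: a subword-parallel ALU, a multi-mode subword-parallel multiplier, and an elementary-function unit based on the CORDIC algorithm. These customized functional units substantially improve the real application performance of the multimedia-oriented embedded heterogeneous multi-core processor, delivering high program execution performance at a small cost in chip area.
     5. Building on the work above, the EHMP-01 chip, an embedded heterogeneous dual-core processor for multimedia applications, is designed and implemented. The key design and implementation technologies of the processor are studied systematically, including microarchitecture design, memory system design, peripheral interface design, logic design and VLSI implementation, and chip testing and verification. The processor was taped out in a 0.18 µm process; the total chip area is 4.8 mm × 4.8 mm, and the clock frequency can reach 300 MHz. At 300 MHz the dynamic power consumption is only 670 mW. Operation of the fabricated chip shows that it works stably and reliably.
     The successful tape-out of the EHMP-01 embedded heterogeneous dual-core processor chip effectively validates the architecture, design methodology, and series of key technologies proposed in this thesis for embedded heterogeneous multi-core processors based on customized processor cores.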
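     Result 3 above describes replacing multiplications with table lookups, additions, and shifts. As a rough illustration of that general idea only, the following sketch uses the classic ROM-based form of distributed arithmetic for a fixed-coefficient inner product, which is the core operation of one DCT output. It is not the thesis's adder-array architecture and it omits dynamic ranging and data partitioning; the function names, the integer coefficients, and the 8-bit input width are assumptions made for the example.

```python
def build_lut(coeffs):
    """Build the ROM: entry `addr` holds the sum of the coefficients
    whose index bit is set in `addr` (hypothetical helper)."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(lut, xs, bits):
    """Compute sum(c_k * x_k) for signed `bits`-bit integers `xs`
    using only table lookups, additions, and shifts."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one ROM address.
        # For negative Python ints, (x >> b) & 1 is the two's-complement bit.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The sign bit carries weight -2**(bits - 1) in two's complement.
        if b == bits - 1:
            acc -= term
        else:
            acc += term
    return acc


# Assumed example values: four integer (fixed-point) coefficients, 8-bit inputs.
COEFFS = [23, -17, 9, 4]
LUT = build_lut(COEFFS)
SAMPLES = [-128, 127, -1, 5]
RESULT = da_inner_product(LUT, SAMPLES, 8)
```

     Here `RESULT` equals the direct sum of products of `COEFFS` and `SAMPLES`. The ROM grows as 2^n with the number of inputs n, which is one reason a hardware design would split the inputs into smaller groups.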
The evolution of embedded applications requires advanced embedded microprocessors (EMPs) to offer high performance, low power, architectural scalability, low design cost, and a short design cycle (time to market). EMP architecture and design methodology therefore face great challenges. Under current integrated circuit manufacturing processes, research on the design and implementation techniques of embedded heterogeneous multiprocessors based on the design methodology of the Application-Specific Instruction-set Processor (ASIP) is an important area of EMP research, and in-depth study of it has great theoretical and practical significance.
     In this thesis, we apply the ASIP design methodology to the design of embedded heterogeneous multiprocessors and propose a new embedded heterogeneous multi-ASIP processor architecture that aims for the best trade-off among real-time performance, design flexibility, design cost, and energy consumption. Taking multimedia applications as a practical example, we focus on the design and implementation techniques of the new multi-ASIP processor architecture, including design space exploration, analysis of application characteristics, instruction customization, and the design of customized function units. Based on this work, we developed EHMP-01, a high-performance embedded heterogeneous dual-core processor for multimedia applications.
     The primary contributions of this thesis are as follows:
     1. We propose a scalable embedded heterogeneous multi-ASIP processor architecture. The architecture, which combines one high-performance general-purpose embedded processor with multiple ASIPs, can be scaled and customized for different applications. The ASIPs are based on the transport triggered architecture, which makes them highly scalable, and their regular, modular design allows them to be generated automatically.
     2. We propose an automatic implementation methodology for the heterogeneous multi-ASIP processor and establish a design and performance evaluation environment. The environment supports the design, implementation, testing, and verification of the heterogeneous multi-ASIP processor. Using this environment, we quantitatively analyze the characteristics and workloads of multimedia applications to obtain statistics that guide the design of multi-ASIP processors for multimedia applications.
     3. We propose a new distributed DCT/IDCT architecture based on parallel adders. The architecture uses dynamic ranging and data partitioning techniques, and multiplications are transformed into table lookup, add, and shift operations. As a result, the hardware needs only a small number of low-cost adders, shifters, and ROM while maintaining high accuracy. The regular structure also simplifies hardware implementation.
     4. To meet the special computation demands of multimedia applications, we propose customized subword-parallel instructions and elementary function instructions, and we design and implement the corresponding function units: customized subword-parallel function units and elementary function units based on the CORDIC algorithm. These function units give the embedded heterogeneous multi-ASIP processor a high speedup on multimedia applications at low area cost.
     5. Based on the above studies, we designed and implemented EHMP-01, an embedded heterogeneous dual-core SoC chip. The design of the microarchitecture, memory subsystem, and peripheral interfaces is discussed, and the logic design, VLSI design, verification, and testing of the chip are also covered in full. EHMP-01 was implemented in a 0.18 µm process. The die area is about 4.8 mm × 4.8 mm, and the chip can operate at 300 MHz with an average power dissipation of 670 mW.
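     The subword-parallel instructions named in contribution 4 operate on several narrow values packed into one machine word. The sketch below imitates that behavior in software for one operation, a lane-wise 8-bit addition on a 32-bit word. It is only meant to show the principle; the lane width, the word width, the wrap-around (modulo 256) semantics, and all names are assumptions, and the thesis implements such operations in a hardware ALU rather than with masks.

```python
MASK_LOW7 = 0x7F7F7F7F  # low 7 bits of each 8-bit lane
MASK_HIGH = 0x80808080  # top bit of each 8-bit lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def padd8(a, b):
    """Add four packed 8-bit lanes modulo 256 with no carry between lanes."""
    # Adding only the low 7 bits keeps each lane's carry inside that lane:
    # it lands in the lane's own top bit instead of spilling into the next lane.
    low = (a & MASK_LOW7) + (b & MASK_LOW7)
    # The true top bit is a7 XOR b7 XOR carry-in; `low` already holds the carry-in.
    return (low ^ ((a ^ b) & MASK_HIGH)) & 0xFFFFFFFF


# Assumed example: lanes that overflow wrap around independently.
SUM_LANES = unpack(padd8(pack([250, 1, 128, 255]), pack([10, 2, 128, 1])))
```

     With these inputs `SUM_LANES` is `[4, 3, 0, 0]`: each lane wraps on its own, and an overflow in one lane does not disturb its neighbor.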
     The silicon implementation of the EHMP-01 processor verifies the effectiveness and correctness of the design methodology and the series of key techniques for embedded heterogeneous multi-ASIP processors proposed in this thesis.
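     Contribution 4 also mentions elementary function units based on the CORDIC algorithm. For readers unfamiliar with it, the following is a minimal fixed-point sketch of rotation-mode CORDIC computing sine and cosine with shifts and additions only. The iteration count, the 16-bit fractional format, and the function name are assumptions chosen for the example; nothing here reflects the actual parameters or structure of the EHMP-01 unit.

```python
import math

ITERATIONS = 16        # assumed number of micro-rotations
FRAC_BITS = 16         # assumed fixed-point fractional width
ONE = 1 << FRAC_BITS

# Precomputed constants a hardware unit would keep in a small table.
ANGLES = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
K_INV = round(ONE / GAIN)  # pre-scaling cancels the CORDIC gain


def cordic_sin_cos(theta):
    """Return (sin, cos) of `theta` in radians, for |theta| up to about 1.74,
    using only shifts, additions, and table lookups in the loop."""
    x, y, z = K_INV, 0, round(theta * ONE)
    for i in range(ITERATIONS):
        if z >= 0:
            # Rotate by +atan(2**-i); the right-hand side uses the old x and y.
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:
            # Rotate by -atan(2**-i).
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return y / ONE, x / ONE
```

     Calling `cordic_sin_cos(0.5)` gives values close to `math.sin(0.5)` and `math.cos(0.5)`; accuracy improves with more iterations and more fractional bits, at the cost of wider adders.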

