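Research on Key Architectural Techniques for Terascale Embedded Computing

English title: Key Techniques Research on Terascale Embedded Computing
Author: Yang Qianming (杨乾明)
Degree level: Doctoral
Discipline: Computer Science and Technology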
Keywords: Terascale ; Data Memory Hierarchy ; Partial-connected Crossbar ; Stream Architecture Template ; VLIW ; Dynamic Partial Reconfiguration
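Degree year: 2012
Advisor: Zhang Chunyuan (张春元)
Discipline code: 0812
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2012-10-01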

Abstract

As communication standards and coding algorithms continue to evolve, high-performance embedded applications place ever higher demands on processor performance and energy consumption, and applications that need on the order of a trillion operations per second (terascale) are beginning to emerge. The rapid development of very-large-scale integration (VLSI) technology makes it possible to build energy-efficient embedded processors that meet these demands. However, turning the potential of billions of transistors into actual computing capability for terascale embedded applications remains a very challenging task. Traditional embedded processors use simple structures and achieve very low power consumption, but their performance falls far short of what future embedded applications need. High-performance microprocessors such as GPUs and MIC integrate billions of transistors on a single chip with many-core designs and can deliver more than 1 Tops, but because they rely on conventional techniques such as superscalar execution, simultaneous multithreading, and shared coherent caches, they consume a great deal of power and are far from meeting the energy requirements of future embedded applications. Against this background, this thesis takes "Key Techniques Research on Terascale Embedded Computing" as its subject.
     This thesis studies a range of energy-efficient architecture techniques, covering new data memory hierarchy design, functional-unit interconnect design for fully distributed VLIW, ultra-low-power processor core design, and reconfigurable computing based on stream architecture templates. The main contributions and innovations are as follows:
     1. A multi-level granularity-matched register hierarchy (MGR: Multi-level Granularity-matched Register Hierarchy) is proposed. MGR organizes data access and processing in embedded applications into three levels. The outermost level is coarse-grained streaming access, which is strongly sequential and predictable. The middle level is block access: one block is fetched at a time, access is highly predictable, and dependencies between blocks are weak. The innermost level is access to data within a block, which is flexible and somewhat random. For these three levels, MGR uses a frame buffer, an enhanced register file, and a tiny pixel register file, respectively, to capture the data locality of each level. Each level of storage therefore only has to implement its own function, which keeps the hardware at every level simple and efficient. Experimental results show that, compared with other typical memory hierarchies, MGR reduces energy consumption by 53% to 62% while performance stays the same or drops only slightly.
     2. A partially connected functional-unit interconnect for fully distributed VLIW is proposed. In a fully distributed VLIW, a full crossbar between functional units suffers from long delay, high power consumption, and poor scalability. The thesis first analyzes how embedded applications use the full crossbar and identifies several typical communication patterns. It then proposes several partially connected (sparse) crossbar structures for these patterns and builds a VLSI model of them. Finally, it analyzes in depth how each structure affects delay, area, power, and program performance. Experimental results show that, compared with a full crossbar, a partially connected crossbar greatly reduces hardware cost with only a slight loss of performance. As the number of functional units in the VLIW grows, the partially connected crossbar also scales better.
     3. An ultra-low-power embedded processor core is designed. Large-scale multi-core parallelism built from many simple small cores and a few complex big cores is becoming the mainstream way to improve the energy efficiency of embedded processors. For the small-core role, the Smart Core processor is proposed. Following a design philosophy of explicit parallelism and accurate computing, Smart Core uses VLIW parallel execution, a multi-level data memory hierarchy (streaming memory + hierarchical register file + tiny register file), and asymmetric fully distributed instruction registers to reduce the energy of the instruction pipeline, data supply, and instruction supply, respectively. Preliminary results show that Smart Core achieves an energy efficiency 25x that of a traditional embedded RISC processor. In a 40 nm CMOS process, a many-core system built from Smart Cores can deliver more than 1 Tops on a single chip while keeping efficiency at or above 100 Gops/W.
     4. A multi-granularity reconfigurable DSP based on a stream architecture template (MGR-SAT: A Multi-Granularity Reconfigurable DSP based on Stream Architecture Template) is presented. MGR-SAT combines stream processing, dynamic reconfiguration, and platform-based design. In hardware it consists of a scalar core, a stream processing core, and the associated external interfaces. The stream processing core is a dynamically configurable unit made up of a coarse-grained reconfigurable unit and a fine-grained reconfigurable unit, and it is used for computation acceleration. MGR-SAT as a whole runs in a stream-processing manner: the scalar core configures the stream processing core and starts its execution and data transfers. Experimental results show that MGR-SAT has clear performance and power advantages over current typical processing platforms.
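     The headline figures for MGR and Smart Core can be related with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the rounded numbers quoted in the abstract (1 Tops, 100 Gops/W, 53% to 62% energy reduction), not measurements from the body of the thesis.

```python
def chip_power_watts(throughput_gops: float, efficiency_gops_per_watt: float) -> float:
    """Power needed to sustain a throughput at a given energy efficiency.

    Gops divided by Gops/W gives watts.
    """
    if efficiency_gops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_gops / efficiency_gops_per_watt


def remaining_energy_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline energy left after a percentage reduction."""
    if not 0.0 <= reduction_percent <= 100.0:
        raise ValueError("reduction must be between 0 and 100 percent")
    return 1.0 - reduction_percent / 100.0


# Smart Core many-core system: 1 Tops = 1000 Gops at 100 Gops/W (assumed exact here).
power_budget = chip_power_watts(1000.0, 100.0)

# MGR: energy left relative to the baseline hierarchy at both ends of the quoted range.
mgr_energy_range = (remaining_energy_fraction(62.0), remaining_energy_fraction(53.0))

print(f"Implied chip power: {power_budget:.1f} W")
print(f"MGR energy relative to baseline: {mgr_energy_range[0]:.2f} to {mgr_energy_range[1]:.2f}")
```

     Under these assumptions, 1 Tops at 100 Gops/W corresponds to about 10 W, and because the abstract states the efficiency as a lower bound, this is an upper bound on power at that throughput. A 53% to 62% reduction leaves roughly 38% to 47% of the baseline energy.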

引文

[1] Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti andKrisztian Flautner. AnySP: Anytime Anywhere Anyway Signal Processing [C]. InProceedings of the36thannual International Symposium on ComputerArchitecture (ISCA), Austin, Texas, USA,2009:20–24.
    [2] O. Silven and K. Jyrkka. Observations on power-efficiency trends in mobilecommunication devices [J]. EURASIP Journal of Embedded Systems,2007(1):17–17.
    [3] M. Who, et al. The next generation challenge for software defined radio [C]. InProceedings of the7thInternational Conference on Systems, Architectures,Modeling, and Simulation,2007:343-354.
    [4] Davis M.E.. Space Based Radar Moving Target Detection Challenges [C]. IEEERADAR Conference,2002:143-147.
    [5] S.Farsiu, D.Robinson, M.Elnd and P.Milanfar. Advances and challenges insuper-resolution [J]. International Journal of Imaging Systems and Technology,2004,14(2):47-57.
    [6] Robert Bond. High Performance DoD DSP Applications [R].2003Workshop onStreaming Systems,2003.
    [7] Henry S. Kenyon. Unmanned Combat Aircraft Program Takes Off [J]. Signal,2004,58(11):49-52.
    [8] C. G. Masi. Machine Vision Comes of Age: engineers find machine vision canreplace more complex point-sensor-based systems [J]. Control Engineering,2008,55(5):24-34.
    [9] Abraham, D.A. Array Modeling of Active Sonar Clutter [J]. IEEE Journal ofOceanic Engineering,2008,33(2):158-170.
    [10] Wenming, CaoHao, Feng Lili, HuTiancheng, HeWuhan. Space TargetRecognition Based on Biomimetic Pattern Recognition [C]. InternationalWorkshop on Database Technology and Applications,2009:25-26.
    [11] Iain E., Richardson G.. H.264and MPEG-4Video Compression-Video Codingfor Next-Generation Multimedia [R]. John Wiley&Sons Ltd,2003.
    [12] Boudewijn P.F., Lelieveldt. Information Processing in Medical Imaging [J].Medical Image Analysis,2008,12(6):729-730.
    [13] Dollarhide AW., Rutledge T., Weinger MB., Dresselhaus TR.. Use of a handheldcomputer application for voluntary medication event reporting by inpatient nursesand physicians [J]. Journal of General Internal Medicine,2008,23(4):418-422.
    [14] K. Kuusilinna, et al.. Designing BEE: a Hardware Emulation Engine for SignalProcessing in Low-Power Wireless Applications [J]. EURASIP Journal onApplied Signal Processing,2003:502-513.
    [15] G. E. Moore. Excerpts from A Conversation with Gordon Moore: Moore’S Law[R]. Intel Corporation,2005
    [16] Jinuk Luke Shin, Kenway Tam, Dawei Huang, and et al. A40nm16-Core128-Thread CMT SPARC SoC Processor [C]. IEEE International Solid-StateCircuits Conference,2010:98-99.
    [17] D. Burger, J.R. Goodman. Billion-Transistor Architectures: There and BackAgain [C]. Computer,2004,37(3):22-28.
    [18] William J.Dally, Patrick Hanrahan, Mattan Erez, and et al. Merrimac:Supercomputing with Streams [C]. In Proceedings of the SupercomputingConference,2003.
    [19] S. Hsu and et al. A2GHz13.6mW12x9b Multiplier for Energy Efficient FFTAccelerators [C]. In Proceedings of31stEuropean Solid-State Circuits Conference,2005:199-202.
    [20] NVIDIA Inc. NVIDIA GeForce GTX200GPU Architectural Overview [EB/OL].http://www.nvidia.com，2009.
    [21] Ron Ho, Ken Mai, and Mark Horowitz. Managing wire scaling: A circuitperspective [C]. In Proceedings of the IEEE International InterconnectTechnology Conference,2003.
    [22] Brucek Khailany. the VLSI Implementation and Evaluation of Area and EnergyEfficient Streaming Media Processors [D]. Ph.D. Thesis, Stanford University,2003.
    [23] Jessy Fang. Challenges and Opportunities on Multi-core Microprocessor [C].ACSAC2005,2005:389-390.
    [24] Mark Horowitz. Scaling, power and the future of cmos [C]. In Proceedings of the20thInternational Conference on VLSI Design,2007:23.
    [25] Wayne Wolf. High Performance Embedded Computings: Architecture,Application, and Methodologies [M]. Morgan Kaufmann Publishing,2007.
    [26] Scott Rixner. Stream Processor Architecture [M]. Kluwer Academic Publishers,Boston, MA,2001.
    [27] W.J. Dally, James Balfour, David Black-Shaffer, and et al.. efficient embeddedcomputing [C]. Computer,2008,41(7):27-32.
    [28] Yoonseo Choi, Yuan Lin, Nathan Chong, Scott Mahlke, Trevor Mudge. StreamCompilation for Real-time Embedded Multicore Systems [C]. In proceedings ofCGO,2009:210-220.
    [29] William. Language and Compiler Support for Stream Programs [D]. Ph.D. Thesis,Massachusetts Institute of Technology,2009.
    [30] NVIDIA Inc. CUDA Programming Guide v1.0[R]. http://www.nvidia.com,2007.
    [31] YU-Kwond Kwok, Ishfaq Ahmad. Static Scheduling Algorithms for AllocatingDirected Task Graphs to Multiprocessors [J]. ACM Computing Surveys,1999,31(4):406-471.
    [32] Michael A. Bender, Cynthia A. Phillips. Scheduling DAGs on AsynchronousProcessors [C]. In proceedings of SPAA,2007:35-45.
    [33] Zheng Wang, Michael F.P., O’Boyle. Mapping Parallelism to Multi-cores: AMachine Learning Based Approach [C]. In Proceedings of PPoPP,2009:75-84.
    [34] Timothy J. Knight, Ji Young Park, and et al.. Compilation for Explicitly ManagedMemory Hierarchies [C]. In Proceedings of PPoPP,2007:226-236.
    [35] Scott Schneider, Jae-Seung Yeom, Benjamin Rose, and et al.. A Comparison ofProgramming Models for Multiprocessors with Explicitly Managed MemoryHierarchies [C]. In proceedings of PPoPP,2009:131-140.
    [36] Lee Jeong-Gun, Kwang-Ju, Oryong-dong, and et al.. Instruction level redundantnumber computations for fast data intensive processing in asynchronousprocessors [J]. Journal of Systems Architecture,2005,51(3):151-164.
    [37] Scott Rixner, William J. Dally, Ujval J. Kapasi, and et al.. A bandwidth-efficientarchitecture for media processing [C]. In Proceedings of the31stAnnualIEEE/ACM International Symposium on Microarchitecture,1998:3–13.
    [38] Ian Buck, Tim Foley, Daniel Horn, and et al.. Brook for GPUs: StreamComputing on Graphics Hardware [J]. ACM Transactions on Graphics,2004,23(3):777–786.
    [39] David Tarditi, Sidd Puri, Jose Oglesby. Accelerator: Using Data Parallelism toProgram GPUs for General-Purpose Uses [C]. In Proceedings of the2006ASPLOS,2006.
    [40] AMD Inc. AMD Stream SDK User Guide v1.2.1(beta)[R]. http://www.amd.com,2008.
    [41] Ghuloum A., Smith T., Wu G., and et al.. Future-Proof Data Parallel Algorithmsand Software On Intel Multi-Core Architecture [J]. Intel Technology Journal,2007,11(4):333–348.
    [42] Peter Mattson et al.. Imagine Programming System User’s Guide [R].http://cva.stanford.edu,2002.
    [43] Kayvon Fatahalian, Timothy J. Knight, Mike Houston, and et al.. Sequoia:Programming the Memory Hierarchy [C]. In Proceedings of the2006SC,2006:46-57.
    [44] Peter Kogge et al.. ExaScale Computing Study: Technology Challenges inAchieving Exascale Systems [R]. DARPA IPTO,2008.
    [45] Analog Devices, Inc. ADSP-TS201S: TigerSHARC Embedded Processor [R].http://www.analog.com,2006.
    [46] A. Abbo, R. Kleihorst, V. Choudhary, L. Sevat, P. Wielage, S. Mouy, B.Vermeulen, and M. Heijligers. Xetal-II: a107Gops,600mW massively parallelprocessor for video scene analysis [J]. IEEE Journal of Solid-State Circuits,2008,43(1):192–201.
    [47] Cacium Corporation. OCTEON III CN7XXX Multi-Core MIPS64Processors[EB/OL]. http://www.cavium.com/OCTEON_MIPS64.html,2012.
    [48] picochip Corporation. PC202/PC202-10Integrated Baseband Processor [EB/OL].http://www.picochip.com,2012.
    [49] Yuri Nishikawa, Michihiro Koibuchi, Masato Yoshimi, Kenichi Miura andHideharu Amano. Performance Improvement Methodology for ClearSpeed’sCSX600[C]. International Conference on Parallel Processing,2007.
    [50] Kees van Berkel, Fank Heinle, Patrick P.E. Meuwissen, and et al; VectorProcessing as an enabler for Software-Defined Radio in Handheld Devices [J].EURASIP Journal on Applied Signal Processing,2005.
    [51] S. Kyo and S. Okazaki. IMAPCAR: A100Gops In-Vehicle Vision ProcessorBased on128Ring Connected Four-Way VLIW Processing Elements [J]. Journalof Signal Processing Systems,2011,62:5–16.
    [52] Tilera Inc. Tile-Gx Processor Family Product Brief [EB/OL].http://www.tilera.com,2009.
    [53] B. Khailany et al.. A Programmable512Gops Stream Processor for Signal, Image,and Video Processing [C]. IEEE ISSCC,2007.
    [54] NVIDIA Inc. Nvidia Tesla C2050/C2070GPU Computing Processor [EB/OL].http://www.nvidia.com,2009.
    [55] NVIDA Inc. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi(White Paper)[R]. http://www.nvidia.com,2009.
    [56] AMD Inc. Radeon HD5870overview [EB/OL]. http://www.amd.com,2009.
    [57] Sriram Vangali, Jason Howard, Gregory Ruhi, et al.. An80-Tile1.28TFLOPSNetwork-on-Chip in65nm CMOS [C].2007IEEE International Solid-StateCircuits Conference,2007.
    [58] Draco Electronics.3000+Cores Chip–Draco1for high performance computing[Z]. Technical Report,2010.
    [59] James Balfour, William J. Dally, and et al. An Energy-Efficient ProcessorArchitecture for Embedded Systems [C]. IEEE Computer Architecture Letters,2008,7(1):29-32.
    [60] Francky Catthoor, Praveen Raghavan, and et al. Ultra-Low EnergyDomain-Specific Instruction-Set Processors [M]. Springer Science Publisher,2009.
    [61] Mike Butts. Synchronization through Communication in a Massively ParallelProcessor Array [J]. IEEE Micro, September/October,2007:32-40.
    [62] Intel Corporation. Introducing Intel Many Integrated Core Architecture [EB/OL].http://www.intel.com,2011.
    [63] Daniel R. Johnson, Matthew R. Johnson, John H. Kelm, and et al.. RIGEL: A1,024-CORE SINGLE-CHIP ACCELERATOR ARCHITECTURE [J]. IEEEMicro, July/August,2011:30-41.
    [64] Aeroflex Gaisler. GRLIB IP Library User’s Manual [R]. http://www.gaisler.com,2009.
    [65] Wikipedia. Leon processor [EB/OL]. http://en.wikipedia.org/wiki/leon,2012.
    [66] Aeroflex Gaisler. BCC-Bare-C Cross-Compiler User’s Manual [R].http://www.gaisler.com,2009.
    [67] John L. Hennessy and David A. Patterson. Computer Architecture: A QuantitativeApproach (Fourth Edition)[M]. Morgan Kaufmann Publishing,2007.
    [68] Josh A. Fisher, Paolo Faraboschi, and Cliff Young. Embedded Computing: AVLIW approach to Architecture, Compilers and Tools [M]. Morgan KaufmannPublishing,2005.
    [69] Q. Jacobson and J.E. Smith. Instruction pre-processing in trace processors [C]. InProceedings of the5thInternational Symposium on High Performance ComputerArchitecture,1999:125
    [70] Nathan Clark, Jason Blome, Michael Chu, and et al.. An architecture frameworkfor transparent instruction set customization in embedded processors [C]. InProceedings of the32ndannual international symposium on ComputerArchitecture,2005:272–283.
    [71] Texas Instruments. TMSC320C55x DSP Mnemonic Instruction Set ReferenceGuide [R], http://www.ti.com,2002.
    [72] Tom R. Halfhill. Tensillica tackles bottlenecks [J]. Microprocessor Report,2004:34-40.
    [73] J.van de Waerdt, S.Vassiliadis, S.Das, S.Mirolo, and et al. The tm3270media-processor [C]. In Proceedings of the38thAnnual IEEE/ACM Intnl.Symposium on Microarchitecture (MICRO’05),2005:331–342.
    [74] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, and et al. The case for asingle-chip multiprocessor [J]. SIGPLAN Not.,1996,31(9):2–11.
    [75] U.G. Nawathe, M. Hassan, K.C. Yen, A. Kumar, A. Ramachandran, and D.Greenhill. Implementation of an8-core,64-thread, power-efficient sparc server ona chip. Solid-State Circuits [J]. IEEE Journal of,2008,43(1):6–20.
    [76] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, M. Ratta,and S. Kottapalli. A45nm8-core enterprise xeon processor [C]. IEEEInternational Solid-State Circuits Conference,2009:56-57.
    [77] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. Theimplementation of the65nm dual-core64b Merom processor [C]. In Solid-StateCircuits Conference,2007:106–590.
    [78] Lance Hammond, Basem A. Nayfeh, and Kunle Olukotun. A single-chipmultiprocessor [J]. Computer,1997,30(9):79–85.
    [79] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, and et al.. Thelandscape of parallel computing research: A view from Berkeley [R].UCB/EECS-2006-183, University of California, Berkeley,2006.
    [80] Anant Agarwal and Markus Levy. The kill rule for multicore [C]. In Proceedingsof the44th annual Design Automation Conference,2007:750–753.
    [81] Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, and et al..Piranha: a scalable architecture based on single-chip multiprocessing [C]. ISCA,2000:282–293.
    [82] Rehan Hameed, Wajahat Qadeer, Megan Wachs, et al. Understanding Sources ofInefficiency in General-Purpose Chips [C]. International Symposium onComputer Architecture (ISCA),2010:37-47.
    [83] Ofer Shacham, Omid Azizi, Megan Wachs, and et al.. Rethinking Digital DesignWhy Design Must Change [J]. IEEE Micro, November/December,2010:9-24.
    [84] Mauricio Alvarez, Esther Salam′, et al. A Performance Characterization of HighDefinition Digital Video Decoding using H.264/AVC [C]. IEEE InternationalSymposium on Workload Characterization,2005:24–33.
    [85] D.F. Zucker, M.J. Flynn, and R.B. Lee. A Comparison of Hardware PrefetchingTechniques for Multimedia Benchmarks [C]. International Conference onMultimedia Computer and System,1996.
    [86] Norman P. Jouppi. Improving direct-mapped cache performance by the additionof a small fully-associate cache and prefetch buffers [C]. International Symposiumon Computer Architecture (ISCA),1990:62-73.
    [87] J.Zalamea, J.Llosa, E.Ayguade, and M.Valero. Two-level hierarchical register fileorganization for vliw processors [C]. IEEE/ACM International Symposium onMicroarchitecture (MICRO),2000:137–146.
    [88] J.Ph. Diguet, S. Wuytack, F. Catthoor and et al.. Formalized methodology for datareuse exploration in hierarchical memory mappings [C]. International Symposiumon Low Power Electronics and Design,1997:30-35.
    [89] Brucek Khailany, William J. Dally, Ujval J. Kapasi, and et al.. Imagine：MediaProcessing with Streams [J]. IEEE Micro, March/April,2001:35-46.
    [90] Jimit Shah, K.S. Raghunandan and Kuruvilla Varghese. Area Optimized H.264Intra Prediction Architecture for1080p HD Resolution [C]. IEEE InternationalConference on Application-specific Systems, Architectures and Processors(ASAP),2010:297-300.
    [91] Afrin Naz, Mehran Rezaei, Krishna Kavi, and et al.. Improving Data CachePerformance with Integrated Use of Split Caches, Victim Cache and StreamBuffers [C]. ACM SIGARCH Computer Architecture News,2005,33(3):41-48.
    [92] J.H.Kelm, Daniel R. Johnson, William Tuohy, and et al.. Cohesion: A HybridMemory Model for Accelerators [C]. International Symposium on ComputerArchitecture (ISCA),2010:429-440.
    [93] Mei Wen, Nan Wu, Chunyuan Zhang, and et al. On-chip Memory SystemOptimization Design for the FT64Scientific Stream Accelerator [J]. IEEE Micro,July/August,2008:51-70.
    [94] R. Bannakar, S.Steinke, B.Lee, and et al.. Scratchpad Memory: A DesignAlternative for Cache On-chip Memory in Embedded Systems [C]. InternationalSymposium on Hardware/software Codesign (CODES),2002:73-78.
    [95] D.Chiou, P.Jain, L.Rudolphm, and et al.. Application-specific memorymanagement for embedded systems using software-controlled caches [C]. DesignAutomation Conference (DAC),2000:416–419.
    [96] Timothy J. Knight, Ji Young Park, Manman Ren, and et al.. Compilation forExplicitly Managed Memory Hierarchies [C]. ACM Symposium on Principles andPractice of Parallel Programming (PPoPP),2007:226-236.
    [97] Abhishek Das and William J. Dally. Stream Scheduling: A Framework to ManageBulk Operations in a Memory Hierarchy [C]. International Conference on ParallelArchitecture and Compilation Techniques (PACT),2007:15-19.
    [98] Scott Rixner. Stream Processor Architecture [M]. Kluwer Academic Publishers,Boston, MA,2001.
    [99] S. Agarwala, P. Koeppen, T. Anderson, and et al.. A600MHz VLIW DSP [C].International Solid-State Circuits Conference-Digest of Technical Papers,2002:56–57.
    [100] P. Hanrahan. Why Are Graphics Systems So Fast?(Keynote)[R]. InternationalConference on Parallel Architectures and Compilation Techniques (PACT),2009.
    [101] Peter Mattson, Ujval Kapasi, John Owens. Imagine Programming SystemDeveloper’s Guide [R]. http://cva.stanford.edu,2003.
    [102] Cadence Design Systems, Inc. Using Encounter RTL Compiler [R]. CadenceDesign Systems, Inc,2010.
    [103] Cadence Design Systems, Inc. Encounter Digital Implementation System UserGuide [R]. Cadence Design Systems, Inc,2010.
    [104] Cadence Design Systems, Inc. Encounter Power System User Guide [R]. CadenceDesign Systems, Inc,2010.
    [105] ARM Inc. TSMC65nm CLN65GPLUS RVT Process1.0-Volt12-TrackAdvantage Standard Cell Library v2.1Databook [R]. http://www.arm.com,2009
    [106] Naveen Muralimanohar. Cacti6.5[Z]. http://www.cs.utah.edu/~naveen/,2010.
    [107] Philips Corporation. TriMedia processor [Z]. http://www.trimedia.philips.com,2007.
    [108] J. Fridman and Z. Greenfield. The TigerSHARC DSP architecture [J]. IEEEMicro, Jan/Feb.2000,20(1):66-76.
    [109] Turley Jim and Hakkarainen Harri. TI’s New ‘C6x DSP Screams at1,600MIPS[J]. Microprocessor Report, February,1997:14-17.
    [110] S. Rixner, W. J. Dally, B. Khailany, and et al.. Register organization for mediaprocessing [C]. In Proceeding of the6thInternational Symposium onHigh-Performance Computer Architecture,2000:375–386.
    [111] R. Balasubramonian, N. Muralimanohar, K. Ramani, and et al.. Microarchitecturalwire management for performance and power in partitioned architectures [C].HPCA,2005:28-39.
    [112] Viktor S. Lapinskii, Margarida F. Jacome, and Gustavo A. De Veciana. ClusterAssignment for High Performance Embedded VLIW Processors [J]. ACMTransactions on Design Automation of Electronic Systems, July,2002:430-454.
    [113] Hugo DeMan. Ambient intelligence: Giga-scale dreams and nano-scale realities(Keynote speech)[R]. In Proc of ISSCC,2005.
    [114] Anup Gangwar, M. Balakrishnan, Preeti R. Panda and Anshul Kumar. Evaluationof Bus Based Interconnect Mechanisms in Clustered VLIW Architectures [C].DATE,2005.
    [115] A. S. TERECHKO, LE TH′ENAFF, E., VAN EIJNDHOVEN, J. T. J., ANDCORPORAAL, H. Inter-cluster communication models for clustered VLIWprocessors [C]. HPCA,2003:354–364.
    [116] A. S. TERECHKO and H. CORPORAAL. Inter-cluster Communication in VLIWArchitectures [J]. ACM Transactions on Architecture and Code Optimization,article11,2007,4(2):1-38.
    [117] Sourabh Saluja, Anshul Kumar. Performance Analysis of Inter ClusterCommunication Methods in VLIW Architecture [C]. Proceedings of the17thInternational Conference on VLSI Design (VLSID’04),2004.
    [118] Praveen Raghavan, Satyakiran Munaga, Estela Rey Ramos, and et al. ACustomized Cross-Bar for Data-Shuffling in Domain-Specific SIMD Processors[C]. ARCS2007, LNCS4415,2007:57-68.
    [119] Jae Young Hur, Todor Stefanov, Stephan Wong, and Stamatis Vassiliadis.Systematic Customization of On-Chip Crossbar Interconnects [C]. ARC2007,LNCS4419,2007:61-72.
    [120] A. Yavuz Omc, H. M. Huang. Crosspoint Complexity of Sparse CrossbarConcentrators [J]. IEEE TRANSACTIONS ON INFORMATION THEORY,1996,42(5):1466-1471.
    [121] Guy Lemieux, Paul Leventis and David Lewis. Generating Highly-RoutableSparse Crossbars for PLDs [C]. FPGA,2000.
    [122] Mattson, P., and et al.. Communication scheduling [C]. Proceedings of the9thInternational Conference on Architectural Support for Programming Languagesand Operating Systems,2000:82-92.
    [123] G. Essakimuthu, N. Vijaykrishnan, and M. J. Irwin. An analytical powerestimation model for crossbar interconnects [R]. CSE-02-009(Technical Report),Penn State University,2002.
    [124] Hangsheng Wang. A Detailed Architectural-Level Power Model for RouterBuffers, Crossbars and Arbiters [R]. http://www.princeton.edu,2004.
    [125] B. Afzal, A. Afzali-Kusha, and M. El Nokali. Efficient Power Model for CrossbarInterconnects [C]. ISCAS,2005,6:5858-5861.
    [126] Santanu Dutta, Kevin J. O’Connor and Andrew Wolfe. High-PerformanceCrossbar Interconnect for a VLIW Video Signal Processor [C]. IEEE ASICConference,1996:45-50.
    [127] INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS:2008UPDATE [R]. http://www.itrs.net,2008.
    [128] William J. Dally and John W. Poulton. Digital Systems Engineering [M].Cambridge University Press, New York, NY,1998.
    [129] Shekhar Borkar and Andrew A.Chien. The Future of Microprocessors [J].Communications of the ACM,2011,54(5):87-97.
    [130] Johnson, M. Superscalar Microprocessor Design [M]. Prentice Hall, EnglewoodCliffs, N.J,1990.
    [131] Yeager, K. The MIPS R10000superscalar microprocessor [J]. IEEE Micro,1996,16(2):28–40.
    [132] J. A. Fisher. Very long instruction word architectures and the ELI-52[C]. In Proc.10thAnnu. Int. Symp. Computer Architecture,1983:140–150.
    [133] Joseph A. Fisher, Paolo Faraboschi, and Cliff Young. VLIW Processors: FromBlue Sky To Best Buy [J]. IEEE Solid-State Circuit Magazine, spring,2009:10-17.
    [134] Nicholas FitzRoy-Dale. The VLIW and EPIC processor architectures [EB/OL].http://www.cse.unsw.edu.au/~disy/,2005.
    [135] M．S．Schlansker and B．IL Rau. EPIC：Explicitly Parallel Instruction Computing[J]. IEEE Computer,2000,33(2):37-45.
    [136] P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood. Lx: Atechnology platform for customizable VLIW embedded processing [C]. In Proc.27thAnnu. Int. Symp. Computer Architecture,2000:203–213.
    [137] McNairy, C., and D. Soltis. Itanium2processor microarchitecture [J]. IEEEMicro,2003,23(2):44–55.
    [138] S. Kyo, S. Okazaki, and T. Arai. An integrated memory array processorarchitecture for embedded image recognition systems [C]. In Proc. Int. Symp.Computer Architecture (ISCA),2005:134-145.
    [139] H. Kaul, M. A. Anders, S. K. Mathew, and et al.. A300mV494Gops/WReconfigurable Dual-Supply4-Way SIMD Vector Processing Accelerator in45nm CMOS [C]. In IEEE Int. Solid-Stace Circ. Conf,2009:260–263.
    [140] Asanovic, K.. Vector microprocessors [D]. Ph.D. thesis, Computer ScienceDivision,Univ. of California at Berkeley,1998.
    [141] Oliker, L., A. Canning, J. Carter, J. Shalf, and S. Ethier. Scientific computationson modern parallel vector systems [C]. Proc. of ACM/IEEE Conference onSupercomputing,2004:10.
    [142] N. T. Slingerland and A. J. Smith. Multimedia extensions for general purposemicroprocessors: A survey [J]. Microprocessor Microsystem,2005,29(5):225–246.
    [143] Intel Corporation. Streaming SIMD Extension2(SSE2)[EB/OL].http://www.intel.com/support/processors/sb/cs-001650.htm,2007.
    [144] Linley Gwennap. AltiVec Vectorizes PowerPC Forthcoming MultimediaExtensions Improve on MMX [J]. Microprocessor Report,1998,12(6):1-5.
    [145] P.Kongetira, K.Aingaran, and K.Olukotun. Niagara: A32-Way Multi-threadedSparc Proocessor [J]. IEEE Micro,2005,25(2):21-29.
    [146] Fujitsu Limited. SPARC64VI/VI+: Next Generation Processor [EB/OL].http://www.fujitsu.com,2005.
    [147] Robert Golla. Niagara2: A Highly Threaded Server-on-a-Chip [C]. Hot Chips,2006.
    [148] Christoforos Kozyrakis, David Patterson. Vector Vs. Superscalar and VLIWArchitectures for Embedded Multimedia Benchmarks [C]. In the Proceedings ofthe35thInternational Symposium on Microarchitecture,2002:283-293.
    [149] Christoforos Kozyrakis. Scalable Vector Media-processors for EmbeddedSystems [D]. Ph.D. Thesis, University of California at Berkeley,2002.
    [150] Christos Kozyrakis and David Patterson. Overcoming the limitations ofconventional vector processors [C]. In30thAnnual International Symposium onComputer Architecture,2003:399-409.
    [151] S. Segar. Low power design techniques for microprocessors [C]. In Proceedingsof International Solid State Circuits Conference,2001.
    [152] Johnson Kin, Munish Gupta, and William H. Mangione Smith. Filtering memoryreferences to increase energy efficiency [J]. IEEE Transactions on Computers,2000,49(1):1–15
    [153] Gang-Ryung Uh, Yuhong Wang, David Whalley, and et al.. Effective exploitationof a zero overhead loop buffer [C]. Proceedings of the ACM SIGPLAN workshopon Languages, compilers, and tools for embedded systems,1999:10–19.
    [154] E.Roternberg, S.Bennett, and J.Smith. Trace cache: A low latency approach tohigh bandwidth instruction fetching [C]. Proc. of29thIntnl. Symposium onMicroarchitecture (MICRO),1996.
    [155]伍楠.流处理器MASA内核的研究及实现[D].硕士学位论文,国防科学技术大学,2005.
    [156] Nan Wu, Qianming Yang, Mei Wen, and et al.. Tiled Multi-Core StreamArchitecture [J]. Transactions on High-Performance Embedded Architectures andCompilers (HiPEAC),2009,4(3):274-293.
    [157] Lam, M. Software pipelining: An effective scheduling technique for VLIWprocessors [C] ACM SIGPLAN Conf. on Programming Language Design andImplementation,1988:318–328.
    [158] Eric LaForest. Survey of Loop Transformation Techniques [R]. ECE1754,2010.
    [159] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien. Yu, and et al.. Design ofion-implanted MOSFETS with very small physical dimensions [J]. IEEE Journalof Solid State Circuits,1974,9(5):256-268.
    [160] Psilogeorgopoulos M., Munteanu M., Chuang T., and et al.. ContemporaryTechniques for Lower Power Circuit Design [R]. Tech Report. D2.1, theUniversity of Sheffield,1998.
    [161]拉贝,钱德拉卡山,尼科利奇.数字集成电路-设计透视（第2版）[M].北京:清华大学出版社,2004.
    [162] Vissers, K. A. Parallel Processing Architectures for Reconfigurable Systems [C]. In Design, Automation and Test in Europe Conference and Exhibition (DATE), Munich, Germany, 2003.
    [163] Platzner, M. Reconfigurable Computer Architectures [J]. e&i Elektrotechnik und Informationstechnik, Springer, 1998, 115(3):143-148.
    [164] Chen Chang. Design and Applications of a Reconfigurable Computing System for High Performance Digital Signal Processing [D]. Ph.D. Thesis, University of California, Berkeley, 2005.
    [165] Marco D. Santambrogio, Donatella Sciuto. Partial Dynamic Reconfiguration: The Caronte Approach. A New Degree of Freedom in the HW/SW Codesign [C]. FPL, 2006:1-2.
    [166] Francois Labonte, Peter Mattson, et al. The Stream Virtual Machine [C]. Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques, 2004:267-277.
    [167] Taylor M B, Lee W, Miller J, et al. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams [C]. ISCA, 2004:2-13.
    [168] Sankaralingam K, Nagarajan A, McDonald R, et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor [C]. 39th Annual International Symposium on Microarchitecture, 2006:480-491.
    [169] Jung H A, Dally W J, Kapasi U J, et al. Evaluating the Imagine Stream Architecture [C]. Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004:14-25.
    [170] Chai S K, Chiricescu S, Essick R, et al. Streaming Processor for Next-Generation Mobile Image Applications [J]. IEEE Communications Magazine, 2005:81-89.
    [171] Hankins R A, Chinya G N, Collins J D, et al. Multiple Instruction Stream Processor [C]. In Proceedings of the 33rd International Symposium on Computer Architecture, 2006:114-127.
    [172] Stamatis Vassiliadis, Dimitrios Soudris, Yale Patt, et al. Fine- and Coarse-Grain Reconfigurable Computing [M]. Springer, 2007.
    [173] Katarzyna Leijten-Nowak. Template-Based Embedded Reconfigurable Computing [D]. Ph.D. Thesis, Eindhoven University of Technology, 2004.
    [174] A. Sangiovanni-Vincentelli and G. Martin. Platform-based design and software design methodology for embedded systems [J]. IEEE Design and Test of Computers, 2001, 18(6):23–33.
    [175] R. Goering. Platform-based design: A choice, not a panacea [R]. EE Times, 2002.
    [176] Texas Instruments Inc. OMAP2430 overview [EB/OL]. http://www.ti.com, 2009.
    [177] PACT XPP Technologies. The XPP-III White Paper (Release 2.0.1) [R]. http://pactxppnew.com, 2010.
    [178] Philips Electronics N.V. Nexperia Advanced Prototyping Architecture [R]. http://www.philips.com, 2008.
    [179] Triscend Corporation. Triscend A7S Configurable System-on-Chip Platform [R]. http://www.triscend.com, 2010.
    [180] Chien S Y, Chen T H, Chen J C, et al. Coarse-Grained Reconfigurable Image Stream Processor Architecture for Embedded Image/Video Processing and Analysis [C]. ICME, 2009:1578-1579.
    [181] Singh H, Lee M, Lu G, et al. MorphoSys: Case Study of a Reconfigurable Computing System Targeting Multimedia Applications [C]. Proc. of Design Automation Conference (DAC'00), 2000:573-578.
    [182] T. Miyamori, K. Olukotun. A Quantitative Analysis of Reconfigurable Coprocessor for Multimedia Applications [C]. Proc. of IEEE Symp. on FCCM, 1998:2-11.
    [183] Jian H, Matthew P, Jooheung L, and Ronald FD. Scalable FPGA-based Architecture for DCT Computation Using Dynamic Partial Reconfiguration [J]. ACM Transactions on Embedded Computing Systems, December 2008:1-18.
    [184] Claus C, Zeppenfeld J, Müller F, and Stechele W. Using partial-run-time reconfigurable hardware to accelerate video processing in driver assistance system [C]. DAC, 2007.
    [185] Mateusz M, Jürgen T, Ali A, and Christophe B. The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-Based Computer [J]. The Journal of VLSI Signal Processing, 2007, 47(1):15-31.
    [186] Rixner S, Dally W J, Kapasi U J, et al. Memory access scheduling [C]. In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000:128-138.
    [187] Xilinx Inc. XPS HWICAP (v1.00.a) Product Specification (DS586) [R]. http://www.xilinx.com, 2007.
    [188] Ming Liu, Wolfgang Kuehn, Zhonghai Lu, Axel Jantsch. Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration [C]. FPL, 2009.
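    [189] Yang Qianming. Research and Implementation of a Multi-Core Stream Architecture Simulator [D]. Master's Thesis, National University of Defense Technology, 2008.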
    [190] Pi Y, Long H, and Huang S. A SAR parallel processing algorithm and its implementation [C]. FIEOS Conf., 2002.
    [191] Mei Wen, Nan Wu, Qianming Yang, et al. The MASALA Machine: Accelerating Thread-intensive and Explicit Memory Management Programs with Dynamically Reconfigurable FPGAs [C]. FPGA, 2012.
    [192] Liu Xiao, et al. Implementation for High Resolution SAR Parallel Imaging [J]. Information and Electronic Engineering, 2008, 6(1):24-28.
    [193] Peter Carlston, et al. Optimizing an Innovative SAR Post-Processing Algorithm for Multi-Core Processors: A Case Study [C]. High Performance Embedded Computing Workshop, 2009.
    [194] William Lundgren, et al. Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture [C]. High Performance Embedded Computing Workshop, 2007.
