Research on the Key Techniques of Application-Specific Instruction-Set Processors

English title: Research on the Key Techniques of Application-Specific Instruction-Set Processors
Author: 陈虎 (Chen Hu)
Degree level: Doctoral
Discipline: Electronic Science and Technology
Keywords: Application-Specific Instruction-Set Processor (ASIP); Extended Instruction; Automatic Instruction-Set Extension (ISE); Data-Path Compression; Scalable Instruction Encoding; Distributed Instruction Execution Control
Degree year: 2011
Advisor: 陈书明 (Chen Shuming)
Discipline code: 080903
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2011-09-01

Abstract

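In fields such as media processing and software-defined radio, application standards keep evolving and algorithms grow steadily more complex. At the same time, users demand ever higher quality of service from media and communication technologies. As a result, the computational complexity of applications grows far faster than processor performance improves. Processors must complete large volumes of data computation within given resource constraints and agreed time limits. Because of the area, delay, and power consumption of register files, and the limits imposed by the instruction-set architecture, the instruction encoding scheme, and the instruction execution control logic, traditional very long instruction word (VLIW) and superscalar uniprocessors have very limited scalability and power efficiency. Moreover, because the instruction-level parallelism available in a single thread is limited, uniprocessor performance can no longer keep rising as the available on-chip resources increase.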
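     Over the past 10 years, the Multiprocessor System-on-Chip (MPSoC) has proven to be one effective way to further improve processor performance and power efficiency. By partitioning on-chip resources among multiple processor cores, an MPSoC runs multiple tasks concurrently, which helps reduce the complexity of each core, lower the operating frequency of the system, and improve its power efficiency. However, sequentially written application programs are difficult to map onto an MPSoC quickly and efficiently. On-chip communication in an MPSoC also imposes overheads on performance, area, and power. For these reasons, MPSoC performance has not increased linearly, as had been expected, with the additional on-chip resources that integrated-circuit technology provides.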
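     A uniprocessor solution offers simple application programming and low resource communication overhead. In addition, building application-specific processors by extending the instruction set, memory structure, communication protocols, and so on of a traditional processor core has proven to be an effective way to improve power efficiency. Therefore, even though MPSoCs have become the mainstream of processor development, it is worth re-examining the uniprocessor solution, identifying the key factors that limit its performance and power efficiency, and considering whether a uniprocessor can meet the computational demands of data- and computation-intensive complex applications. By analyzing applications and their execution characteristics, this thesis attempts to identify the fundamental shortcomings of uniprocessors; it studies instruction-set customization, instruction execution control, scalable instruction encoding, and application-specific processor architecture, and proposes techniques that improve uniprocessor performance and power efficiency. The research content and main contributions are as follows: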
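     1) We analyzed two typical applications and their execution on a VLIW processor, and summarized the shortcomings of uniprocessors in handling computation-intensive complex applications. The analysis shows that both applications exhibit varying degrees of parallelism at the task, loop, basic-block, and instruction levels, and that the computation patterns of their core algorithms are relatively fixed. However, the instruction execution control and instruction encoding schemes of traditional processors limit scalability; the instruction set of a traditional RISC processor is too reduced to favor improvements in performance and power efficiency; and traditional instruction scheduling, which fully accounts for control, data, and resource dependences, scales poorly and does not make full use of the resources available in the processor.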
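     2) We proposed a fast method for automatically generating extended instruction sets. The method first profiles the application to obtain its frequently used arithmetic and logic operations, then, centering on these operations, gradually increases the complexity of the extended instructions through windowed, stepwise search, keeping the output of every step locally optimal while keeping algorithmic complexity under control. The method explores the design space effectively and produces efficient extended instructions. Its complexity is proportional to the number of typical operations and to the average number of search steps performed around each typical operation, so it grows linearly with application complexity.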
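     3) We proposed a new instruction resource compression method. The method first finds the critical path of each extended instruction and partitions the instruction's data-flow graph into multiple paths. It then matches the paths of each instruction one by one against the paths of all other instructions to find the maximal common equivalent substring, and uses that substring as an index to compress all paths, which ensures that resources are fully shared among instructions. The method also allows the data-flow graph of an instruction to be modified: inserting simple operations with low delay and resource cost into a path makes the data-flow graph of that path, or of part of it, identical to that of another path or part of another path. This reduces the number of inserted multiplexers and the area and delay overhead they introduce.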
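     4) We proposed a hardware/software co-designed instruction encoding method that aims to remove the limits instruction encoding places on processor scalability without significantly increasing code size, while leaving the original processor's instruction word length, instruction-set architecture, hardware decoding structure, and compiler scheduling algorithm unchanged. Based on clustered processors, the method groups the instructions dispatched to functional units in the same cluster into one instruction pack, extracts the information common to the instructions in the pack, and inserts it into the pack as a pack header. This reduces the amount of information that must be encoded within each instruction word and enlarges the encoding space of the fixed-length instruction word. The method places no restriction on the type of common information or on the number of header instructions, which improves its scalability.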
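     5) We proposed an instruction execution control mechanism that combines centralized and distributed control: instruction fetch, decode, and issue remain centralized, while instruction execution and write-back are controlled in a distributed manner. The mechanism moves decode and issue from the instruction level to the instruction-pack level, which simplifies instruction issue. It also hands control of the three stages of instruction execution (operand fetch, execution, and write-back) to the functional units and the distributed register files, which both simplifies the control logic and makes the control mechanism scalable. In addition, the mechanism lets a consumer instruction use the output of a producer instruction as soon as that output becomes valid, improving the processor's ability to exploit data locality.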
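     6) Based on a scalable clustered processor, we proposed an ASIP (Application-Specific Instruction-set Processor) architecture that supports complex instructions. In the baseline architecture, functional units and register files are partitioned among multiple clusters; each cluster manages the execution of the instructions dispatched to it and communicates with other clusters through a scalable operand-passing network. Adding an extended cluster that contains extended functional units to the baseline therefore does not affect the rest of the architecture or its resource allocation, which reduces ASIP design complexity. The extended functional units allow an extended instruction to have up to 6 input operands and up to 2 output operands, permitting more complex extended instructions and greatly enlarging the extended-instruction design space.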
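The path-matching step of contribution 3 can be illustrated with a minimal sketch. It assumes each data-flow path is already available as a sequence of operation names and that two operations are "equivalent" only when their names are identical; the thesis's actual equivalence rule, path extraction, and multiplexer insertion are not reproduced here. The function names and the sample paths are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return the longest contiguous run of operations shared by paths a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return tuple(a[best_end - best_len:best_end])


def best_shared_segment(paths):
    """Compare every pair of paths and return the longest shared segment."""
    best = ()
    for p, q in combinations(paths, 2):
        segment = longest_common_substring(p, q)
        if len(segment) > len(best):
            best = segment
    return best


# Hypothetical operation paths taken from three extended instructions.
paths = [
    ("mul", "add", "shr", "sat"),
    ("sub", "mul", "add", "shr"),
    ("add", "shr", "abs"),
]
print(best_shared_segment(paths))  # ('mul', 'add', 'shr')
```

In this toy input the segment `mul, add, shr` appears in two paths, so a single hardware chain could serve both; the pairwise dynamic-programming comparison is a simple stand-in and is quadratic in path length.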
The evolution of applications in fields such as media processing and software-defined radio (SDR) continuously brings more complex algorithms. The demand for high-quality services requires huge volumes of data to be processed under given time and resource constraints. Traditional VLIW and superscalar processors do not scale well because the area, delay, and power consumption of centralized register files grow as O(N^2), O(N), and O(log4N), respectively, as the number of access ports N increases. Meanwhile, the control paths of these processors, such as instruction issue and commit, are organized in a centralized manner, which results in high complexity and poor scalability.
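To make the quadratic area term concrete, the following sketch compares one centralized register file with the same number of ports split evenly across clusters. It is only an arithmetic illustration: it assumes relative area equals the square of the port count, ignores any extra ports needed for inter-cluster transfers, and uses hypothetical function names.

```python
def centralized_area(ports):
    """Relative area of one register file under the assumed N^2 model."""
    return ports ** 2


def clustered_area(ports, clusters):
    """Relative total area when the ports are split evenly across clusters.

    Assumption: inter-cluster communication ports are not counted.
    """
    if ports % clusters:
        raise ValueError("ports must divide evenly among clusters")
    return clusters * centralized_area(ports // clusters)


for n in (8, 16, 32):
    print(n, centralized_area(n), clustered_area(n, clusters=4))
```

Under these assumptions, splitting N ports across k clusters divides the modeled area by k, which is one reason partitioned register files scale better than a centralized one.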
     Multiprocessor Systems-on-Chip (MPSoCs) have emerged in the past decade as promising solutions to meet the computation requirement. However, integrating multiple cores on a single chip does not directly increase performance or power efficiency for most sequentially written applications. The dilemma of either parallelizing sequential programs or writing parallel programs to take full advantage of the resources available in MPSoCs drives us to rethink uni-core solutions from a hardware/software codesign perspective: What are the fundamental limitations of uni-core processors? How can these limitations be eliminated and hardware complexity reduced? How can applications be sped up with minimal modification to the architecture of state-of-the-art processors? The main contributions of this thesis are summarized as follows:
     1) We analyzed the inefficiency of a uni-core VLIW processor in processing two typical computation-intensive benchmarks from media processing and software-defined radio, and tried to identify the fundamental limitations that hinder the scalability, performance, and power efficiency of uni-core solutions. The analysis points to three causes of the inefficiency of uni-core processors: the ultra-simplified instruction-set architecture (ISA) of RISC-like processors, the traditional plain binary instruction encoding scheme, and the centralized instruction execution control strategy. Based on this conclusion, we proposed systematic strategies to overcome the limitations of uni-core solutions.
     2) We proposed an automatic method that quickly enumerates candidate extended instructions. The method first profiles the source code and identifies the ALU operations that account for most of the execution time, then generates candidate extended instructions around these typical operations using windowed, progressive search. The pattern produced by each search step is locally optimal, which guarantees the efficiency of the final pattern to some extent. The enumeration method explores the design space effectively and has linear complexity: the algorithm complexity grows linearly with the number of typical operations and with the average number of search steps around each typical operation.
     3) We proposed a novel resource compression method for extended instructions implemented on extended functional units. The method first finds the critical path of an instruction and partitions the rest of the instruction's DFG (Data Flow Graph) into multiple paths, then finds the MCES (Maximal Common Equivalent String) of all paths of all instructions and compresses the instructions that contain the MCES. The method ensures that resources are effectively shared among instructions. It also allows the DFG of an instruction to be modified by inserting simple operations into its paths, in order to reduce the number of inserted multiplexers and their impact on area and delay.
     4) We proposed a hardware/software instruction encoding scheme to improve the scalability of uni-core architectures. By statically scheduling a sequence of dependent instructions into a pack, implementing common information in the pack in a dedicated instruction word, and converting instruction issuing to pack issuing, we could substantially reduce the number of bits required to encode instructions and the hardware complexity of instruction issuing, thus improving the scalability usually limited by fixed-length instruction formats and centralized instruction issuing.
     5) To improve the scalability and performance of uni-core processors, we proposed a novel distributed instruction execution control scheme and implemented a pipeline using this scheme. The resulting highly scalable pipeline features in-order issue, out-of-order execution, and parallel but in-order commit, because the functional units partitioned among clusters are allowed to read operands, execute instructions, write back results, and maintain data dependences themselves. Scalability is improved by the instruction execution control scheme, while performance is enhanced by the increased hardware speed and the improved temporal data locality.
     6) We proposed a novel ASIP architecture based on the scalable pipeline with distributed instruction execution control. The ASIP supports complex instructions with a maximum of 6 input operands and 2 output operands, which substantially extends the design space of extended instructions and improves the potential speedup from instruction customization. Execution resources such as functional units and register files are partitioned among clusters, and inter-cluster communication is implemented through a scalable operand-passing network. Thus, a change in the functionality of extended functional units does not affect the baseline architecture.
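The pack-based encoding of contribution 4 can be sketched as follows. This is an illustration only: it assumes the shared information is just a cluster identifier, and the field widths, the `Instr` type, and the function names are all hypothetical rather than taken from the thesis. It groups consecutive instructions bound for the same cluster, moves the cluster field into a header, and reports the bits freed inside instruction words alongside the header words added.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Instr:
    cluster: int      # target cluster (hypothetical field)
    opcode: str
    operands: tuple


CLUSTER_BITS = 3      # assumed width of the per-instruction cluster field


def build_packs(instrs):
    """Group consecutive instructions with the same cluster into packs.

    Each pack is (header, body): the header carries the shared cluster id
    once, and the body instructions no longer encode it.
    """
    packs = []
    for cluster, group in groupby(instrs, key=lambda ins: ins.cluster):
        body = [(ins.opcode, ins.operands) for ins in group]
        packs.append(({"cluster": cluster, "count": len(body)}, body))
    return packs


def freed_bits(packs):
    """Bits no longer needed inside instruction words across all packs."""
    return sum(CLUSTER_BITS * header["count"] for header, _ in packs)


def header_words(packs):
    """Extra words added to the code, one header per pack."""
    return len(packs)


program = [
    Instr(0, "add", ("r1", "r2", "r3")),
    Instr(0, "mul", ("r4", "r1", "r5")),
    Instr(0, "shr", ("r6", "r4", "r7")),
    Instr(1, "ld", ("r8", "r9")),
    Instr(1, "sub", ("r10", "r8", "r11")),
]
packs = build_packs(program)
print(len(packs), freed_bits(packs), header_words(packs))  # 2 15 2
```

The trade-off visible here matches the stated goal: each instruction word regains encoding space, while code size grows only by one header word per pack, so longer packs amortize the header better.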

