Research on Key Technologies of Synchronous Data Triggered Multi-Core Processor Architecture
Abstract
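With the rapid development of VLSI technology and ever-growing application demands, raising clock frequency alone can hardly improve processor performance any further, and advanced architectures, typified by multi-core processors, have gradually become the main route to higher performance. Driven by current integrated circuit process technology, on-chip multi-core processor structures have begun to emerge, but a series of scientific and technical problems remain to be solved, chiefly the multi-core parallel architecture, multi-core interconnect and communication, and the multi-core multi-level memory system. Research on the core theoretical and design problems facing multi-core processor architecture can provide a solid theoretical and technical foundation for designing and implementing future ultra-high-performance multi-core processor chips, and is of significant theoretical and practical value.
     Targeting ultra-high-performance multi-core processors, this dissertation studies in depth a Synchronous Data Triggered Architecture (SDTA) multi-core architecture. It comprises a large number of high-performance SDTA computing cores, each offering a simple structure, high utilization of computing resources, strong computing capability, and good scalability. In light of the characteristics of the SDTA multi-core processor, the dissertation focuses on key design techniques for the SDTA processing element: resource optimization is used to improve execution performance and reduce cost, and instruction compression is used to address code size. The dissertation then models the on-chip interconnect and communication structure of the SDTA multi-core processor, and studies and implements a multi-core interconnect system with high bandwidth, low latency, and low cost. The main research results are as follows: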
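     1. A synchronous data triggered multi-core architecture is proposed, comprising the SDTA computing cores, the SDTA memory system, the on-chip communication interconnect, and the multi-core synchronization mechanism. Each processing element is structurally simple, flexible to design, and highly scalable; it effectively supports both SIMD and MIMD and allows parallelism to be exploited at multiple levels. In addition, a multi-core memory system is designed that includes an instruction cache, local memory, a DMA unit, and a second-level cache; a network on chip serves as the basic communication framework, and a synchronization mechanism compatible with the SPARC architecture is supported.
     2. An analytical cost model is proposed for evaluating the area and power of a processing element; it meets the accuracy requirement while remaining flexible and efficient. A hardware resource optimization method suited to the SDTA processing element is also proposed: on top of a hardware/software design tool chain, it carries out local optimization of the computing core guided by a heuristic search algorithm and analytical global optimization of the processing element, and it is both efficient and effective.
     3. A template-based vertical dictionary compression technique is proposed to address the sparse code of the SDTA architecture, with emphasis on three factors: compression ratio, real-time decompression, and resource overhead. A split-stream parallel decompression hardware model is further proposed, and the software tool chain is modified accordingly. At the cost of a small number of extra execution cycles, the technique greatly reduces code size and lowers chip area and power overhead.
     4. An analytical performance analysis method for on-chip interconnection networks is proposed. A mathematical model of the network on chip based on the M/G/1/N queuing system is established; it is accurate and efficient, and it aids both network-on-chip structure design and the optimization of application topology mapping. To remove the performance bottleneck exposed by the single-channel structure, mathematical models of two improved multi-channel structures are also proposed. Guided by the resulting performance metrics, the micro-architecture of the SDTA multi-core on-chip interconnection network is designed and implemented.
     5. A dynamic virtual channel structure based on congestion relief is proposed to address the low buffer utilization and frequent blocking of on-chip routers. A typical router design is improved, and the VLSI implementation of the dynamic multi-channel router is completed. Experiments show that it dynamically adjusts the virtual channel organization to adapt to network traffic characteristics and improves network performance. The buffer shared by the virtual channels is organized as linked lists, which incurs little overhead, and the higher buffer utilization saves substantial chip area and power.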
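     Experimental results show that, for multimedia signal processing, the SDTA processing element after hardware resource optimization has low hardware cost and high execution performance: its core performance is comparable to that of the TI-C64 DSP, and the processing element as a whole significantly accelerates multimedia applications. In addition, the SDTA on-chip interconnection network offers high bandwidth and low latency; in particular, the proposed dynamic virtual channel technique effectively reduces cost and further improves network performance. These results provide a sound solution and a basis for theoretical analysis for SDTA multi-core processors, and can be applied directly to the design and implementation of future multi-core processor chips.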
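Contribution 5 states that the buffer shared by the virtual channels is organized as linked lists. The abstract gives no implementation details, so the following is only a minimal software sketch of that general idea, not the dissertation's router design: one pool of flit slots, a free list, and one linked list per virtual channel, so that any channel can claim any free slot. The class name SharedVCBuffer, its methods, and the slot counts are hypothetical.

```python
class SharedVCBuffer:
    """Flit buffer pool shared by several virtual channels (illustrative only)."""

    NIL = -1  # marks the end of a list

    def __init__(self, num_slots, num_vcs):
        self.data = [None] * num_slots
        # next_slot[i] links slot i to the following slot in the same list.
        # Initially every slot sits on the free list: 0 -> 1 -> ... -> NIL.
        self.next_slot = list(range(1, num_slots)) + [self.NIL]
        self.free_head = 0 if num_slots > 0 else self.NIL
        self.head = [self.NIL] * num_vcs  # oldest flit of each channel
        self.tail = [self.NIL] * num_vcs  # newest flit of each channel

    def enqueue(self, vc, flit):
        """Store a flit for channel vc; return False if the pool is full."""
        slot = self.free_head
        if slot == self.NIL:
            return False
        self.free_head = self.next_slot[slot]   # unlink from the free list
        self.data[slot] = flit
        self.next_slot[slot] = self.NIL
        if self.head[vc] == self.NIL:
            self.head[vc] = slot                # channel was empty
        else:
            self.next_slot[self.tail[vc]] = slot
        self.tail[vc] = slot
        return True

    def dequeue(self, vc):
        """Remove and return the oldest flit of channel vc, or None if empty."""
        slot = self.head[vc]
        if slot == self.NIL:
            return None
        flit = self.data[slot]
        self.head[vc] = self.next_slot[slot]
        if self.head[vc] == self.NIL:
            self.tail[vc] = self.NIL
        self.data[slot] = None
        self.next_slot[slot] = self.free_head   # push slot back on the free list
        self.free_head = slot
        return flit
```

With a static split, a channel that receives a burst would stall once its own partition fills even if other partitions are idle; here a busy channel can temporarily occupy most of the pool, which is one way higher buffer utilization can be obtained.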
With the rapid development of very large scale integration (VLSI) technology and the growth of application requirements, advanced multi-core architectures, rather than higher clock frequency, have become the prevalent approach to further improving processor performance. Driven by advances in integrated circuit technology, multi-core processors have begun to emerge. However, many problems remain to be solved, including the multi-core parallel architecture, on-chip communication, and a bandwidth-balanced multi-level memory system. In-depth study of these theoretical and design problems will provide a basis for implementing future high-performance multi-core processors and is of great theoretical and practical significance.
    
     In its research on high-performance processors, this dissertation presents a synchronous data triggered architecture (SDTA) multi-core architecture in which each processor element is scalable and delivers high performance while retaining a simple structure and high utilization of transistor resources. Building on this architecture, key design techniques for the SDTA processor element are studied. A novel resource optimization approach is used to improve performance and reduce hardware cost, and a code compression method is studied in depth to solve the code density problem. Next, an accurate analytical performance analysis approach for the network on chip is developed, and an on-chip communication structure with high bandwidth, low latency, and low cost is implemented. The main contributions are as follows.
     1. We propose a synchronous data triggered multi-core architecture composed of SDTA computing cores, the SDTA memory system, the on-chip communication structure, and the multi-core synchronization mechanism. Each processor has a simple and flexible structure, supports both SIMD and MIMD, and achieves high performance by exploiting parallelism at different levels. The memory system includes an instruction cache, local memory, a DMA engine, and a secondary eDRAM-based cache. A network on chip is used as the on-chip communication structure, and an effective synchronization mechanism is adopted that is compatible with the SPARC architecture.
     2. We develop software and hardware utility suites for the synchronous data triggered processor element and introduce an analytical approach to cost estimation that meets the precision requirement while remaining flexible and efficient. We also propose a novel automated approach to exploring and designing a high-efficiency processor element. The design space is explored with a divide-and-conquer approach: a heuristic search finds optimized computing cores, and an analytical method using trace-driven simulation optimizes the overall processor element.
     3. We put forward a template-based vertical dictionary program compression scheme to solve the poor code density of the synchronous data triggered architecture. The scheme emphasizes three aspects: a low compression ratio (compressed size relative to original size), limited hardware cost, and run-time decompression. Furthermore, we develop a multi-stream parallel decompression engine and update the software utility suite. The scheme achieves a very low compression ratio at the expense of little execution overhead, and it saves both area and power.
     4. We propose a novel performance analysis approach for the network on chip based on analytical router modeling. Starting from a generalized router architecture, we establish an analytical router model that uses the M/G/1/N queuing system; it can be used to explore the communication architecture and to guide application mapping. To eliminate the bottleneck revealed by the performance analysis, analytical models for improved multi-channel structures are described, which can further guide the design of on-chip routers. Guided by the analytical results, the on-chip network micro-architecture for the multi-core processor is designed and implemented.
     5. We further present a novel dynamic virtual channel architecture with a congestion awareness scheme to address low buffer utilization and reduce blocking. By modifying a previous high-speed router, we complete the VLSI implementation of a router with dynamic channels. The modified router regulates the channel organization according to traffic conditions, and it increases throughput and decreases latency while clearly saving silicon area and power.
     Extensive experiments were carried out. For the multimedia and signal processing domains, the optimized processor element offers high performance at low cost. The computing core is comparable in performance to the TI TMS320-C64 series DSP, and the overall processor element clearly accelerates multimedia applications. In addition, a communication structure with low latency and high throughput is presented, together with a measure for reducing hardware cost. These key techniques rest on a sound theoretical basis and can be applied directly to the design and implementation of future multi-core processors.
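Contribution 4 models an on-chip router with an M/G/1/N queuing system: Poisson arrivals, a general service-time distribution, one server, and room for at most N packets. The dissertation's analytical model is not reproduced in this abstract, so the sketch below only illustrates the queue itself with a small simulation that estimates the blocking probability and the mean delay of accepted packets. The function name simulate_mg1n, its parameters, and the example rates are assumptions made for illustration.

```python
import random
from collections import deque


def simulate_mg1n(arrival_rate, service_time, capacity, num_arrivals, seed=0):
    """Estimate (blocking probability, mean delay) of an M/G/1/N queue.

    arrival_rate: mean packet arrivals per time unit (Poisson process).
    service_time: callable taking a random.Random and returning one service time.
    capacity:     N, the packet in service plus those waiting.
    """
    rng = random.Random(seed)
    now = 0.0
    in_system = deque()  # departure times of packets still in the system, FIFO
    blocked = 0
    accepted = 0
    total_delay = 0.0
    for _ in range(num_arrivals):
        now += rng.expovariate(arrival_rate)  # exponential inter-arrival gap
        # Discard packets that departed before this arrival.
        while in_system and in_system[0] <= now:
            in_system.popleft()
        if len(in_system) >= capacity:
            blocked += 1                      # buffer full: packet is rejected
            continue
        # Service starts when the previous packet leaves, or now if idle.
        start = in_system[-1] if in_system else now
        finish = start + service_time(rng)
        in_system.append(finish)
        accepted += 1
        total_delay += finish - now
    return blocked / num_arrivals, total_delay / max(accepted, 1)
```

For example, `simulate_mg1n(0.5, lambda rng: rng.uniform(0.5, 1.5), 4, 100000)` estimates both metrics for a four-entry buffer at 50% offered load. A simulation of this kind is typically what an analytical model is checked against, since the analytical route is much faster to evaluate across many design points.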
引文
[1]. J. L. Hennessy, D. A. Patterson. Computer Architecture: A quantitative approach. Morgan Kaufman Publishers, 3rd edition, 2002.
    [2]. ITRS. International technology roadmap for semiconductors. 2006 update. Tech Report. ITRS, April 2007, http://public.itrs.net.
    [3]. D. Burger, J. R. Goodman. Billion-Transistor Architectures:There and Back Again [J]. IEEE Computer, 2004, 37(3): 22-28.
    [4]. S. Borkar. Thousand Core Chips-A Technology Perspective, Proceeding of Design Automation Conference, June 4-8, 2007, San Diego, California, USA : 746-780.
    [5]. J. M. Rabaey, A. Chandrakasan, B. Nikolic. Digital Integrated Circuits:A Design Perspective (2nd Edition).影印版.北京:清华大学出版社,2004.
    [6].胡伟武,李国杰,纳米级工艺对微处理器设计的挑战,信息技术快报,2008年第1期.
    [7]. Z. Zhu, Robust Dynamic Circuits with Low Power and High Performance for Nanometer CMOS Technologies, Ph.D. Thesis, 2005.
    [8]. P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities and new frontiers, ISSCC Tech. Digest, 2001: 22-25.
    [9]. G. Hinton. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1, 2001.
    [10]. C. McNairy, R. Bhatia. Montecito: A Dual-Core, Dual-Thread Itanium Processor, IEEE Micro, vol. 25, no. 2, Mar/Apr, 2005. Page(s):10-20.
    [11]. J. M. Tendler, et al. IBM POWER4 System Microarchitecture. IBM Journal of Research and Development 46(1): 5-26, 2002.
    [12]. C. E. Kozyrakis, D. A. Patterson. A new direction for computer architecture research, IEEE Computer Volume 31, Issue 11, Nov 1998. Page(s):24-32.
    [13]. D. Talla, L. K. John. Execution characteristics of multimedia applications on a PentiumII processor, Proceeding of International Performance, Computing, and Communications, Feb. 2000, Page(s):516 - 524.
    [14]. M. Rixner. Stream Processor Architecture, Kluwer Academic Publishers. Boston, MA, 2001.
    [15]. C. Lee, M. Potkonjak, W. H. Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. Proc IEEE/ACM Sypm. On Microarchitecture[C]. 1997. Page(s): 330-335.
    [16]. H. Liao, A. Wolfe. Available Parallelism in Video Applications. Proc IEEE/ACM International Sypm. On Microarchitecture, 1997. Page(s):321-329.
    [17]. J. Corbal, et.a1. DLP + TLP processors for the next generation of mediaworkloads.Proceedings of 7th International Sypm. On HPCA, 2001. Page(s): 219-228.
    [18]. R. B. Lee. Subword parallelism with max2. IEEE Micro, 1996, 16(4): 51-59.
    [19]. NOMADIK(TM) Open Multimedia Platform for Next-generation Mobile Devices. http://eu.st.com/stoneline/books/ascii/docs/9036.htm.
    [20]. M. Kim, S. Banerjee. Design space exploration of real-time multi-media MPSoCs with heterogeneous scheduling policies, Proceedings of 4th international conference on Hardware/software codesign and system synthesis, Seoul, Korea, 2006. Page(s):16-21.
    [21]. M. Ohmacht, D. Hoenicke. The eDRAM based L3-Cache of BlueGene/L Supercomputer Processor Node, Proceedings of 16th Symposium on Computer Architecture and High Performance Computing, 2004. Page(s): 18-22.
    [22].安虹.高效能通用微处理器芯片关键技术途径探讨,信息技术快报,2004年第12期.
    [23]. G. J. M. Smit, A. B. J. Kokkeler, P. T. Wolkotte. Multi-core architectures and streaming applications, Proceedings of the international workshop on System level interconnect prediction, Newcastle, UK. April 2008. Page(s): 35-42.
    [24]. J. C. Chu, W. C. Ku, S. H. Chou, T. F. Chen, J. I. Guo, An embedded coherent- multithreading multimedia processor and its programming model, Proceedings of 44th annual conference on Design automation, June 2007.
    [25]. S. Moch. HIBRID-SOC: a multi-core architecture for image and video applica- tions, ACM SIGARCH Computer Architecture News, June 2004.
    [26]. T. Saidani. Parallelization schemes for memory optimization on the cell process- sor: a case study of image processing algorithm, Proceedings of the 2007 workshop on Memory performance, September 2007.
    [27]. A. Wahyudi, A. Omondi. Parallel multimedia processor using customised Infineon TriCores. Proceedings of Euromicro Symposium on Digital System Design, Sept. 2002.
    [28]. R. Nevada, High-performance ethernet-based communications for future multi- core processors, Proceedings of the 2007 ACM/IEEE conference on Super- computing, Reno, Nevada, 2007.
    [29]. W. R. Zhu, V. C. Sreedhar. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures, Proceedings of 34th international symposium on Computer architecture, San Diego, California, USA, 2007. Page(s): 35-45.
    [30]. AMBA specification, Rev. 2.0. ARM Corporation. www.arm.com, 1999.
    [31]. WISHBONE System-on-Chip Interconnect Architecture for Portable IP Cores. Silicore Corporation. www.silicore.net, 2001.
    [32]. R. Kumar, V. Zyuban. Interconnections in Multi-Core Architectures: Under-standing Mechanisms, Overheads and Scaling, Proceedings of the 32nd annual international symposium on Computer Architecture, Page(s): 408-419, Washing- ton, DC, USA, 2005.
    [33]. L. Benini, G. D. Micheli. Networks on Chips: A New Soc Paradigm. IEEE Transactions on Computers, Vol. 35(1), January, 2002. Page(s): 70-78.
    [34]. M. Dehyadgari, M. Nickray. A New Protocol Stack Model for Network on Chip, Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, January, 2006.
    [35]. N. E. Jerger, L. S. Peh, M. Lipasti. Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support, Proceedings of 35th International Symposium on Computer Architecture, Beijing, China, 2008. Pages: 229-240.
    [36]. Z. Zhang, A. Greiner, S. Taktak. A reconfigurable routing algorithm for a fault-tolerant 2D-Mesh Network-on-Chip, Proceedings of 45th annual conference on Design automation, Anaheim, California, June, 2008. Pages: 441-446.
    [37]. M. Li, Q. A. Zeng, W. B. Jone. DyXY - A Proximity Congestion-Aware Deadlock-Free Dynamic Routing Method for Network on Chip, Proceedings of 43th annual conference on Design automation, July 24–28, 2006, San Francisco, California, USA.
    [38]. J. C. Hu, R. Marculescu. DyAD - smart routing for networks-on-chip. In Proc. Design Automation Conference, San Diego, USA, 2004. Pages: 260–263.
    [39]. M. T. Hemani, A. Kumar, S. Ellervee. Globally Asynchronous Locally Synchronous Architecture for Large High Performance ASICs. Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS). Orlando, USA, June, 1999. Vol. 2. Pages: 512-515.
    [40]. J. Muttersbach, T. Villiger, V. Fichtner. Practical Design of Globally- Asynchronous Locally-Synchronous Systems. Proceedings of the 6thInternational Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC). Eilat, Israel, April, 2000. Pages: 52-59.
    [41]. P. Abad, V. Puente, A. Gregorio. Rotary router: an efficient architecture for CMP interconnection networks, Proceedings of the 34th annual international symposium on Computer architecture, San Diego, California, USA, 2007. Pages: 116-125.
    [42]. J. Kim, C. Nicopoulos, D. Park. A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, Proceedings of 33rd annual international symposium on Computer Architecture, June, 2006. Pages: 4-15.
    [43]. R. Teodorescu, J. Torrellas. Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors, Proceedings of 35th International Symposium on Computer Architecture, Beijing, China, 2008. Pages: 363-374.
    [44]. C. Lee, J. K. Lee, T. T. Hwang. Compiler optimization on VLIW instructionscheduling for low power, ACM Transactions on Design Automation of Electronic Systems, Volume 8, Issue 2, 2003. Pages: 252-268.
    [45]. J. Pangjun, S. S. Sapatnekar. Low-Power Clock Distribution Using Multiple Voltages and Reduced Swings, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 10, NO. 3, JUNE 2002.
    [46]. J. Oliver, R. Rao, D. Franklin. Synchroscalar: Evaluation of an embedded, multi-core architecture for media applications, Journal of Embedded Computing archive, Volume 2, Issue 2, April 2006. Pages: 157-166.
    [47]. J. A. Kahle, et al. Introduction to the Cell multiprocessor, IBM J. RES. & DEV., VOL 49, NO.4/5, July/September 2005.
    [48]. L. Yang, D. Gao, J. Mostoufi. System design methodology of ultraSPARC-I, Proceedings of the 32nd ACM/IEEE conference on Design automation, San Francisco, California, United States, 1995. Pages: 7-12.
    [49]. J. H. Ahn, W. J. Dally, B. Khailany. Evaluating the Imagine Stream Architecture, Proceedings of 31st annual international symposium on Computer architecture, Munchen, Germany, 2004.
    [50]. OMAP5910 Dual Core Processor– Technical Reference Manual, Texas Instru- ments Inc., August, 2004.
    [51]. Y. Nishikawa, M. Koibuchi, M. Yoshimi. Performance Improvement Methodology for ClearSpeed's CSX600. Proceedings of the 2007 International Conference on Parallel Processing, 2007.
    [52]. http://www.clearspeed.com.
    [53]. B. Sinharoy, R. N. Kalla, J. M. Tendler. POWER5 System microarchitecture, IBM Journal of Research and Development, Volume 49, Issue 4/5, July 2005. Pages: 505-521.
    [54]. Y. N. Patt, S. J. Patel, D. H. Friendly, J. Stark. One Billion Transistors, One Uniprocessor, One Chip, IEEE Computer, Sept. 1997. Pages:51- 57.
    [55]. K. C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, pp.28-40, Volume 16, Issue 2, April 1996.
    [56]. T. Shanley. Pentium Pro and Pentium II system architecture (2nd ed.), Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 1998.
    [57]. TMS320C64x CPU and Instruction Set Reference Guide, Texas Instruments, Inc, USA, 2000.
    [58]. M. S. Schlansker, B. R. Rau. EPIC: Explicitly Parallel Instruction Computing, IEEE Computer, Feb. 2000, 33(2): 37-45.
    [59]. J. Huck, D. Morris, et al. Introducing the IA-64 architecture, IEEE Micro, Sept.-Oct. 2000, 20(5):12-23.
    [60]. H. Corporaal. Transport Triggered Architectures: Design and Evaluation. PhD thesis, Delft Univ. of Technology, September 1995.
    [61]. H. Corporaal. Microprocessor Architecture from VLIW to TTA, John Wiley & Sons Ltd, West Sussex, England, 1998.
    [62]. C. E. Kozyrakis, et al. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer Special Issue: Future Microprocessors - How to use a Billion Transistors, September 1997.
    [63]. I. Mavroidis. A Low Power 200 MHz Multiported Register File for the Vector-IRAM Chip, University of California at Berkeley, CA, USA, Technical Report, 2001.
    [64]. R. B. Lee.Subword parallelism with max-2[J].IEEE Micro,1996,16(4):51-59.
    [65]. S. Thakkar, T. Huff. Internet Streaming SM D Extensions. Cornputer, 1999, 32(12):26-34.
    [66]. J. Tiler, et a1. AltiVecTM: Bringing vector technology to the PowerPC processor family. IEEE Int. Conf. on Performance, Computing and Communication, 1999. Pages: 437-444.
    [67]. http://www.embeddedinsight.com/pdf/MIPS3Dback.pdf.
    [68]. Enhanced 3DNow! Technology for the AMD AthlonTM Processor. http://www3 pub .amd.com/products/cpg/athlon/3dnowwp.html.
    [69]. The Intel Pentium4 processor product overview. http:// www.inte1.com/design/ Pentium4/ prodbref/~streaming.
    [70]. R. Halstead, T. Fujita. MASA: a multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th annual International Symposium on Computer Architecture, May-June, 1988. Pages: 443-451.
    [71]. C. Hansen. MicroUnity’s MediaProcessor architecture, IEEE Micro, (16): 34-41, August 1996.
    [72]. J. Kreuzinger, T. Ungerer. Context-switching techniques for decoupled multithreaded processors. In Proceedings of the 25th Euromicro Conference, Sep.1999. Pages: 248-251.
    [73]. W. Grunewald, T. Ungerer. Towards extremely fast context switching in a block multithreaded processor. In Proceedings of the 22th Euromicro Conference, Sep. 1996. Pages: 592-599.
    [74]. M. Thistle, B. Smith. A processor architecture for Horizon. In Proceedings of Supercomputing Conference, Nov. 1988. Pages: 35-41.
    [75]. R. Alverson, et al. The Tera computer system. In Proceedings of Super- computing Coference, June 1990. Pages: 1-6.
    [76]. A. Agarwal, et al. Sparcle: an evolutionary processor design for large-scale multiprocessors. IEEE Micro, June 1993, 13, Pages: 48-61.
    [77]. A. Mikschl, W. Damm. Msparc: a multithreaded Sparc. Lecture Notes in Computer Science, 1123, Pages: 461-469.
    [78]. Intel Xeon Processor Overview. http://www.intel.com/products/ processor/xeon/ 157index.htm.
    [79]. G. Hinton, D. Sager, et al. The microarchitecture of the pentium4 processor. Intel Technical Journal, Q1 2001 Issue, Feb. 2001.
    [80].陈书明,李振涛,万江华,胡定磊,“银河飞腾"高性能数字信号处理器研究进展,计算机研究与发展,43(6):993~1000,2006.
    [81].万江华,陈书明, MOSI:一种基于超长指令字处理器的同时多线程微体系结构,计算机学报,Vol.29. pp.378-383, 2006.
    [82].沈立,王志英,鲁建壮,戴葵,基于控制流的混合指令预取,电子学报, 08期, pp.1141-1144, 2003.
    [83]. T. M. Aamodt, P. Marcuello, P. Chow. A framework for modeling and optimization of prescient instruction prefetch, Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, San Diego, CA, USA, 2003. Page(s):13– 24.
    [84]. T. M. Aamodt, P. Chow, P. Hammarlund, H. Wang. Hardware Support for Prescient Instruction Prefetch, Proceedings of 10th International Sympo- sium on High Performance Computer Architecture, 2004.
    [85]. K. Oner, M. Dubois. Effects of memory latencies on non-blocking processor/ cache architectures, Proceedings of 7th international conference on Super- computing, Tokyo, Japan, 1993. Page(s):338-347.
    [86]. S. Belayneh, D. R. Kaeli. A discussion on non-blocking/lockup-free caches, ACM SIGARCH Computer Architecture News, pp.18-25, Volume 24, Issue 3, June 1996.
    [87]. P. P. Chu, R. Gottipati. Write Buffer Design for On-Chip Cache. Proceedings of IEEE International Conference on Computer Design: VLSI in Computer & Processors, pp.311-316, 1994.
    [88]. T. Chen, H. B. Lin, T. Zhang. Orchestrating data transfer for the cell/B.E. processor, Proceedings of the 22nd annual international conference on Super- computing, 2008. Page(s):289-298.
    [89]. J. A. Fisher, S. M. Freudenberger. Predicting conditional branch directions from previous runs of a program. In 5th Int. Conf. Architectural Support for Program- ming Languages and Operating Systems, 1992.
    [90]. S. Pan, K. So, J. Rahmeh. Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 1992, Page(s):76-84.
    [91]. C. Yound, M. Smith. Improving the accuracy of static branch prediction using branch correlation. In Proceedings of the 6thInternational Conference on Architectural Support for Programming Languages and Operating Systems, October 1994. Page(s): 232-241.
    [92]. S. McFarling. Combining branch predictors. Digital Equipment Corporation, WRL Tech. Note TN-36, 1993.
    [93]. http://www.simplescalar.com/
    [94]. T. D Brooks, et.al. Watch: A framework for architectural-level power analysis and optimizations. In the Proceedings of the 27th International Symposium on Computer Architecture, 2000.
    [95]. G. J. Hekstra, P. B. G. D. La Hei, F.W. Sijstermans. TriMedia CPU64 design space exploration. In proceedings of ICCD, Oct, 2001.
    [96]. S. Mohanty, V. K. Prasanna, S. Neema, J. Davis. Rapid design space exploration for heterogeneous embedded systems using symbolic search and multi-granular simulation. In LCTES-SCOPES, June 2002. Page(s): 18–27.
    [97]. S. Eyerman, L. Eeckhout, K. D. Bosschere. Efficient Design Space Exploration of High Performance Embedded Out-of-Order Processors, Design, Automation and Test in Europe, 2006. Page(s): 351-256.
    [98]. A. Jaszkiewicz. Multiple Objective Metaheuristic Algorithms for Combinatorial Optimization. PhD thesis, Poznan University of Technology, Poland, 2001.
    [99]. G. Ascia, V. Catania, M. Palesi, D. Patti. A System-level Framework for Evaluating Area/Performance/Power Trade-offs of VLIW-based Embedded Systems, Proceeding of ASP-DAC, 2005.
    [100]. E. Perelman, G. Hamerly, B. Calder. Picking Statistically Valid and Early Simulation Points. International Conference on Parallel Architectures and Compilation Techniques, 2003. Page(s): 244-255.
    [101]. R. E. Wunderlich, T. F. Wenisch, B. Falsafi. SMARTS: accelerating micro- architecture simulation via rigorous statistical sampling. International Sympo- sium on Computer Architecture, 2003. Page(s): 84-97.
    [102]. L. Thiele, S. Chakraborty, M. Gries, S. Kunzli. Design space exploration of network processor architectures. Network Processor Design: Issues and Practices, Vol. 1, Morgan Kaufmann Publishers, 2002. Page(s): 55-89.
    [103]. M. Gries, C. Kulkarni, C. Sauer, K. Keutzer. Comparing analytical modeling with simulation for network processors: A case study, in: Design, Automation and Test in Europe (DATE), Munich, Germany, 2003. Page(s): 256-261.
    [104]. S. Chakraborty, S. K unzli, L. Thiele, A. Herkersdorf, P. Sagmeister. Perfor- mance evaluation of network processor architectures: Combining simulation with analytical estimation, Computer Networks, Elsevier Science 41(5)641-665. 2003.
    [105]. T. S. Karkhanis, J. E. Smith. Automated Design of Application Specific Superscalar Processors: An Analytical Approach, Proceedings of the 34th annual international symposium on Computer architecture, June 9–13, 2007, San Diego, California, USA. Page(s): 402-411.
    [106]. P. J. Joseph, K. Vaswani, M. J.Thazhuthaveetil. Construction and use of linearregression models for processor performance analysis, High-Performance Computer Architecture, 2006. The Twelfth International Symposium on Volume , Issue , 11-15 Feb. 2006 Page(s): 99-108.
    [107]. B. C. Lee, D. M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction, Proceedings of the 12th international conference on Architectural support for programming langu- ages and operating systems, October, 2006. Pages: 185-194.
    [108]. B. C. Lee, D. M. Brooks. Illustrative Design Space Studies with Microarchitectural Regression Models. Proceeding of High Performance Computer Architecture, Page(s):340– 351, Feb. 2007.
    [109]. M. J. Irwin, M. Kandemir, N. Vijaykrishman. SimplePower: A Cycle-Accurate Energy Simulator. In Proc. of IEEE Computer Society Technical Committee on Computer Architecture, 2001.
    [110]. G. Cai, C. H. Lim. Architectural level power/performance optimization and dynamic power estimation. Cool Chips Tutorial colocated with MICRO32, November 1999.
    [111]. M. H. Rashid. SPICE for power electronics and electric power, Prentice-Hall, Inc. Upper Saddle River, NJ, USA, 1993.
    [112]. Synopsys: Prime Power, Full-Chip Dynamic Power Analysis for Multimillion- Gate Designs, 2004.
    [113]. J. M. Rabaey, A. Chandrakasan, B. Nikolic. Digital Integrate Circuits:A Design Perspective (2nd Edition).北京:清华大学出版社2004.
    [114]. R. A. Bergramaschi. The A to Z of SoCs. Proceedings of the IEEE/ACM International Conference on Computer-aided Design (ICCAD). pp. 790-798. San Jose, California, USA, November, 2002.
    [115]. M. B. Taylor, et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Transactions on Micro, pp.25-35, Vol. 22(2), 2002.
    [116]. K. Sankaralingam, R. Nagarajan, H. Liu. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA). San Diego, California, USA, June, 2003, pp.422-433.
    [117]. P. Guerrier, A. Greiner. A Generic Architecture for On-Chip Packet-Switched Interconnections. Proceedings of Design, Automation and Test in Europe (DATE), pp.250-256. Paris, France, March, 2000.
    [118]. A. Andriahantenaina, A. Greiner. Micro-network for SoC: Implementation of a 32-port SPIN network. In DATE, Mar. 2003, pp. 1128–1129.
    [119]. E. Rijpkema, et al. Trade offs in the design of a router with both guaranteed and best-effort services for network on chip. IEEE Proc. Computers and DigitalTechniques, vol. 150, no. 5, pp. 294-302, Sep. 2003.
    [120]. M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini. Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multi- Processor SoCs. Proceedings of the 21st International Conference on Computer Design (ICCD), pp.536-539, San Jose, CA, USA, October, 2003.
    [121]. T. Bjerregaard, S. Mahadevan, R. G. Olsen, J. Sparso. An OCP Compliant Network Adapter for GALS-based SoC Design Using the MANGO Network-on-Chip. Proceedings of Symposium on System-on-Chip (SoC), pp.171-174, Tampere, Finland, November, 2005.
    [122]. L. Benini, G. D. Micheli. Networks on Chips: A New Soc Paradigm. IEEE Transactions on Computers, pp.70-78, Vol. 35(1), January, 2002.
    [123]. A. Hemani. Network on a Chip: An Architecture for Billion Transistor Era. Proceedings of the 18th IEEE Norchip Conference, pp.166-173. Turku, Finland, November, 2000.
    [124]. A. Jantsch, H. Tenhunen. Networks-on-Chip. Norwell, MA: Kluwer, 2003.
    [125]. S. Pawlowski. Petascale Computing Research Challenges - A Many core Perspective. IEEE HPCA 2007, Feb.10-12, 2007, pp.3–6.
    [126]. J. Parkhurst. From single core to multi-core to many core: are we ready for a new exponential, Proceedings of the 16th ACM Great Lakes symposium on VLSI, Philadelphia, PA, USA, 2006.
    [127]. W. Huang, M. R. Stant. Many-core design from a thermal perspective, Proceedings of the 45th annual conference on Design automation, pp.746-749, Anaheim, California, 2008.
    [128]. M. Harris. Many-core GPU computing with NVIDIA CUDA, Proceedings of the 22nd annual international conference on Supercomputing, Island of Kos, Greece, 2008.
    [129]. T. T. Ye. On Chip multiprocessor communication network design and analysis, Doctor Thesis, Stanford Univeristy, 2003.
    [130]. K. M. Lee, S. J. Lee. Low-power network-on-chip for high-performance SoC design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp.148-160, Volume 14, Issue 2, February 2006.
    [131]. T. Bjerregaard, S. Mahadevan. A survey of research and practices of Network-on-chip. ACM Computing Surveys,Volume 38, Issue 1, 2006.
    [132]. E. Beigne, et al. An asynchronous NOC architecture providing low latency service and its multi-level design framework. In ASYNC, Mar. 2005, pp.54–63.
    [133]. R. Mullins, A. West, S. Moore. The design and implementation of a low- latency on-chip network. In ASP-DAC, Jan. 2006.
    [134]. M. Millberg, R. T. E. Nilsson, A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip.In DATE, Feb. 2004, pp. 890–895.
    [135]. S. Penolazzi, A. Jantsch. A high level power model for the Nostrum NoC. In DSD, Aug. 2006, pp. 673–676.
    [136]. H. C. Chi, J. H. Chen. Design and implementation of a routing switch for on-chip interconnection networks. In AP-ASIC, Aug. 2004, pp. 392–395.
    [137]. A. Janarthanan, K. A. Tomko. MoCSYS: A multi-clock hybrid two-layer router architecture and integrated topology synthesis framework for system-level design of fpga based on-chip networks. 21st International Conference on VLSI Design (VLSI Design 2008) pp. 397-402.
    [138]. S. Vangal, et al. An 80-tile Sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2007.
    [139]. M. B. Taylor. Design Decisions in the Implementation of a Raw Architecture Workstation. MS Thesis, Cambridge, MA, September, 1999.
    [140]. M. Hosseinabady, M. Kakoee, J. Mathew, D. Pradhan. Reliable network- on-chip based on generalized de Bruijn graph. In HLVDT, Nov. 2007, pp.3–10.
    [141]. P. P. Pande, C. Grecu, et.al. Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures, IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 8, AUGUST 2005.
    [142]. S. KUMAR, et.al. A network-on-chip architecture and design methodology. In Proceedings of the Computer Society Annual Symposium on VLSI (ISVLSI). IEEE Computer Society, 117–124, 2002.
    [143]. K. M. AL-TAWIL, et.al. A survey and comparison of wormhole routing tech- niques in a mesh networks. IEEE Network, 38–45. 1997.
    [144]. A. V. DE MELLO, L. C. OST, et.al. Evaluation of routing algorithms on mesh based nocs. Tech. rep., Faculdade de Informatica PUCRS, Brazil. 2004.
    [145]. W. J. Dally. Virtual Channel Flow Control. IEEE Transaction on Parallel and Distributed Systems, pp.194-205, Vol. 3(2), 1992.
    [146]. M. Millberg, E. Nilsson, R. Thid, S. Kumar. The Nostrum Backbone - A Communication Protocol Stack for Networks on Chip. Proceedings of the 17th VLSI Design Conference, pp. 693-696, Mumbai, India, January, 2004.
    [147]. P. Gupta. Design and Implementing a Fast Crossbar Scheduler. IEEE Transaction on Micro, pp.20-28, Vol. 19(1), 1999.
    [148]. E. Shin. Round-Robin Arbiter Design and Generation. Proceedings of the 15th IEEE International Symposium on System Synthesis (ISSS), pp.243-248. Tokyo, Japan, October, 2002.
    [149]. M. J. Karol, M. G. Hluchyj. Input Versus Output Queueing on a Space-Division Packet Switch. IEEE Trans. on Communications, Dec., 1987.
    [150]. E. Rijpkema, K. Goossens, P. Wielage. A Router Architecture for Networks on Silicon. Proceedings of Progress 2001, 2nd Workshop on Embedded Systems, pp. 181-188, November, 2001.
    [151]. R. Kumar. Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling. Proceedings of the 32nd Annual International Symposium on Computer Architecture, pp. 408-419, 2005.
    [152]. X. Chen, L. Peh. Leakage Power Modeling and Optimization in Interconnection Networks. ISLPED, Aug. 2002.
    [153]. J. Hu, U. Y. Ogras. System-Level Buffer Allocation for Application-Specific Networks-on-Chip Router Design. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Dec, 2006.
    [154]. T. C. Huang, U. Y. Ogras. Virtual Channels Planning for Networks-on-Chip. Proceedings of the 8th International Symposium on Quality Electronic Design, 2007.
    [155]. G. Y. Chen, F. H. Li, M. Kandemir. Compiler-Directed Channel Allocation for Saving Power in On-Chip Networks. POPL’06.
    [156]. C. A. Nicopoulos, D. Park, J. Kim. ViChaR: A Dynamic Virtual Channel Regulator for Network-on-chip Router. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
    [157]. T. Taha, D. S. Wills. An Instruction Throughput Model of Superscalar Processors. International Workshop on Rapid Systems Prototyping, 2003, pp. 156-163.
    [158]. M. Ohmacht, D. Hoenicke. The eDRAM based L3-Cache of the BlueGene/L Supercomputer Processor Node. Proceedings of 16th Symposium on Computer Architecture and High Performance Computing, pp. 18-22, 2004.
    [159]. J. J. Guo, M. C. Lai. Hierarchical memory system design for a heterogeneous multi-core processor. Proceedings of the 2008 ACM symposium on Applied computing, pp. 1504-1508, 2008.
    [160]. A. Artieri, V. D’Alto, R. Chesson. Open Multimedia Platform for Next-generation Mobile Devices, STMicroelectronics. Technical Article TA305, 2003.
    [161]. H. J. Stolberg, M. Berekovic, et al. HiBRID-SoC: A Multi-Core SoC Architecture for Multimedia Signal Processing. Journal of VLSI Signal Processing Systems, Volume 41, Issue 1, August 2005.
    [162]. S. Dutta, R. Jensen, A. Rieckmann. A Multiprocessor SOC for Advanced Set-Top Box and Digital TV Systems. IEEE Design and Test of Computers, September/October 2001, pp. 21-37.
    [163]. G. J. M. Smit, A. B. J. Kokkeler, et al. Multi-core architectures and streaming applications. Proceedings of international workshop on System level interconnect prediction, pp. 35-42, Newcastle, United Kingdom, 2008.
    [164]. H. Shikano, M. Ito, K. Uchiyama, T. Odaka. Software-cooperative power-efficient heterogeneous multi-core for media processing. Proceedings of the conference on Asia and South Pacific design automation, pp. 736-741, Seoul, Korea, 2008.
    [165]. The SPARC Architecture Manual Version 8, http://www.sparc.com/standards/V8.pdf.
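    [166]. H. Yue, K. Dai, Z. Y. Wang. Research on key techniques for implementing floating-point functional units based on the CORDIC algorithm. Proceedings of the 9th National Annual Conference on Engineering and Technology, pp. 102-105, Aug. 2005. (岳虹,戴葵,王志英.基于CORDIC算法实现浮点功能部件的关键技术研究.第九届全国工程与工艺年会论文集. 102-105, 2005.08.)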
    [167]. H. Corporaal, J. Janssen, M. Arnold. Computation in the Context of Transport Triggered Architectures. International Journal of Parallel Programming. Volume 28, 401-427. August 2000.
    [168]. P. Mishra, P. Grun, N. Dutt. Processor-memory Coexploration Using an Architecture Description Language. ACM Transactions on Embedded Computing Systems, February 2004, Pages 140-162.
    [169]. TMS320C64x DSP Library Programmer's Reference [R], Texas Instruments Inc., Apr 2002.
    [170]. C. Lee, M. Potkonjak, W. H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. Proceedings of 30th annual ACM/IEEE international symposium on Microarchitecture, pp. 330-335, North Carolina, United States, 1997.
    [171]. C. Plessl, R. Enzler, H. Walder, J. Beutel, M. Platzner, L. Thiele. Reconfigurable hardware in wearable computing nodes, In Proc. 6th Int. Symp. on Wearable Computers (ISWC), pp. 215–222, 2002.
    [172]. C. Plessl, R. Enzler, H. Walder, J. Beutel, M. Platzner, L. Thiele, et al. The case for reconfigurable hardware in wearable computing. Personal and Ubiquitous Computing, 7(5):299–308, Oct. 2003.
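    [173]. L. B. Huang, M. C. Lai, et al. Hardware Support for Arithmetic Units of Processors with Multimedia Extension. Proceedings of the International Conference on Multimedia and Ubiquitous Engineering (MUE), 2007.
    [174]. L. B. Huang, H. Yue, H. Y. Lu, K. Dai. Design and implementation of a high-performance subword-parallel multiplier. Computer Engineering and Applications, 2007, 43(20): 104-106. (黄立波,岳虹,陆洪毅,戴葵.一种高性能子字并行乘法器的设计与实现.计算机工程与应用, 2007, 43(20): 104-106.)
    [175]. X. M. Zhao, Z. Y. Wang. TTA-EC: A Whole Algorithm Processor for ECC Based on Transport Triggered Architecture. Chinese Journal of Computers, 2007, 30(2), pp. 225-233.
    [176]. Y. Li, K. Dai, et al. ASIP design guided by the configuration-stream-driven computing architecture. Journal of Computer Research and Development, 2007, no. 4, pp. 714-721. (李勇,戴葵等.配置流驱动计算体系结构指导下的ASIP设计.计算机研究与发展, 2007年, 4期, 714-721.)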
    [177]. P. Kuukkanen, J. Takala. Bitwise and dictionary modeling for code compression on transport triggered architectures. WSEAS Transactions on Circuits and Systems: 1750-1755, 2004.
    [178]. J. Heikkinen, T. Rantanen, A. G. M. Cilio, J. Takala. Evaluating Template-Based Instruction Compression on Transport Triggered Architectures. IWSOC 2003: 192-195.
    [179]. J. Heikkinen, A. Cilio, J. Takala, H. Corporaal. Dictionary-Based Program Compression on Transport Triggered Architectures. In Proc. IEEE Int. Symp. on Circuits and Systems, Kobe, Japan, May 23-26, 2005, pp. 1122-1125.
    [180]. S. Aditya, B. R. Rau, R. C. Johnson. Automatic design of VLIW and EPIC instruction formats. Technical Report HPL-1999-94, Hewlett-Packard Laboratories, 2000.
    [181]. Y. Xie, W. Wolf, H. Lekatsas. Code Compression for VLIW Processors Using Variable-to-fixed Coding. In ISSS’02, October 2–4, 2002, Kyoto, Japan, pp. 138-143.
    [182]. S. J. Nam, I. C. Park. Improving dictionary-based code compression in VLIW architectures. IEICE Trans. Fundamentals of Electronics, Commun. and Comput. Sciences, E82-A(11): 2318–2324, Nov. 1999.
    [183]. H. Lekatsas, W. Wolf. SAMC: A code compression algorithm for embedded processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(12):1689–1701, 1999.
    [184]. W. Dally, B. Towles. Route Packets, Not Wires: On-Chip Interconnection Networks. Proceedings of the 38th Design Automation Conference (DAC), pp.684-689, Las Vegas, NV, USA, June, 2001.
    [185]. C. Grecu, M. Jones. Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures. IEEE Transactions on Computers, pp. 1025-1040, Volume 54, Issue 8, August 2005.
    [186]. V. Nollet, T. Marescaux. Operating-system controlled network on chip. Proceedings of the 41st annual conference on Design automation, pp. 256-259, San Diego, CA, USA, 2004.
    [187]. D. Atienza, F. Angiolini. Network-on-Chip design and synthesis outlook (Invited paper), Integration, the VLSI Journal, pp.340-359, Volume 41, Issue 3, May 2008.
    [188]. Z. W. Chang, X. N. Xie, N. Sang, G. Z. Xiong. An improved tabu search algorithm for the network-on-chip mapping problem. Journal of Computer-Aided Design & Computer Graphics, pp. 155-160, Vol. 20, No. 2, 2008. (常政威,谢晓娜,桑楠,熊光泽.片上网络映射问题的改进禁忌搜索算法.计算机辅助设计与图形学学报, pp.155-160, Vol.20, No.2, 2008.)
    [189]. T. Kogel, M. Doerper. A modular simulation framework for architectural exploration of on-chip interconnection networks. Proceedings of IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pp. 7-12, Newport Beach, CA, USA, 2003.
    [190]. W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, Vol. 39, No. 6, pp. 775-785, June 1990.
    [191]. V. S. Adve. Performance analysis of mesh interconnection networks with deterministic routing. IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 3, pp. 225–246, Mar. 1994.
    [192]. A. Agarwal. Limits on interconnection network performance. IEEE Trans. on Parallel and Distributed Systems, Vol. 2, No. 4, pp. 398–412, Oct. 1991.
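    [193]. G. F. Hou, Y. L. Yang. Performance analysis of super recursive baseline interconnection networks. Computer Science, Vol. 28, No. 10, pp. 85-88, 2001. (侯国峰,杨愚鲁.超级递归基准互连网络性能分析.计算机科学,第28卷10期:85-88, 2001.)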
    [194]. U. Y. Ogras, R. Marculescu. Analytical router modeling for networks-on-chip performance analysis. Proceedings of Design, Automation and Test in Europe Conference (DATE) [C], pp. 1096-1101, Acropolis, Nice, France, 2007.
    [195]. C. L. Lu. Queuing theory [M]. Beijing: Beijing University of Posts and Telecommunications Press, 1994. (陆传赉.排队论[M].北京:北京邮电学院出版社,1994.)
    [196]. A. Frey, Y. Takahashi. A note on an M/G/1/N queue with vacation time and exhaustive service discipline. Operations Research Letters, 1997, 21(2): 95-100.
    [197]. M. Hassan, R. Sarker, M. Atiquzzaman. Modeling IP-ATM gateway using M/G/1/N queue. Proceedings of IEEE Global Telecommunications Conference [C], pp. 465-470, Sydney, 1998.
    [198]. NIRGAM. A Simulator for NoC Interconnect Routing and Application Modeling. http://nirgam.ecs.soton.ac.uk/, 2007.
    [199]. M. Galles. Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip. In Proc. Hot Interconnects 4, pp. 141-146, Aug. 1996.
    [200]. Y. Li, S. Panwar, H. J. Chao. The dual Round-Robin matching switch with exhaustive service. In: Gunner C, ed. Proc. of the IEEE Workshop on High Performance Switching and Routing. Kobe: IEEE Communications Society, 2002. 58-63.
    [201]. N. McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE Trans. on Networking, 1999, 7(2): 188–201.
    [202]. Y. Li, S. Panwar, H. J. Chao. On the performance of a dual Round-Robin switch. In: Ammar M, ed. Proc. of the IEEE INFOCOM. Anchorage: IEEE Communications Society, 2001. 1688–1697.
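    [203]. X. Y. Liu. Research on key techniques of network-on-chip for multi-core SoC. Doctoral dissertation in engineering, 2007. (刘祥远.多核SoC片上网络关键技术研究.工学博士学位论文, 2007.)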
    [204]. F. Mondinelli, M. Borgatti, et al. A 0.13um 1Gb/s/channel Store-and-Forward Network on-Chip. IEEE international SOC conference. Phoenix, Arizona, USA, April 19-21, 2004.
    [205]. H. C. Chi, J. H. Chen. Design and Implementation of a Routing Switch for On-Chip Interconnection Networks. Proceedings of IEEE Asia-Pacific Conference on Advanced System Integrated Circuits, pp. 392–395, Aug. 2004.
    [206]. S. H. Hsu, J. M. Jou, M. C. Lee, C. M. Sun. Design of a New Pipelined Router for NoC. Proceedings of Computer Symposium, December, 2005.
    [207]. S. Vangal, J. Howard, G. Ruhl, S. Dighe. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers, 11-15, Feb. 2007.
    [208]. R. Mullins, R. West, S. Moore. The Design and Implementation of a Low-Latency On-Chip Network. Proceedings of the 2006 conference on Asia South Pacific design automation, pp. 164-169, Yokohama, Japan, 2006.
    [209]. Y. Tamir, G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches. In Proceedings of Annual International Symposium on Computer Architecture, 1988.
    [210]. N. Ni, M. Pirvu, L. Bhuyan. Circular buffered switch design with wormhole routing and virtual channels. In Proceedings of the International Conference on Computer Design, 1998.
    [211]. M. Rezazad, et al. The effect of virtual channel organization on the performance of interconnection networks. Proceedings of IPDPS 2005, pp. 264-272.
    [212]. Z. Guz, et al. Efficient link capacity and QoS design for wormhole network-on-chip. Proceedings of the conference Design, Automation and Test in Europe (DATE), pp. 1-6, Munich, Germany, 2006.
