Research on Key Technologies of High-Performance Routers for Networks-on-Chip
Abstract
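As process technology scales down, a single chip will integrate hundreds of processor cores, and global wire delay keeps growing relative to gate delay. Conventional interconnects based on buses, dedicated wires, and crossbars are challenged by bandwidth, scalability, area, and global wire delay, and cannot meet the needs of on-chip interconnection. Networks-on-chip (NoC), with good scalability, predictable wire length and delay, high bandwidth, and reusability, have become a very promising on-chip interconnect structure. At the same time, applications demand low latency and high throughput from the on-chip interconnect. Networks have been studied deeply and extensively in parallel computing and the Internet, but an NoC differs in the following respects: router latency becomes the main component of network latency; wire resources are abundant; storage resources are limited; and power and area constraints are more severe. These differences are the foundation and starting point of NoC research, and the work in this thesis addresses them. The main contributions fall into the following five areas.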
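     1. Adaptive channel double buffer (CDB). The CDB (Channel Double Buffer) replaces the registers in a link to pipeline the link. Flits are transferred between CDBs, and between CDBs and routers, using a ready-valid handshake protocol. The link uses a local congestion control strategy: when the input buffer of the downstream router cannot accept a flit, the CDBs in the link buffer it. This effectively increases the capacity of the router input buffer. A delay model built on logical effort shows that the critical path delay is closely related to the physical link width, and that register overhead is an important component of the critical path delay. The number of pipeline stages of a CDB-based link is closely related to the wire type, wire length, and clock period. Compared with pipelining the link by inserting simple registers, CDB-based pipelining increases the number of pipeline stages, but the increase is not significant.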
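     2. DVOQR with dynamic buffer allocation, based on CDB. The DVOQR (Dynamic Virtual Output Queue Router) uses virtual output queues, look-ahead routing computation, dynamic buffer allocation, and a virtual address queue structure so that the UDB read, look-ahead routing computation, and crossbar allocation can proceed in parallel, which compresses the router pipeline to two clock cycles. Dynamic buffer allocation makes effective use of the limited on-chip buffer resources. Under random traffic, DVOQR reaches the same network throughput as a virtual channel router with one quarter of the buffer capacity. A delay model built on logical effort shows that the number of router ports has the more pronounced effect on the critical path delay. In a 4x4 Mesh network under random traffic, the throughput of DVOQR is 46.9% and 28.5% higher than that of a wormhole router and a virtual channel router, respectively. Even at the same input speedup, the throughput of DVOQR is still 1.9% higher than that of a virtual channel router with twice its input buffer capacity, and comparable to that of a virtual channel router with four times its input buffer capacity. Application simulation results show that the average latency of the DVOQ router, the wormhole router, and the virtual channel router is 6.6%, 50.9%, and 94.6% higher than that of an ideal router, respectively.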
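     3. BEA-BLESS, a low-area-overhead bufferless router based on encoding allocation. BEA-BLESS (Based on Encoding Allocation BufferLESS router) is a bufferless router that effectively reduces the chip area required by the NoC. FBEA-BLESS and PBEA-BLESS are optimized for flit switching and packet switching, respectively. Through its encoding allocation strategy, BEA-BLESS shortens the router critical path delay and raises the router operating frequency. FBEA-BLESS runs at twice the frequency of B-BLESS, and network livelock is avoided by the GoSS (Go-Stop-Steer) strategy. PBEA-BLESS eliminates the reorder buffer at the receiver at a small buffer area cost, and an improved GoSS strategy avoids network livelock and starvation. Simulation results for real applications show that the average network latency of BEA-BLESS is 29.4% lower than that of B-BLESS, and that the buffer capacity needed to support packet switching is only 33.3% of the reorder buffer capacity.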
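     4. Load-balanced multicast routers based on DVOQR. Drawing on the method used to build the network throughput model for unicast traffic, this thesis builds a network throughput model for multicast traffic and proposes two load-balanced multicast routing algorithms, BDOR (Balanced DOR) and MPDOR (Minimal Path DOR). SM-DVOQR (Supporting Multicast DVOQR) and SMDL-DVOQR (Supporting Multicast Double Lane DVOQR) are two DVOQR-based routers that support multicast efficiently. SM-DVOQR supports the XY and YX multicast routing algorithms. Using only XY or only YX multicast routing leaves the channel load unbalanced between the X and Y directions of the network, and the imbalance grows with network size. SMDL-DVOQR supports XY multicast routing on one lane and YX multicast routing on the other, and thereby implements the load-balanced BDOR and MPDOR multicast routing algorithms. Simulation results show that, in a Mesh network, increasing the number of local output ports of the router improves network performance, with an optimal value of 2 local ports, and that SMDL-DVOQR outperforms SM-DVOQR because it balances the network load.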
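     5. Leakage power optimization strategies for DVOQR. RTL-level power analysis of DVOQR shows that the storage elements in the router are the main consumers of leakage power, accounting for 85% of the total leakage power, and that at low network traffic leakage power is an important part of total router power. Adaptive buffer management and the two-entry-never-turned-off strategy are two leakage power optimization strategies for the router. Adaptive buffer management effectively reduces router leakage power, but at low network injection rates the wake-up latency of buffer entries is added to the average network latency. With a wake-up latency of Twakeup=1, look-ahead wake-up completely hides the wake-up latency, while the two-entry-never-turned-off technique tolerates larger wake-up latencies. At low injection rates, the two-entry-never-turned-off technique saves less leakage power than adaptive buffer management. At medium and high injection rates, the two strategies save almost the same leakage power.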
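The ready-valid handshake and local congestion control of item 1 can be illustrated with a small cycle-level model. This is only a sketch under stated assumptions, not the thesis's circuit: the class name `ChannelDoubleBuffer`, the two-flit capacity per stage, and the helper functions `step` and `simulate` are all hypothetical, and handshake signals are sampled once at the start of each cycle.

```python
from collections import deque


class ChannelDoubleBuffer:
    """Illustrative link stage with a ready-valid interface (hypothetical model)."""

    CAPACITY = 2  # assumption: a "double" buffer holds at most two flits

    def __init__(self):
        self.slots = deque()

    @property
    def valid(self):
        # The stage has a flit to offer to the next stage.
        return len(self.slots) > 0

    @property
    def ready(self):
        # The stage can accept one more flit from the previous stage.
        return len(self.slots) < self.CAPACITY


def step(link, source, sink_ready, delivered):
    """Advance the link by one clock cycle.

    link: list of stages, index 0 nearest the sender.
    source: deque of flits waiting at the upstream router.
    sink_ready: whether the downstream input buffer accepts a flit this cycle.
    delivered: list collecting flits that reached the downstream router.
    """
    # Sample all handshake signals at the start of the cycle.
    valid = [stage.valid for stage in link]
    ready = [stage.ready for stage in link] + [sink_ready]

    # A flit moves one hop only where valid (sender) and ready (receiver) meet.
    for i in reversed(range(len(link))):
        if valid[i] and ready[i + 1]:
            flit = link[i].slots.popleft()
            if i + 1 == len(link):
                delivered.append(flit)
            else:
                link[i + 1].slots.append(flit)

    # The upstream router injects a flit if the first stage was ready.
    if source and ready[0]:
        link[0].slots.append(source.popleft())


def simulate(num_stages, num_flits, stall_cycles, drain_cycles):
    """Stall the sink for a while, then let the link drain."""
    link = [ChannelDoubleBuffer() for _ in range(num_stages)]
    source = deque(range(num_flits))
    delivered = []
    for _ in range(stall_cycles):
        step(link, source, sink_ready=False, delivered=delivered)
    buffered = sum(len(stage.slots) for stage in link)
    for _ in range(drain_cycles):
        step(link, source, sink_ready=True, delivered=delivered)
    return buffered, delivered


if __name__ == "__main__":
    print(simulate(num_stages=2, num_flits=6, stall_cycles=5, drain_cycles=10))
```

In this model, while the downstream buffer refuses flits, the link itself absorbs up to two flits per stage. That is the sense in which the link adds capacity to the router input buffer; flit order is preserved once the sink becomes ready again.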
With technology scaling, hundreds of processor cores are being integrated on a single chip, and global wire delay is increasing relative to gate delay, which continues to scale down. Conventional interconnect architectures such as buses, ad hoc wires, and crossbars are limited by bandwidth, scalability, area, and global wire delay, and cannot satisfy the requirements of on-chip interconnect. NoCs (Networks on Chip) are becoming a promising interconnect because of their better scalability, predictable wire length and delay, high bandwidth, and reusability. Scientific and commercial applications need on-chip interconnects with low latency and high throughput. Networks have been studied widely in parallel computers and the Internet, but NoCs differ from those two types of network in several ways: router latency is the main component of network latency; an NoC has richer wire resources and less buffer capacity; and an NoC faces very tight power and area budgets. The research in this thesis is based on these differences and focuses on low-latency router architecture, pipelined physical links, low-area-overhead routers, multicast, and low-power router design.
     The primary contributions of this thesis are as follows:
     1. Adaptive CDB (Channel Double Buffer). CDBs replace the link registers to implement a pipelined physical link. Ready-valid handshake flow control is adopted between CDBs and between CDBs and routers. The link adopts local congestion control: the CDBs in the link can buffer flits when the downstream input buffer cannot accept a flit. This is equivalent to increasing the capacity of the router input buffer. The CDB delay model, built on the theory of logical effort, shows that the critical path delay is sensitive to the link width and that register overhead is a main component of the critical path delay. The pipeline depth of a CDB-based link is sensitive to the wire type, wire length, and clock cycle. Compared with pipelining the link by inserting registers, the CDB-based approach increases the pipeline depth, but not significantly.
     2. DVOQR (Dynamic Virtual Output Queue Router) with dynamic buffer allocation based on CDB. The UDB (Unified Dynamic Buffer) read, routing computation, and switch allocation can be performed in parallel through VOQ (Virtual Output Queue), a look-ahead routing computation strategy, dynamic buffer allocation, and VOAQ (Virtual Output Address Queue), which reduces the router pipeline to two stages. The dynamic buffer allocation strategy uses the limited on-chip buffer resources efficiently. Under random traffic, DVOQR achieves the same throughput as the virtual channel router with one quarter of its buffer capacity. The delay model, established on logical effort theory, shows that the critical path delay is more sensitive to the number of ports. Synthetic workload simulation results show that, in a 4x4 Mesh network under random traffic, the throughput of DVOQR is 46.9% and 28.5% higher than that of the wormhole router and the virtual channel router, respectively. Under the same input speedup, the throughput of DVOQR is still 1.9% higher than that of a virtual channel router with twice the buffer capacity of DVOQR, and almost the same as that of a virtual channel router with four times the buffer capacity of DVOQR. Application workload simulation results show that the average network delay of DVOQR, the wormhole router, and the virtual channel router is 6.6%, 50.9%, and 94.6% higher than that of the ideal router, respectively.
     3. BEA-BLESS (Based on Encoding Allocation BufferLESS router) with low area overhead. BEA-BLESS has no input buffer and reduces the chip area required by the NoC. FBEA-BLESS is the flit-switched router and PBEA-BLESS is the packet-switched router. The encoding allocation strategy adopted by BEA-BLESS shortens the router critical path and increases the router operating frequency. The frequency of BEA-BLESS is twice that of B-BLESS (Base BufferLESS router). Livelock is avoided by the GoSS (Go-Stop-Steer) strategy. The PBEA-BLESS router uses a small-capacity buffer to eliminate the reorder buffer at the receiving end, and an improved GoSS strategy avoids livelock and starvation. Application workload simulation results show that the average network delay of BEA-BLESS is 29.4% lower than that of B-BLESS, and that the buffer capacity needed to support packet switching is only 33.3% of the reorder buffer capacity.
     4. Load-balanced multicast routers based on DVOQR. A network throughput model for multicast is established, following the throughput model for unicast. BDOR (Balanced Dimension Order Routing) and MPDOR (Minimal Path Dimension Order Routing), proposed in this thesis, are load-balanced routing algorithms. SM-DVOQR (Supporting Multicast DVOQR) and SMDL-DVOQR (Supporting Multicast Double Lane DVOQR) are based on DVOQR and support multicast efficiently. SM-DVOQR supports the XY and YX multicast routing algorithms. Used alone, either algorithm causes a channel load imbalance between the X and Y directions, and the imbalance grows with network size. SMDL-DVOQR, which has two lanes, supports the BDOR and MPDOR multicast routing algorithms by running XY multicast routing on one lane and YX multicast routing on the other. Simulation results under random multicast traffic show that network performance improves as the number of local output ports increases, with an optimal value of 2, and that SMDL-DVOQR, which balances the network load, achieves better network performance than SM-DVOQR.
     5. Leakage power optimization strategies for DVOQR. RTL-level power analysis of DVOQR shows that the UDB and VOAQ are the main consumers of leakage power, accounting for 85% of the total leakage power, and that leakage power is an important component of total power consumption under low network traffic. Adaptive buffer management and two-entry-never-turned-off are the two leakage power optimization strategies. The adaptive buffer management strategy effectively reduces leakage power consumption, but at low network injection rates the buffer wake-up delay is added to the average network delay. Look-ahead wake-up fully hides the wake-up delay when Twakeup=1, while the two-entry-never-turned-off strategy tolerates a larger wake-up delay. The leakage power savings of the two-entry-never-turned-off strategy are lower than those of adaptive buffer management at low network injection rates, but the savings of the two strategies are almost the same at medium or high injection rates.
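The dynamic buffer allocation of item 2 can be pictured as one shared pool of flit slots plus a per-output queue of slot addresses. The following is a minimal sketch under that reading, not the thesis's design: the class `UnifiedDynamicBuffer`, its methods, and the free-list policy are hypothetical names and choices made for illustration only.

```python
from collections import deque


class UnifiedDynamicBuffer:
    """Shared slot pool with one address queue per output port (illustrative)."""

    def __init__(self, num_slots, num_outputs):
        self.slots = [None] * num_slots          # flit storage shared by all outputs
        self.free = deque(range(num_slots))      # addresses of unused slots
        # Assumption: one virtual output address queue per output port.
        self.voaq = [deque() for _ in range(num_outputs)]

    def write(self, flit, out_port):
        """Store a flit bound for out_port; return False if the pool is full."""
        if not self.free:
            return False                         # upstream must hold the flit
        addr = self.free.popleft()
        self.slots[addr] = flit
        self.voaq[out_port].append(addr)
        return True

    def read(self, out_port):
        """Remove and return the oldest flit for out_port, or None if empty."""
        if not self.voaq[out_port]:
            return None
        addr = self.voaq[out_port].popleft()
        flit = self.slots[addr]
        self.slots[addr] = None
        self.free.append(addr)                   # the slot is reusable by any output
        return flit


if __name__ == "__main__":
    buf = UnifiedDynamicBuffer(num_slots=4, num_outputs=3)
    for flit, port in [("a", 2), ("b", 2), ("c", 2), ("d", 0)]:
        buf.write(flit, port)
    # The flit for port 0 is not stuck behind the three flits for port 2.
    print(buf.read(0), buf.read(2))
```

Because any free slot can serve any output, a burst toward one port can use most of the pool, and flits bound for different outputs never block one another at the head of a queue. This is one way to read the claim that dynamic allocation uses limited buffer capacity efficiently.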
引文
[1] Moore G. Cramming more components onto integrated circuits[J]. Electronics, 1965, 38(8):114-117.
    [2] ITRS. International technology roadmap for semiconductors 2009 Update [EB/OL]. http://public.itrs.net, 2009, April.
    [3] Burger D and Goodman J. R. Billion-Transistor Architectures: There and Back Again[J]. IEEE Computer, 2004, 37(3):22-28.
    [4] Olukotun K, Hammond L and Laudon J. Chip Multiprocessor Architecture:Techniques to Improve Throughput and Latency[M].Morgan& Claypool, 2007.
    [5] Hennessy J. L and Patterson D. A. Computer Architecture: A Quantitative Approach[M].Morgan Kaufmann, 2006.
    [6]孙彩霞.同时多线程处理器中的资源分配策略研究[D]. Ph.D,国防科学技术大学,湖南长沙, 2006.
    [7] Tullsen D. M, Eggers S and Levy H. Simultaneous Multithreading: Maximizing On chip Parallelism[C]. In the Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, 1995:392-403.
    [8] Shekhar Borkar. Thousand core chips - A technology perspective[C]. In the Proceedings of the 44th ACM/IEEE Design Automation Conference, San Diego, CA, United states, 2007:746-749.
    [9] Kalla R, Sinharoy B and Tendler J. M. IBM Power5 Chip: A Dual-Core Multithreaded Processor[J]. IEEE Micro, 2004, 24(2):40-47.
    [10] Le H. Q, Starke W. J and Fields J. S. IBM POWER6 microarchitecture[J]. IBM Journal of Research & Development, 2007, 51(6):639-662.
    [11] SUN. UltraSPARC T2? Supplement to the UltraSPARC Architecture 2007[M].draft d1.4.3 ed, 2007.
    [12] L. A. Barroso, K. Gharachorloo, R. McNamara, et al. Piranha: A scalable architecture based on single-chip multiprocessing[C]. In the Proceedings of the 27th Annual International Symposium on Computer Architecture, Vancouver, B.C, 2000:282-293.
    [13] Intel. From a few cores to many: A Terascale computing research overview [EB/OL]. http://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf, 2006.
    [14] Poonacha Kongetira, Kathirgamar Aingaran and Kunle Olukotun. Niagara: A 32-way multithreaded SPARC processor[J]. IEEE Micro, 2005, 25(2):21-29.
    [15] Yatin Hoskote, Sriram Vangal, Arvind Singh, et al. A 5-GHz mesh interconnect for a teraflops processor[J]. IEEE Micro, 2007, 27(5):51-61.
    [16] Sriram Vangal, Jason Howard, Gregory Ruhl, et al. An 80-Tile 1.28TFLOPS network-on-chip in 65nm CMOS[C]. In the Proceedings of the 54th IEEE International Solid-State Circuits Conference( ISSCC'2007), San Francisco, CA, United states, 2007:95-98+589.
    [17] Hammond L, Nayfeh B and Olukotun K. A Single-Chip Multiprocessor[J]. Computer, 1997, 30(9):79-85.
    [18] Spracklen L and Abraham S. Chip Multithreading: Opportunities and Challenges[C]. In the Proceedings of the 11th International Conference on High-Performance Computer Architecture, San Francisco, CA, USA, 2005:249-252.
    [19] Intel. Intel Xeon Processor [EB/OL]. http://www.intel.com/itcenter/products/xeon/index.htm, 2011.
    [20] Oracle/Sun. Sun SPARC Enterprise T-Series [EB/OL]. http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/index.html, 2011.
    [21] Marc Tremblay and Shailender Chaudhry. A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor[C]. In the Proceedings of IEEE International Solid-State Circuits Conference, San Francisco, CA, 2008:82-83.
    [22] David Wentzlaff, Patrick Griffin, Henry Hoffmann, et al. On-chip interconnection architecture of the tile processor[J]. IEEE Micro, 2007, 27(5):15-31.
    [23] J. A.Kahl, M. N. Day, H. P. Hofstee, et al. Introduction to the Cell multiprocessor[J]. IBM Journal of Research and Development, 2005, 49(4):589-604.
    [24] Ron Ho. On-Chip Wires: Scaling and Efficiency[D]. Ph.D, Department of electrical engineering and the committee on graduate studies, Stanford University, California, 2003.
    [25] Ron Ho, Ken Mai and Mark Horowitz. Efficient On-Chip Global Interconnects[C]. In the Proceedings of the 2003 Symposium on VLSI Circuits, Kyoto, Japan, 2003:271-274.
    [26] R. Ho, K. Mai and M. Horowitz. The Future of Wires[J]. Proceedings of the IEEE, April 2001, 89(4):490-504.
    [27] ITRS. International technology roadmap for semiconductors 2007 Update [EB/OL]. http://public.itrs.net, 2007, April.
    [28] Stephen W. Keckler, Doug Burger, Charles R. Moore, et al. A wire-delay scalable microprocessor architecture for high performance systems[C]. In the Proceedings of the 2003 IEEE International Solid-State Circuits Conference(ISSCC'2003), United states, 2003:157+168-169.
    [29] M. S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, et al. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays[C]. In the Proceedings of the 29th Annual International Symposium on Computer Architecture, Anchorage, AK, United states, 2002:14-24.
    [30] G. Hinton, D. Sager, M. Upton, et al. The microarchitecture of the Pentium 4 processor[J]. Intel Technology Journal, 2001, 5(1):15-27.
    [31] Changkyu Kim, Doug Burger and Stephen W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated on-chip Caches[C]. In the Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, United states, 2002:211-222.
    [32] Jan M. Rabaey, Anantha Chandrakasan and Borivoje Nikolic. Digital Integrated Circuits:A Design Perspective (2nd Edition)[M].北京:清华大学出版社, 2004.
    [33] Yuan Xie and Ma Yuchun. Design space exploration for 3D integrated circuits[C]. In the Proceedings of the 9th International Conference onSolid-State and Integrated-Circuit Technology, Beijing, China, 2008:2317-2320.
    [34] Yuan Xie, Gabriel H. Loh, Bryan Black, et al. Design space exploration for 3D architectures[J]. ACM Journal on Emerging Technologies in Computing Systems, 2006, 2(2):65-103.
    [35] D. Sylvester and K. Keutzer. A Global Wiring Paradigm for Deep Submicron Design[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2000, 19(2):242-251.
    [36] Luca Benini and Giovanni De Micheli. Networks on chips: A new SoC paradigm[J]. Computer, 2002, 35(1):70-78.
    [37] John D. Owens, William J. Dally, Ron Ho, et al. Research challenges for on-chip interconnection networks[J]. IEEE Micro, 2007, 27(5):96-108.
    [38] Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks[M].Morgan &cLaypool Publisher, 2009.
    [39] Ranganathan P and Jouppi N. Enterprise IT Trends and Implications for Architecture Research [C]. In the Proceedings of the 11th International Symposium on High Performance Computer Architecture, 2005:253–256.
    [40] Natalie Enright Jerger, Li-Shiuan Peh and Mikko Lipasti. Circuit-switched coherence[C]. In the Proceedings of the 2nd IEEE International Symposium on Networks-on-Chip( NOCS'2008), Newcastle upon Tyne, United kingdom, 2008:193-202.
    [41] Natalie Enright Jerger, Mikko Lipasti and Li-Shiuan Peh. Circuit-switched coherence[J]. IEEE Computer Architecture Letters, 2007, 6(1):5-8.
    [42] U.J. Kapasi, W.J. Dally, S. Rixner, et al. The Imagine Stream Processor[C]. In the Proceedings of the 20th IEEE International Conference on Computer Design, 2002:282-288.
    [43] R.Hofman and B.Drerup. Next generation CoreConnect processor local bus archite cture[C]. In the Proceedings of IEEE International ASIC/SOC Conference, 2002.
    [44] D. Flynn. AMBA: enabling reusable on-chip designs[J]. IEEE Micro, 1997, 17(4):20-27.
    [45] D. Wingard. MircoNetwork-based intergration for SOCs[C]. In the Proceedings of 38th Design Automation Conference, Las Vegas, NV, United states, 2001:673-677.
    [46] B. Cordan, Palmchip Cor. and CO Loveland. An efficient bus architecture for system-on-chip design[C]. In the Proceedings of IEEE Custom Integrated Circuits, San Diego, CA , USA, 1999:623-626.
    [47] Lizheng Zhang, Yuhen Hu and Charlie Chung-Ping Chen. Wave-Pipelined On-Chip Global Interconnect[C]. In the Proceedings of the 10th Conference on Asia South Pacific Design Automation, Shanghai, 2005:127-132.
    [48] Jiang Xu and Wayne Wolf. A Wave-Pipelined On-chip Interconnect Structure for Networks-on-Chips[C]. In the Proceedings of the Symposium on High Performance Interconnects (Hot Interconnects), Palo Alto, California, USA, 2002:10-14.
    [49] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture[J]. ACM SIGARCH Computer Architecture News, 2003, 31(2):422–433.
    [50] Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, et al. On-chip interconnection networks of the TRIPS chip[J]. IEEE Micro, 2007, 27(5):41-50.
    [51] Liang J, Swaminathan S and Tessier R. aSOC: A Scalable, Single-Chip Communications Architecture[C]. In the Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, Philadelphia, PA, USA, 2000:37–46.
    [52] Liang J, Laffely A, Srinivasan S, et al. An architecture and compiler for scalable on-chip communication[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2004, 12(7):711–726.
    [53] Millberg M, Nilsson E, Thid R, et al. The Nostrum backbone - a communication protocol stack for networks on chip[C]. In the Proceedings of 17th VLSI Design Conference, Mumbai, India, 2004:693-696.
    [54] Matteo Dall'Osso, Gianluca Biccari, Luca Giovannini, et al. Xpipes: a Latency Insensitive Parameterized Network-on-Chip Architecture for Multi-Processor SoCs[C]. In the Proceedings of 21st International Conference on Computer Design (ICCD 2003), San Jose, CA, United states, 2003:536-539.
    [55] Evgeny Bolotin, Israel Cidon, Ran Ginosar, et al. QNoC: QoS architecture and design process for network on chip[J]. Journal of Systems Architecture, 2004, 50(2-3):105-128.
    [56] Thomas William Ainsworth and Timothy Mark Pinkston. Characterizing the cell EIB on-chip network[J]. IEEE Micro, 2007, 27(5):6-14.
    [57] Shubo Qi, Minxuan Zhang, Jinwen Li, et al. A high performance router with dynamic buffer allocation for on-chip interconnect networks[C]. In the Proceedings of the 28th IEEE International Conference on Computer Design, Amsterdam, Netherlands, 2010:462-467.
    [58] Marcello Coppola, Riccardo Locatelli, Giuseppe Maruccia, et al. Spidergon: A novel on-chip communication network[C]. In the Proceedings of International Symposium on System on Chip, Tampere, Finland, 2004:15.
    [59] M.B. Taylor, J. Kim, J. Miller, et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs[J]. IEEE Transactions on Micro, 2002, 22(2):25-35.
    [60] S. R. Vangal, J. Howard, G. Ruhl, et al. An 80-tile sub-100-w TeraFLOPS processor in 65-nm CMOS[J]. IEEE Journal of Solid State Circuits, 2008, 43(1):29-41.
    [61] Alan Gara, Matthias A., Blumrich, et al. Overview of the Blue Gene/L system architecture[J]. IBM Journal of Research and Development, 2005, 49(2-3):195-212.
    [62] Steve Scott, Dennis Abts, John Kim, et al. The BlackWidow high-radix clos network[C]. In the Proceedings of the 33rd International Symposium on Computer Architecture, Boston, MA, United states, 2006:16-27.
    [63] Nitin Godiwala, Jud Leonard and Matthew Reilly. A network fabric for scalable multiprocessor systems[C]. In the Proceedings of the Symposium on Hot Interconnects, 2008:137–144.
    [64] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks[C]. In the Proceedings of the 38th Design Automation Conference, Las Vegas, NV, United states, 2001:684-689.
    [65] J. Huh, D. Burger and S. W. Keckler. Exploring the design space of future CMPs[C]. In the Proceedings of Internatinal Conference on Parallel Architectures and Compilation Techniques (PACT 2001), Barcelona, Spain, 2001:199-210.
    [66] S. S. Mukherjee, P. Bannon, S. Lang, et al. The alpha 21364 network architecture[J]. IEEE Micro, 2002, 22(1):26-35.
    [67] Hangsheng Wang, Li-Shiuan Peh and Sharad Malik. Power-driven Design of Router Microarchitectures in On-chip Networks[C]. the 36th annual IEEE/ACM International Symposium on Microarchitecture, 2003:105-117.
    [68] M. B. Taylor, W. Lee, S. Amarasinghe, et al. Scalar Operand Networks:on-Chip Interconnect for ILP in Partitioned Architectures[C]. In the Proceedings of the International Symposium on High Performance Computer Architecture(HPCA2003), Anaheim, California, USA, 2003:341-353.
    [69] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, et al. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2009, 28(1):3-21.
    [70] Meincke. T. Hemani, A. Kumar and S. Ellervee. Globally Asynchronous Locally Synchronous Architecture for Large High Performance ASICs[C]. In the Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Orlando, USA, 1999:512-515.
    [71] J. Muttersbach, T. Villiger and V. Fichtner. Practical Design of Globally-Asynchronous Locally-Synchronous Systems[C]. In the Proceedings of the 6th International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), Eilat, Israel, 2000:52-59.
    [72] A. B. Kahng, K. Masuko and S. Muddu. Analytical Delay Models for VLSI Interconnects Under Ramp Input[C]. In the Proceedings of IEEE/ACM International Conference on Computer Aided Design, San Jose, California, USA, 1996:30-36.
    [73] C. L. Ratzlaff, N. Gopa and L. T. Pillage. RICE: Rapid Interconnect Circuit Evaluator[C]. In the Proceedings of ACM/IEEE Design Automation Conference, San Francisco, California, USA, 1991:555-560.
    [74]刘祥远.多核SoC片上网络关键技术研究[D]. PhD,国防科学技术大学,湖南, 2007.
    [75] Luca Carloni, Andrew B. Kahng, Swamy Muddu, et al. Interconnect modeling for improved system-level design optimization[C]. In the Proceedings of 2008 Asia and South Pacific Design Automation Conference(ASP-DAC'2008), Seoul, Korea, Republic of, 2008:258-264.
    [76] G. S. Garcea, N. P. Vander Meijs and R. H. J. M. Otten. Analytic Model for Area and Power Constrained Optimal Repeater Insertion[C]. In the Proceedings of 29th European Solid-State Circuits Conference, 2003:591-594.
    [77] Rahul Nagpal, Arvind Madan, Amrutur Bhardwaj, et al. INTACTE: An interconnect area, delay, and energy estimation tool for microarchitectural explorations[C]. In the Proceedings of the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems(CASES'2007), Salzburg, Austria, 2007:238-247.
    [78] V. Soteriou and L. S. Peh. Exploring the Design Space of Self-Regulating Power-Aware On/Off Interconnection Networks[J]. IEEE Transactions on Parallel and Distributed Systems, 2007, 18(3):393-408.
    [79] Jan M. Rabaey, ananth Chandrakasan and Borivoje Nikolic.数字集成电路——设计透视(影印版)[M].北京:清华大学出版社, 2004.
    [80] Avinash Kodi, Ashwini Sarathy and Ahmed Louri. Design of adaptivecommunication channel buffers for low-power area-efficient network-on-chip architecture[C]. In the Proceedings of the 3rd ACM/IEEE Symposium on Architectures for Networking and Communications Systems, Orlando, FL, United states, 2007:47-56.
    [81] Avinash Karanth Kodi, Ashwini Sarathy and Ahmed Louri. Adaptive channel buffers in on-chip interconnection networks - A power and performance analysis[J]. IEEE Transactions on Computers, 2008, 57(7):1169-1181.
    [82] George Michelogiannakis, James Balfour and William J. Dally. Elastic-buffer flow control for on-chip networks[C]. In the Proceedings of International Symposium on High-Performance Computer Architecture(HPCA'2009), Takamatsu, Japan, 2009:151-162.
    [83] George Michelogiannakis and William J. Dally. Router Designs for Elastic Buffer On-Chip Networks[C]. the Conference on High Performance Computing Networking, Portland, OR, United states, 2009:1-10.
    [84] M. Mizuno, W. J. Dally and H. Onishi. Elastic interconnects: Repeater-inserted long wiring capable of compressing and decompressing data[C]. In the Proceedings of the IEEE International Solid-State Circuits Conference, San Fransisco, CA, USA, 2001:346–347.
    [85] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks[M]. San Francisco,CA, USA:Morgan Kaufmann Publishers Inc., 2003.
    [86] S. S. Mukherjee. A comparative study of arbitration algorithms for the Alpha 21364 pipelined router[C]. In the Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, 2002:223-234.
    [87] Amit Kumary, Partha Kundu, Arvind P. Singh, et al. A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS[C]. In the Proceedings of IEEE International Conference on Computer Design( ICCD 2007), Lake Tahoe, CA, United states, 2007:63-70.
    [88] Vangal Sriram, Borkar Nitin and Alvandpour Atila. A Six-Port 57GB/s Double-Pumped Nonblocking Router Core[C]. In the Proceedings of 2005 Symposium on VLSI Circuits, Kyoto, Japan, 2005:268-269.
    [89] Li-Shiuan Peh. Flow control and Micro-architectural mechanisms for extending the performance of interconnection networks[D]. Ph.D, Stanford University California, 2001.
    [90] M. Galles. Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip[C]. In the Proceedings of Symposium on Hot Interconnects, Palo Alto, CA, 1996:141-146.
    [91] Li-Shiuan Peh and William J. Dally. Delay model and speculative architecture for pipelined routers[C]. In the Proceedings of the 7th International Symposium on High-Performance Computer Architecture(HPCA'2000), Nuevo Leon, Mex, 2001:255-266.
    [92] Amit Kumar, Li-Shiuan Peh, Partha Kundu, et al. Express virtual channels: Towards the ideal interconnection fabric[C]. In the Proceedings of the 34th Annual International Symposium on Computer Architecture(ISCA'2007), San Diego, CA, United states, 2007:150-161.
    [93] John kim. Low-cost Router Microarchitecture for On-Chip networks[C]. In the Proceedings of the Annual International Symposium on Microarchitecture, New York, NY, United states, 2009:255-266.
    [94] Mark J. Karol, Michael G. Hluchyj and Samuel P. Morgan. Input VS. Output Queueing on a Space-Division Packet Witch[C]. In the Proceedings of IEEE Global Telecommunications Conference: Communications Broadening Technology Horizons, Houston, TX, USA, 1986:659-665.
    [95] Chrysostomos A. Nicopoulos, Dongkook Park, Jongman Kim, et al. ViChaR: A dynamic Virtual Channel Regulator for Network-on-Chip Routers[C]. In the Proceedings of 39th Annual IEEE/ACM International Symposium on Microarchitecture, Orlando, FL, United states, 2006:333-344.
    [96] Pablo Abad, Valentin Puente, Pablo Prieto, et al. Rotary router: An efficient architecture for CMP interconnection networks[C]. In the Proceedings of the 34th Annual International Symposium on Computer Architecture(ISCA2007), San Diego, CA, United states, 2007:116-125.
    [97]朱晓静,胡伟武,马可, et al. Xmesh:一个mesh-like片上网络拓扑结构[J].软件学报, 2007, 18(9)
    [98]荆元利.基于片上网络的系统芯片研究[D].博士,计算机学院,西北工业大学,西安, 2005.
    [99]陆俊林王宏伟,佟冬.层次化片上网络结构的簇生成算法[J].电子学报, 2007, 35(5)
    [100] Thomas Moscibroda and Onur Mutlu. A case for bufferless routing in on-chip networks[C]. In the Proceedings of the 36th Annual International Symposium on Computer Architecture(ISCA'2009), Austin, TX, United states, 2009:196-207.
    [101] George Michelogiannakis, Daniel Sanchez, William J. Dally, et al. Evaluating bufferless flow control for on-chip networks[C]. In the Proceedings of the 4th ACM/IEEE International Symposium on Networks on Chip( NOCS'2010), Grenoble, France, 2010:9-16.
    [102] Zhonghai Lu, Mingchen Zhong and Axel Jantsch. Evaluation of on-chip networks using deflection routing[C]. In the Proceedings of the 2006 ACM Great Lakes Symposium on VLSI(GLSVLSI'2006), Philadelphia, PA, United states, 2006:296-301.
    [103] Crispin Gomez, Maria E. Gomez, Pedro Lopez, et al. Reducing packet dropping in a bufferless NoC[C]. In the Proceedings of the 14th International Euro-Par Conference, Las Palmas de Gran Canaria, Spain, 2008:899-909.
    [104] Mitchell Hayenga, Natalie Enright Jerger and Mikko Lipasti. SCARAB: A single cycle adaptive routing and bufferless network[C]. In the Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture(Micro'2009), New York, NY, United states, 2009:244-254.
    [105] Chaochao Feng, Minxuan Zhang, Jinwen Li, et al. A Low-Overhead Fault-Aware Deflection Routing Algorithm for 3D Network-on-Chip[C]. In the Proceedings of 2011 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Indian, 2011 19-24.
    [106] M. M. K. Martin, M. D. Hill and D. A. Wood. Token coherence: Decoupling performance and correctness[C]. In the Proceedings of International Symposium on Computer Architecture(ISCA'03), San Diego, CA, United states, 2003:182-193.
    [107] N. E. Jerger, L. S. Peh and M. Lipasti. Virtual Circuit Tree Multicasting: A Case for On-chip Hardware Multicast Support[C]. In the Proceedings of International Symposium on Computer Architecture(ISCA'08), Beijing, China, 2008:229-240.
    [108] Z. Lu, B. Yin and A. Jantsch. Connection-oriented Multicasting in Wormhole-switched Networks on Chip[C]. In the Proceedings of IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 2006:205-210.
    [109] Pablo Abad, Valentin Puente and Jose-Angel Gregorio. MRR: Enabling fully adaptive multicast routing for cmp interconnection networks[C]. In the Proceedings of International Symposium on High-Performance Computer Architecture(HPCA'2009), Takamatsu, Japan, 2009:355-366.
    [110] E. A. Carara and F. G. Moraes. Deadlock-Free Multicast Routing Algorithm for Wormhole-Switched Mesh Networks-on-Chip[C]. IEEE Computer Society Annual Symposium on VLSI, 2008:341 - 346.
    [111] Y. H. Kang, J. Sondeen and J. Draper. Implementing tree-based multicast routing for write invalidation messages in networks-on-chip[C]. In the Proceedings of 52nd IEEE International Midwest Symposium on Circuits and Systems, 2009:1118 - 1121.
    [112] S. Rodrigo, J. Flich, J. Duato, et al. Efficient unicast and multicast support for CMPs[C]. In the Proceedings of, 2008:364– 375.
    [113] Wenmin Hu, Zhonghai Lu, A. Jantsch, et al. Power-effcient Tree-based Multicast Support for Networks-on-Chip[C]. In the Proceedings of 16th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, 2011:363-368.
    [114] Andrew B. Kahng, Li Bin, Li-Shiuan Peh, et al. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration[C]. In the Proceedings of the 2009 Design, Automation and Test in Europe Conference and Exhibition(DATE '2009), Nice, France, 2009:423-428.
    [115] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, et al. ORION: a power-performance simulator for interconnection networks[C]. the 35th annual ACM/IEEE international symposium on Microarchitecture, Istanbul, Turkey, 2002:294-305.
    [116] J. Balfour and W. J. Dally. Design tradeoffs for tiled cmp on-chip networks[C]. In the Proceedings of the 20th annual international conference on Supercomputing, New York, NY, USA, 2006:187-198.
    [117] Nilanjan Banerjee, Praveen Vellanki and Karam S. Chatha. A power and performance model for network-on-chip architectures[C]. In the Proceedings of Design, Automation and Test in Europe Conference and Exhibition(DATE'2004), Paris, France, 2004:1250-1255.
    [118] Hiroki Matsutani, Michihiro Koibuchi, Hideharu Amano, et al. Run-time power gating of on-chip routers using look-ahead routing[C]. In the Proceedings of the 2008 Asia and South Pacific Design Automation Conference(ASP-DAC'2008), Seoul, Korea, 2008:55-60.
    [119] Zhang Chenxi, Wang Zhiying, Zhang Chunyuan, et al. Computer Architecture, 2nd ed.[M]. Beijing: Higher Education Press, 2005.
    [120] F. Karim. An Interconnect Architecture for Networking Systems on Chips[J]. IEEE Micro, 2002, 22(5):36-45.
    [121] P. P. Pande, C. Grecu, André Ivanov, et al. Design of a Switch for Network on Chip Applications[C]. In the Proceedings of IEEE International Symposium on Circuits and Systems, Bangkok, Thailand, 2003:V217-V220.
    [122] Reetuparna Das, Soumya Eachempati, Asit K. Mishra, et al. Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs[C]. In the Proceedings of the International Symposium on High-Performance Computer Architecture, Takamatsu, Japan, 2009:175-186.
    [123] John Kim, William J. Dally and Dennis Abts. Flattened butterfly: A cost-efficient topology for high-radix networks[C]. In the Proceedings of 34th Annual International Symposium on Computer Architecture(ISCA2007), San Diego, CA, United States, 2007:126-137.
    [124] Boris Grot, Joel Hestness, Stephen W. Keckler, et al. Express cube topologies for on-chip interconnects[C]. In the Proceedings of International Symposium on High-Performance Computer Architecture(HPCA'2009), Takamatsu, Japan, 2009:163-174.
    [125] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication[C]. In the Proceedings of the 13th Annual ACM Symposium on Theory of Computing, 1981:263-277.
    [126] C. J. Glass and L. M. Ni. The turn model for adaptive routing[C]. In the Proceedings of the IEEE International Symposium on Computer Architecture, 1992:278-287.
    [127] José Duato. A new theory of deadlock-free adaptive routing in wormhole networks[J]. IEEE Transactions on Parallel and Distributed Systems, 1993, 4(12):1320-1331.
    [128] José Duato. A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks[J]. IEEE Transactions on Parallel and Distributed Systems, 1995, 6(10):1055-1067.
    [129] José Duato. A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks[J]. IEEE Transactions on Parallel and Distributed Systems, 1996, 7(6):841-854.
    [130] Steven L. Scott and Gregory M. Thorson. The Cray T3E network: Adaptive routing in a high performance 3D torus[C]. In the Proceedings of the Symposium on Hot Interconnects, 1996:147-156.
    [131] P. Gratz, B. Grot and S. W. Keckler. Regional congestion awareness for load balance in networks-on-chip[C]. In the Proceedings of the 14th IEEE International Symposium on High-Performance Computer Architecture, Salt Lake City, UT, United States, 2008:203-214.
    [132] A. Kumar, L.-S. Peh and N. K. Jha. Token flow control[C]. In the Proceedings of the 41st International Symposium on Microarchitecture, Lake Como, Italy, 2007:342-353.
    [133] Kermani P and Kleinrock L. Virtual cut-through: A new computer communication switching technique[J]. 1979, 3:267-286.
    [134] William J. Dally. Virtual-channel flow control[C]. In the Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, USA, 1990:60-68.
    [135] Dally W J. Virtual-Channel Flow Control[J]. IEEE Transactions on Parallel and Distributed Systems, 1992, 3(2):194-204.
    [136] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset[J]. SIGARCH Comput. Archit. News, 2005, 33(4):92-99.
    [137] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, et al. GARNET: A detailed on-chip network model inside a full-system simulator[C]. In the Proceedings of International Symposium on Performance Analysis of Systems and Software(ISPASS'2009), Boston, MA, United States, 2009:33-42.
    [138] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, et al. SPLASH-2 programs: Characterization and methodological considerations[C]. In the Proceedings of the 1995 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, 1995:24-36.
    [139] Xuning Chen and Li-Shiuan Peh. Leakage Power Modeling and Optimization in Interconnection Networks[C]. In the Proceedings of the 2003 International Symposium on Low Power Electronics and Design(ISLPED'03), Seoul, Korea, 2003:90-95.
    [140] Yuh-Fang Tsai, Vijaykrishnan Narayanan, Yuan Xie, et al. Leakage-aware interconnect for on-chip network[C]. In the Proceedings of Design, Automation and Test in Europe(DATE '05), Munich, Germany, 2005:230-231.
    [141] Weiping Liao and Lei He. Full-Chip Interconnect Power Estimation and Simulation Considering Concurrent Repeater and Flip-Flop Insertion[C]. In the Proceedings of International Conference on Computer-Aided Design, San Jose, CA, United States, 2003:574-580.
    [142] Ankireddy Nalamalpu and Wayne Burleson. A Practical Approach to DSM Repeater Insertion: Satisfying Delay Constraints while Minimizing Area and Power[C]. In the Proceedings of IEEE International ASIC Conference and Exhibit, Arlington, VA, United States, 2001:152-156.
    [143] Seongmoo Heo and Krste Asanovic. Power-Optimal Pipelining in Deep Submicron Technology[C]. In the Proceedings of International Symposium on Low Power Electronics and Design(ISLPED'04), Newport Beach, CA, United States, 2004:218-223.
    [144] Ivan E. Sutherland, Robert F. Sproull, et al. Logical Effort: Designing Fast CMOS Circuits[M]. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999.
    [145] Man Lung Mui, Kaustav Banerjee and Amit Mehrotra. A Global Interconnect Optimization Scheme for Nanometer Scale VLSI With Implications for Latency, Bandwidth, and Power Dissipation[J]. IEEE Transactions on Electron Devices, 2004, 51(2):195-203.
    [146] Kaustav Banerjee and Amit Mehrotra. A Power-Optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs[J]. IEEE Transactions on Electron Devices, 2002, 49(11):2001-2007.
    [147] S. Thoziyoor, N. Muralimanohar and N. P. Jouppi. CACTI 5.0: An Integrated Cache Timing, Power, and Area Model[R]. HP Laboratories Palo Alto, 2007.
    [148] Li-Shiuan Peh and William J. Dally. Delay model for router microarchitectures[J]. IEEE Micro, 2001, 21(1):26-34.
    [149] Deng Pan and Yuanyuan Yang. FIFO Based Multicast Scheduling Algorithm for VOQ Packet Switches[C]. In the Proceedings of International Conference on Parallel Processing, Montreal, Que, Canada, 2004:318-325.
    [150] Kenji Yoshigoe and Kenneth J. Christensen. An Evolution to Crossbar Switches with Virtual Output Queuing and Buffered Cross Points[J]. IEEE Network, 2003, 17(5):48-56.
    [151] J. Duato, J. Flich and T. Nachiondo. A Cost-Effective Technique to Reduce HOL Blocking in Single-Stage and Multistage Switch Fabrics[C]. In the Proceedings of 12th Euromicro Conference on Parallel, Distributed and Network-based Processing, A Coruna, Spain, 2005:48-53.
    [152] G. L. Frazier and Y. Tamir. The design and implementation of a multiqueue buffer for VLSI communication switches[C]. In the Proceedings of the IEEE International Conference on Computer Design, 1989:466-471.
    [153] Y. Tamir and G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches[C]. In the Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988:343-354.
    [154] Paul Gratz, Changkyu Kim, Robert McDonald, et al. Implementation and evaluation of on-chip network architectures[C]. In the Proceedings of the 24th International Conference on Computer Design (ICCD'2006), San Jose, CA, United States, 2006:477-484.
    [155] R. V. Boppana, S. Chalasani and C. S. Raghavendra. Resource deadlocks and performance of wormhole multicast routing algorithms[J]. IEEE Transactions on Parallel and Distributed Systems, 1998, 9(6):535-549.
    [156] X. Lin, P. K. McKinley and L.M. Ni. Deadlock-Free Multicast Wormhole Routing in 2D Mesh Multicomputers[J]. IEEE Transactions on Parallel and Distributed Systems, 1994, 5(8):793-804.
    [157] R. V. Boppana, S. Chalasani and C. S. Raghavendra. On multicast wormhole routing in multicomputer networks[C]. In the Proceedings of 6th IEEE Symposium on Parallel and Distributed Processing, 1994:722-729.
    [158] X. Lin and L. M. Ni. Multicast communication in multicomputer networks[J]. IEEE Transactions on Parallel and Distributed Systems, 1993, 4(10):1105-1117.
    [159] C. M. Chiang and L. M. Ni. Multi-Address Encoding for Multicast[C]. International Workshop on Parallel Computer Routing and Communication, Seattle, WA, United States, 1994:146-160.
    [160] F. A. Samman, T. Hollstein and M. Glesner. Adaptive and Deadlock-Free Tree-Based Multicast Routing for Networks-on-Chip[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2010, 18(7):1067-1079.
    [161] T. Ye, L. Benini and G. D. Micheli. Analysis of Power Consumption on Switch Fabrics in Network Routers[C]. In the Proceedings of 39th Design Automation Conference, New Orleans, LA, United States, 2002:524-529.
    [162] A. Banerjee, R. Mullins and S. Moore. A Power and Energy Exploration of Network-on-Chip Architecture[C]. In the Proceedings of International Symposium on Networks-on-Chip, Princeton, NJ, United States, 2007:163-172.
    [163] Guilherme Guindani, Cezar Reinbrecht, Thiago Raupp, et al. NoC power estimation at the RTL abstraction level[C]. In the Proceedings of IEEE Computer Society Annual Symposium on VLSI(ISVLSI'2008), Montpellier, France, 2008:475-478.
    [164] Y. Tsai, V. Narayanan, Y. Xie, et al. Leakage-Aware Interconnect for On-Chip Network[C]. In the Proceedings of the conference on Design, Automation and Test in Europe(DATE'05), Munich, Germany, 2005:230-231.
    [165] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, et al. Microarchitectural Techniques for Power Gating of Execution Units[C]. In the Proceedings of International Symposium on Low Power Electronics and Design(ISLPED'04), Newport Beach, CA, United States, 2004:32-37.
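    [166] Zhou Hongwei, Zhang Chengyi and Zhang Minxuan. A Statistics-Based Method for Evaluating Cache Leakage Power[J]. Journal of Computer Research and Development, 2008, 45(2):236-374.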
