Research on Key Techniques for Optimizing Routing Algorithms and Flow Control Mechanisms in Cache-Coherent Networks-on-Chip

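English title: Research on the Key Techniques of Routing Algorithm and Flow Control Optimizations for Cache-Coherent Networks-on-Chip
Author: Ma Sheng (马胜)
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords (translated): Networks-on-chip ; cache coherence protocol ; workload consolidation ; fully adaptive routing ; virtual channel allocation ; flit bubble flow control ; reduction communication ; multicast communication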
English keywords: Networks-on-Chip ; Cache Coherence Protocol ; Workload Consolidation ; Fully Adaptive Routing ; VC Re-allocation ; Flit Bubble Flow Control ; Reduction Communication ; Multicast Communication
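Degree year: 2012
Advisor: Wang Zhiying (王志英)
Discipline code: 0812
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2012-10-12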

Abstract

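Advances in semiconductor technology keep increasing the number of cores integrated on a chip, and traditional bus or point-to-point communication architectures fall short in bandwidth, latency, power consumption, and scalability. Networks-on-chip emerged in response to these limitations, providing a simple, efficient, and scalable communication mechanism inside the chip. On the other hand, because parallel programming is hard and legacy code must remain compatible, cache coherence protocols will persist on many-core architectures for a long time. On cache-coherent many-core architectures, the traffic carried by the network-on-chip is determined mainly by the coherence protocol in use. To support this traffic efficiently, the network-on-chip design must be optimized on the basis of an analysis of the structure and communication characteristics of the coherence protocol. The main results and contributions of this thesis are as follows: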
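     (1) A routing algorithm for the workload consolidation operating mode
     The hierarchical structure of cache coherence protocols and the limited parallelism of applications lead to multiple applications running simultaneously on one many-core platform. This workload consolidation mode requires the routing algorithm to provide good adaptivity and dynamic isolation. This thesis proposes the Destination-Based Adaptive Routing (DBAR) algorithm. Using a low-overhead congestion information propagation network, DBAR obtains both local and remote network status and can therefore avoid network congestion effectively. More importantly, by integrating destination information into output port selection, DBAR dynamically isolates multiple applications. Under a variety of network configurations, DBAR outperforms previous designs.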
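     (2) A flow control mechanism for fully adaptive routing algorithms
     Because of area and power constraints, cache-coherent networks-on-chip are generally configured with a limited number of virtual channels, which poses new challenges for the design of fully adaptive routing algorithms. Previous deadlock avoidance theories require a conservative virtual channel allocation policy, which severely limits routing performance. This thesis proposes the Whole Packet Forwarding (WPF) virtual channel allocation policy and proves that WPF does not introduce deadlock. WPF significantly improves routing performance and is an important extension of previous deadlock avoidance theories. In a virtual-channel-limited environment the routing algorithm should offer high routing flexibility, so this thesis further presents a fully adaptive routing algorithm that provides high routing flexibility at low hardware cost.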
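     (3) A flit bubble flow control mechanism for torus networks-on-chip
     Cache-coherent networks-on-chip must carry both long and short packets. Existing deadlock avoidance theories for torus networks cannot handle this mix of packet lengths efficiently: they either need two virtual channels or must treat short packets as long ones. This thesis proposes the Flit Bubble Flow Control (FBFC) deadlock avoidance theory. FBFC avoids deadlock by maintaining one free buffer slot in a wormhole-switched ring. FBFC uses only one virtual channel, which allows a higher router frequency, and it does not need to treat every short packet as a long one, which improves buffer utilization. Based on the FBFC theory, this thesis presents two implementations, both of which significantly outperform existing designs.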
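     (4) A technique for efficiently supporting reduction and multicast communication in cache coherence protocols
     Cache coherence protocols rely on reduction and multicast communication, and hardware support is required to keep this communication from becoming a system performance bottleneck. This thesis studies hardware support for multicast cache-line invalidation messages and reduction acknowledgement (ACK) messages in directory coherence protocols. It proposes a message combination framework that supports combining reduction ACK messages; the framework lowers average packet latency at low to medium loads, raises network throughput, and needs only a small amount of extra hardware. In addition, this thesis proposes the Balanced, Adaptive Multicast (BAM) routing algorithm, which balances buffer resources across dimensions and further improves network throughput.
     In summary, this thesis centers on the goal of "optimizing network-on-chip design for cache-coherent communication". Based on an analysis of the structure and communication characteristics of coherence protocols, it optimizes routing algorithms and flow control mechanisms. These optimizations deliver some system performance improvement and also extend existing deadlock avoidance theory. The thesis therefore has both engineering value and theoretical significance.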

Abstract (English)

Advances in semiconductor technology keep increasing the core count, and traditional bus or point-to-point communication mechanisms face several challenges as a result, including low bandwidth, high latency, high power consumption, and poor scalability. The Network-on-Chip (NoC) was proposed to address these challenges and is regarded as a simple, efficient, and scalable communication paradigm for future many-core platforms. At the same time, because parallel programming is difficult and legacy code must remain compatible, cache coherence protocols will persist in many-core platforms. In cache-coherent many-core platforms, the traffic delivered by the NoC is largely determined by the cache coherence protocol in use. To support coherent traffic efficiently, it is necessary to analyze the traffic characteristics and then optimize the NoC design accordingly. The main contributions of this thesis are as follows.
     1. Efficient routing algorithm to support workload consolidation scenarios
     Due to the hierarchical cache coherence protocol and limited application parallelism, it is quite possible that multiple applications will run concurrently on a many-core platform. These workload consolidation scenarios require the routing algorithm to provide both sufficient adaptivity and dynamic isolation. This thesis proposes Destination-Based Adaptive Routing (DBAR). By leveraging a low-cost congestion propagation network, DBAR uses both local and non-local network status to avoid congestion efficiently. More importantly, by integrating destination information into the output port selection procedure, DBAR dynamically isolates multiple concurrent applications. DBAR offers better performance than the best baseline algorithm for many measured configurations.
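The idea of folding destination information into output port selection can be sketched as follows. This is a minimal illustration only, not the DBAR design itself: it assumes a 2D mesh with minimal routing, a hypothetical `load` map holding a congestion value per node (standing in for whatever the congestion propagation network delivers), and an assumed scoring rule that sums congestion along one dimension only up to the destination, so that nodes beyond the destination cannot influence the choice. The port names and the tie-break are also assumptions.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (x, y) position in a 2D mesh


def path_congestion(cur: Coord, dst: Coord, load: Dict[Coord, int], axis: int) -> int:
    """Sum the load of the nodes along one dimension, from the next hop up to
    the destination's column (axis 0) or row (axis 1).
    Nodes beyond the destination are deliberately ignored (assumed rule)."""
    step = 1 if dst[axis] > cur[axis] else -1
    pos = list(cur)
    total = 0
    while pos[axis] != dst[axis]:
        pos[axis] += step
        total += load.get((pos[0], pos[1]), 0)
    return total


def select_output_port(cur: Coord, dst: Coord, load: Dict[Coord, int]) -> str:
    """Choose among the minimal-path output ports of a mesh router."""
    if cur == dst:
        return "LOCAL"
    candidates = []
    if dst[0] != cur[0]:
        name = "EAST" if dst[0] > cur[0] else "WEST"
        candidates.append((path_congestion(cur, dst, load, 0), name))
    if dst[1] != cur[1]:
        name = "NORTH" if dst[1] > cur[1] else "SOUTH"
        candidates.append((path_congestion(cur, dst, load, 1), name))
    # Lowest congestion wins; on a tie the X-dimension candidate is kept.
    return min(candidates, key=lambda c: c[0])[1]
```

In this sketch, heavy load at a node past the destination (for example another application's region) never enters the score, which is one simple way to picture the dynamic isolation the abstract describes.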
     2. Efficient design of fully adaptive routing algorithm for cache coherent traffic
     Due to area and power constraints, cache-coherent NoCs are generally configured with a small number of virtual channels (VCs). Limited VCs pose several challenges to the design of fully adaptive routing algorithms. Previous deadlock avoidance theories require a conservative VC re-allocation scheme, which strongly limits the performance of routing algorithms. This thesis proposes a novel VC re-allocation scheme, whole packet forwarding (WPF). We prove that WPF does not induce deadlock; WPF is thus an important extension of previous deadlock avoidance theories. WPF can greatly improve the performance of fully adaptive routing algorithms. To use WPF efficiently in VC-limited networks, we design a novel fully adaptive routing algorithm that maintains packet adaptivity without significant hardware cost.
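The difference between a conservative VC re-allocation check and a whole-packet check can be sketched as below. The abstract does not spell out the exact WPF condition, so this is an assumption based on the scheme's name: a conservative policy is taken to mean that a VC is re-allocated only when it is empty, while the whole-packet variant is taken to mean that a non-empty VC may be re-allocated if its free slots can hold the entire incoming packet. The `VirtualChannel` fields and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualChannel:
    depth: int           # total flit slots in the VC buffer
    occupied: int        # flit slots currently in use
    tail_received: bool  # True once the previous packet's tail flit has arrived

    @property
    def free_slots(self) -> int:
        return self.depth - self.occupied


def can_reallocate_conservative(vc: VirtualChannel) -> bool:
    """Assumed conservative rule: the VC must be completely drained."""
    return vc.tail_received and vc.occupied == 0


def can_reallocate_whole_packet(vc: VirtualChannel, packet_len: int) -> bool:
    """Assumed whole-packet rule: the previous packet has fully arrived and
    the remaining free slots can hold every flit of the new packet."""
    return vc.tail_received and vc.free_slots >= packet_len
```

Under these assumptions, a 5-slot VC still holding one flit would reject any new packet under the conservative rule but accept a packet of up to four flits under the whole-packet rule, which illustrates why the less conservative check can raise buffer utilization when VCs are scarce.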
     3. Flit bubble flow control for torus cache coherent NoCs
     Short and long packets commonly co-exist in cache-coherent networks-on-chip (NoCs). Existing deadlock avoidance designs for torus networks do not efficiently handle this mix of packet sizes. These previous designs either use two VCs or treat each packet as a maximum-length packet. We propose a novel deadlock avoidance theory, flit bubble flow control (FBFC). The insight of FBFC is that maintaining one free flit-size buffer slot inside a ring can avoid deadlock for wormhole torus networks. Only one VC is required. FBFC does not treat short packets as long ones; this yields high buffer utilization. Based on this theory, we present two implementations, and both show large performance improvements over previous designs.
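The "one free flit-size slot" insight can be pictured with a small admission check. This is a sketch of one plausible localized rule, not necessarily either of the two implementations the thesis presents: it assumes a packet entering a ring is admitted only if the receiving buffer has more free slots than the packet has flits, so at least one free slot (the bubble) is left over, while flits already inside the ring advance whenever a single slot is free. Both function names are hypothetical.

```python
def can_inject(free_slots: int, packet_len: int) -> bool:
    """Admit a new packet into the ring only if one slot stays free after
    all of its flits are buffered (assumed localized rule)."""
    return free_slots > packet_len


def can_advance(free_slots: int) -> bool:
    """A flit already travelling in the ring needs just one free slot."""
    return free_slots >= 1
```

Because the check is made in flits rather than in maximum-length packets, a one-flit packet needs only two free slots to enter under this assumed rule, whereas a scheme that treats every packet as maximum length would reserve far more.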
     4. Efficient support of collective communication in cache coherence protocols
     Cache coherence protocols use reduction and multicast communication. Hardware support is necessary to prevent these collective communications from becoming a system bottleneck. This research explores support for reduction and multicast communication operations in a directory cache coherence protocol. This work makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. By combining ACK packets during transmission, this framework not only reduces packet latency but also improves the network saturation throughput with little overhead. The balanced buffer resource configuration of BAM yields additional saturation throughput improvements.
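Combining ACK packets in flight can be illustrated with a small reduction step. This is only a sketch under stated assumptions: each ACK is modeled as a hypothetical `(transaction_id, ack_count)` pair, and a router is assumed to merge the ACKs it currently holds for the same transaction into one packet carrying the summed count. The function names and the packet fields are not taken from the thesis.

```python
from typing import Dict, List, Tuple

Ack = Tuple[int, int]  # (transaction_id, ack_count), hypothetical fields


def combine_acks(acks: List[Ack]) -> List[Ack]:
    """Merge ACKs that belong to the same transaction by summing their counts.
    Returns one ACK per transaction, ordered by transaction id."""
    totals: Dict[int, int] = {}
    for txn, count in acks:
        totals[txn] = totals.get(txn, 0) + count
    return sorted(totals.items())


def all_acks_received(expected_sharers: int, received_count: int) -> bool:
    """The requester is done once the summed count matches the number of
    sharers that were sent an invalidation."""
    return received_count == expected_sharers
```

For example, three ACKs for one invalidation that meet at a router leave it as a single packet with a count of three, so fewer packets travel toward the requester while the total it eventually observes is unchanged.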
     In summary, this thesis aims at ‘optimizing the design of NoCs for cache coherence protocols’. Based on an analysis of the characteristics of coherent traffic, we optimize the design of routing algorithms and flow control mechanisms for NoCs. The proposed optimizations not only improve performance but also extend deadlock avoidance theories. Thus, this thesis has both engineering value and theoretical significance.

引文

[1] Dally W, Towles B. Route packets, not wires: on-chip interconnection network-s [C]. In DAC2001.
    [2] Jantsch H A, et al. Network on chip: An architecture for billion transistor era [C].In NorChip2000.
    [3] Benini L, De Micheli G. Powering networks on chips: energy-efficient and reli-able interconnect design for SoCs [C]. In Proceedings of the14th internationalsymposium on Systems synthesis.
    [4] Culler D, Singh J, Gupta A. Parallel computer architecture: a hardware/softwareapproach [M]. Morgan Kaufmann,1999.
    [5] Martin M M K, Hill M D, Sorin D J. Why on-chip cache coherence is here tostay [J]. Commun. ACM.2012,55(7):78–89.
    [6] Sorin D, Hill M, Wood D. A Primer on Memory Consistency and Cache Coher-ence [M].1st ed. Morgan&Claypool Publishers,2011.
    [7] Enright Jerger N, Peh L. On-Chip Networks [M].1st ed. Morgan&Claypool Pub-lishers,2009.
    [8] Agarwal N, Peh L, Jha N. In-network snoop ordering (INSO): Snoopy coherenceon unordered interconnects [C]. In HPCA2009.
    [9] Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Ap-proach [M].3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,2003.
    [10] Moore G. Cramming more components onto integrated circuits [J]. Electronics.1965,38(8).
    [11] Wittenbrink C, Kilgariff E, Prabhu A. Fermi GF100GPU architecture [J]. Micro,IEEE.2011,31(2):50–59.
    [12] ITRS. International Technology Roadmap for Semiconductors,2009Edition.http://www.itrs.net.2009.
    [13] Patterson D, Hennessy J. Computer organization and design: the hardware/soft-ware interface [M]. Morgan Kaufmann,2009.
    [14] Hinton G, Sager D, Upton M, et al. The microarchitecture of the Pentium4pro-cessor [C]. In Intel Technology Journal.2001.
    [15] Burger D, Goodman J. Billion-transistor architectures: There and back again [J].Computer.2004,37(3):22–28.
    [16] Sankaralingam K, Nagarajan R, Liu H, et al. Exploiting ILP, TLP, and DLP withthe polymorphous TRIPS architecture [C]. In Computer Architecture,2003. Pro-ceedings.30th Annual International Symposium on.2003:422–433.
    [17] KahleJA,etal.IntroductiontotheCellmultiprocessor[J].IBMJ.Res.Dev.2005,49(4.5):589–604.
    [18] Taylor M, Psota J, Saraf A, et al. Evaluation of the Raw microprocessor: Anexposed-wire-delay architecture for ILP and streams [C]. In ISCA2004.
    [19] Hammond L, Hubbert B, Siu M, et al. The stanford hydra cmp [J]. Micro, IEEE.2000,20(2):71–84.
    [20] Vangal S, et al. An80-tile1.28TFLOPS network-on-chip in65nm CMOS [C]. InISSCC2007.
    [21] del Cuvillo J, Zhu W, Hu Z, et al. Toward a software infrastructure for the cyclops-64cellular architecture [C]. In HPCS2006.
    [22] Wentzlaff D, et al. On-Chip Interconnection Architecture of the Tile Processor [J].Micro, IEEE.2007,27(5):15–31.
    [23] Sterling T. Multicore: the New Moore’s Law [C]. In Invited Presentation to ICS2007.
    [24] Lu Z, Jantsch A. Trends of terascale computing Chips in the next ten years [C]. InASICON2009.
    [25] Guerrier P, Greiner A. A generic architecture for on-chip packet-switched inter-connections [C]. In DATE2000.
    [26] Adriahantenaina A, et al. SPIN: A Scalable, Packet Switched, On-Chip Micro-Network [C]. In DATE2003.
    [27] Balfour J, Dally W. Design tradeoffs for tiled CMP on-chip networks [C]. In ICS2006.
    [28] Kim J, Balfour J, Dally W. Flattened Butterfly Topology for On-Chip Network-s [C]. In MICRO2007.
    [29] Grot B, et al. Express Cube Topologies for on-Chip Interconnects [C]. In HPCA2009.
    [30] Bourduas S, Zilic Z. Modeling and evaluation of ring-based interconnects forNetwork-on-Chip [J]. J. Syst. Archit.2011,57(1):39–60.
    [31] Fallin C, et al. A High-Performance Hierarchical Ring On-Chip Interconnect withLow-Cost Routers [C]. In SAFARI Technical Report.2011.
    [32] Mirza-Aghatabar M, et al. An Empirical Investigation of Mesh and Torus NoCTopologies Under Different Routing Algorithms and Traffic Models [C]. In DSD2007.
    [33] Mishra A K, Vijaykrishnan N, Das C R. A case for heterogeneous on-chip inter-connects for CMPs [C]. In ISCA2011.
    [34] Shin M, Kim J. Leveraging torus topology with deadlock recovery for cost-efficient on-chip network [C]. In ICCD2011.
    [35] Manevich R, Walter I, Cidon I, et al. Best of both worlds: A bus enhanced NoC(BENoC)[C]. In NOCS2009.
    [36] Das R, et al. Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs [C]. In HPCA2009.
    [37] Zhao H, et al. A hybrid NoC design for cache coherence optimization for chipmultiprocessors [C]. In DAC2012.
    [38] PehL-S,DallyW.Adelaymodelandspeculativearchitectureforpipelinedrouter-s [C]. In HPCA2001.
    [39] Mullins R, West A, Moore S. Low-latency virtual-channel routers for on-chip net-works [C]. In ISCA2004.
    [40] Matsutani H, et al. Prediction router: Yet another low latency on-chip router archi-tecture [C]. In HPCA2009.
    [41] Kumar A, et al. A4.6Tbits/s3.6GHz single-cycle NoC router with a novel switchallocator in65nm CMOS [C]. In ICCD2007.
    [42] Abad P, et al. Rotary router: an efficient architecture for CMP interconnectionnetworks [C]. In ISCA2007.
    [43] Kim J. Low-cost router microarchitecture for on-chip networks [C]. In MICRO2009.
    [44] Becker D U, Dally W J. Allocator implementations for network-on-chip router-s [C]. In SC2009.
    [45] Fallin C, et al. CHIPPER: A low-complexity bufferless deflection router [C]. InHPCA2011.
    [46] Dimitrakopoulos G, Galanopoulos K. Switch allocator for bufferless network-on-chip routers [C]. In INA-OCMC2011.
    [47] Michelogiannakis G, Jiang N, Becker D, et al. Packet chaining: efficient single-cycle allocation for on-chip networks [C]. In MICRO2011.
    [48] Hayenga M, Lipasti M. The NoX router [C]. In MICRO2011.
    [49] Nicopoulos C, et al. ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers [C]. In MICRO2006.
    [50] Xu Y, et al. Simple virtual channel allocation for high throughput and high fre-quency on-chip routers [C]. In HPCA2010.
    [51] Ramanujam R S, et al. Design of a High-Throughput Distributed Shared-BufferNoC Router [C]. In NOCS2010.
    [52] Ahmadinia A, Shahrabi A. A Highly Adaptive and Efficient Router Architecturefor Network-on-Chip [J]. Comput. J.2011,54(8):1295–1307.
    [53] BeckerD,etal.AdaptiveBackpressure:EfficientBufferManagementforOn-ChipNetworks [C]. In ICCD2012.
    [54] Michelogiannakis G, Balfour J, Dally W. Elastic-buffer flow control for on-chipnetworks [C]. In HPCA2009.
    [55] Kodi A, Sarathy A, Louri A. iDEAL: Inter-router dual-function energy and area-efficient links for network-on-chip (NoC) architectures [C]. In ISCA2008.
    [56] Kim G, Kim J, Yoo S. FlexiBuffer: Reducing leakage power in on-chip networkrouters [C]. In DAC2011.
    [57] Hu J, Marculescu R. DyAD-smart routing for networks-on-chip [C]. In DAC2004.
    [58] Seo D, et al. Near-optimal worst-case throughput routing for two-dimensionalmesh networks [C]. In ISCA2005.
    [59] Singh A, et al. GOAL: a load-balanced adaptive routing algorithm for torus net-works [C]. In ISCA2003.
    [60] Kim J, et al. A low latency router supporting adaptivity for on-chip interconnect-s [C]. In DAC2005.
    [61] Li M, Zeng Q-A, Jone W-B. DyXY-a proximity congestion-aware deadlock-freedynamic routing method for network on chip [C]. In DAC2006.
    [62] AsciaG,etal.ImplementationandAnalysisofaNewSelectionStrategyforAdap-tive Routing in Networks-on-Chip [J]. Computers, IEEE Transactions on.2008,57(6):809–820.
    [63] Gratz P, Grot B, Keckler S. Regional congestion awareness for load balance innetworks-on-chip [C]. In HPCA2008.
    [64] Ramanujam R S, Lin B. Destination-based adaptive routing on2D mesh network-s [C]. In ANCS2010.
    [65] Rodrigo S, et al. Efficient unicast and multicast support for CMPs [C]. In MICRO2008.
    [66] Zhang Z, Greiner A, Taktak S. A reconfigurable routing algorithm for a fault-tolerant2D-Mesh Network-on-Chip [C]. In DAC2008.
    [67] Kinsy M A, et al. Application-aware deadlock-free oblivious routing [C]. In ISCA2009.
    [68] Lu Z, Yin B, Jantsch A. Connection-oriented multicasting in wormhole-switchednetworks on chip [C]. In IEEE Symp. on Emerging VLSI Tech. and Architectures.
    [69] Enright Jerger N, Peh L-S, Lipasti M. Virtual Circuit Tree Multicasting: A Casefor On-Chip Hardware Multicast Support [C]. In ISCA2008.
    [70] Wang L, et al. Recursive Partitioning Multicast: A bandwidth-efficient routing forNetworks-on-Chip [C]. In NOCS2009.
    [71] Wang X, et al. On an efficient NoC multicasting scheme in support of multiple ap-plications running on irregular sub-networks [J]. Microprocess. Microsyst.2011,35:119–129.
    [72] Abad P, Puente V, Gregorio J-A. MRR: Enabling fully adaptive multicast routingfor CMP interconnection networks [C]. In HPCA2009.
    [73] Krishna T, Peh L-S, Beckmann B M, et al. Towards the Ideal On-chip Fabric for1-to-Many and Many-to-1Communication [C]. In MICRO2011.
    [74] Kang Y H, Sondeen J, Draper J. Multicast routing with dynamic packet fragmen-tation [C]. In GLSVLSI2009.
    [75] Peh L, Dally W. Flit-reservation flow control [C]. In HPCA2000.
    [76] Kumar A, et al. Express Virtual Channels: Towards the Ideal Interconnection Fab-ric [C]. In ISCA2007.
    [77] Kumar A, Peh L-S, Jha N K. Token flow control [C]. In MICRO2008.
    [78] Lu Z, Liu M, Jantsch A. Layered switching for networks on chip [C]. In DAC2007.
    [79] Enright Jerger N, Peh L, Lipasti M. Circuit-switched coherence [C]. In NOCS2008.
    [80] Samman F, et al. Wormhole cut-through switching: Flit-level messages interleav-ing for virtual-channelless network-on-chip [J]. Microprocess. Microsyst.2011,35(3):343–358.
    [81] Concer N, Petracca M, Carloni L P. Distributed flit-buffer flow control fornetworks-on-chip [C]. In CODES+ISSS2008.
    [82] Chen L, et al. Critical Bubble Scheme: An Efficient Implementation of GloballyAware Network Flow Control [C]. In IPDPS2011.
    [83] Joshi A, Mutyam M. Prevention flow-control for low latency torus networks-on-chip [C]. In NOCS2011.
    [84] Cheng L, et al. Interconnect-aware coherence protocols for chip multiproces-sors [C].
    [85] Muralimanohar N, Balasubramonian R. Interconnect design considerations forlarge NUCA caches [C].
    [86] Eisley N, Peh L, Shang L. In-network cache coherence [C]. In MICRO2006.
    [87] Agarwal N, Peh L, Jha N. In-network coherence filtering: snoopy coherence with-out broadcasts [C]. In MICRO2009.
    [88] Enright Jerger N. SigNet: Network-on-chip filtering for coarse vector directo-ries [C]. In DATE2010.
    [89] Hu J, Marculescu R. Energy-and performance-aware mapping for regular NoC ar-chitectures [J]. Computer-Aided Design of Integrated Circuits and Systems, IEEETransactions on.2005,24(4):551–562.
    [90] Chou C-L, Ogras U, Marculescu R. Energy-and Performance-Aware IncrementalMappingforNetworksonChipWithMultipleVoltageLevels[J].Computer-AidedDesign of Integrated Circuits and Systems, IEEE Transactions on.2008,27(10):1866–1879.
    [91] Chou C-L, Marculescu R. Run-Time Task Allocation Considering User Behaviorin Embedded Multiprocessor Networks-on-Chip [J]. Computer-Aided Design ofIntegrated Circuits and Systems, IEEE Transactions on.2010,29(1):78–91.
    [92] LeiT,KumarS.Atwo-stepgeneticalgorithmformappingtaskgraphstoanetworkon chip architecture [C]. In DSD2003.
    [93] Grot B, Hestness J, Keckler S, et al. Kilo-NOC: a heterogeneous network-on-chiparchitecture for scalability and service guarantees [C]. In ISCA2011.
    [94] Lee J, Ng M, Asanovic K. Globally-synchronized frames for guaranteed quality-of-service in on-chip networks [C]. In ISCA2008.
    [95] Grot B, Keckler S, Mutlu O. Preemptive virtual clock: a flexible, efficient, andcost-effective QOS scheme for networks-on-chip [C]. In MICRO2009.
    [96] Ouyang J, Xie Y. Loft: A high performance network-on-chip providing quality-of-service support [C]. In MICRO2010.
    [97] Qian Y, Lu Z, Dou W. Analysis of worst-case delay bounds for on-chip packet-switching networks [J]. Computer-Aided Design of Integrated Circuits and Sys-tems, IEEE Transactions on.2010,29(5):802–815.
    [98] Qian Y, Lu Z, Dou W. Applying network calculus for performance analysis ofself-similar traffic in on-chip networks [C]. In CODES+ISSS2009.
    [99] Carloni L, Pande P, Xie Y. Networks-on-chip in emerging interconnect paradigms:Advantages and challenges [C]. In NOCS2009.
    [100] Jose A, Patounakis G, Shepard K. Pulsed current-mode signaling for nearly speed-of-light intrachip communication [J]. Solid-State Circuits, IEEE Journal of.2006,41(4):772–780.
    [101] KimB,StojanovicV.Equalizedinterconnectsforon-chipnetworks:Modelingandoptimization framework [C]. In ICCAD2007.
    [102] Kirman N, Kirman M, Dokania R, et al. Leveraging optical technology in futurebus-based chip multiprocessors [C]. In MICRO2006.
    [103] Shacham A, Bergman K, Carloni L. The case for low-power photonic networks onchip [C]. In DAC2007.
    [104] Chang M, et al. CMP network-on-chip overlaid with multi-band RF-interconnect [C]. In HPCA2008.
    [105] VandeveldeB,etal.Thermo-mechanicsof3D-waferleveland3DstackedICpack-aging technologies [C]. In EuroSimE2008.
    [106] Mishra A, Dong X, Sun G, et al. Architecting on-chip interconnects for stacked3D STT-RAM caches in CMPs [C]. In HPCA2011.
    [107] Park D, Eachempati S, Das R, et al. MIRA: A multi-layered on-chip interconnectrouter architecture [C]. In ISCA2008.
    [108] Hopkins D, et al. Circuit techniques to enable430GB/s/mm2proximity commu-nication [C]. In ISSCC2007.
    [109] Gratz P, et al. On-Chip Interconnection Networks of the TRIPS Chip [J]. Micro,IEEE.2007,27(5):41–50.
    [110] LiangJ,SwaminathanS,TessierR.aSOC:Ascalable,single-chipcommunicationsarchitecture [C]. In PACT2000.
    [111] Millberg M, et al. The Nostrum backbone-a communication protocol stack for net-works on chip [C]. In VLSI Design2004.
    [112] Dall’Osso M, et al. Xpipes: a latency insensitive parameterized network-on-chiparchitecture for multiprocessor SoCs [C]. In ICCD2003.
    [113] Bolotin E, et al. QNoC: QoS architecture and design process for network onchip [J]. Journal of Systems Architecture.2004,50(2):105–128.
    [114] BjerregaardT,SparsoJ.Arouterarchitectureforconnection-orientedserviceguar-antees in the MANGO clockless network-on-chip [C]. In DATE2005.
    [115] Park S, et al. Approaching the theoretical limits of a mesh NoC with a16-nodechip prototype in45nm SOI [C]. In DAC2012.
    [116] Steenhof F, et al. Networks on chips for high-end consumer-electronics TV systemarchitectures [C]. In DATE2006.
    [117] Coppola M, et al. Design of Cost-Efficient Interconnect Processing Units: Spider-gon STNoC [M]. CRC,2008.
    [118] Howard J, et al. A48-core IA-32message-passing processor with DVFS in45nmCMOS [C]. In ISSCC2010.
    [119] Agarwal A, et al. Tile processor: Embedded multicore for networking and multi-media [C]. In Hot Chips2007.
    [120] Seventh ACM/IEEE International Symposium on Networks-on-Chip. http://nocsymposium.org/.2013.
    [121]林世俊，张凡，金德鹏，等．分布式同步的GALS片上网络及其接口设计[J]．清华大学学报:自然科学版．2008，48(1)：32–35．
    [122]马立伟，孙义和．片上网络拓朴优化:在离散平面上布局与布线[J]．电子学报．2007，35(5)：906–911．
    [123]王宏伟，陆俊林，佟冬，等．层次化片上网络结构的簇生成算法[J]．电子学报．2007，35(5)：916–920．
    [124] Yu Z, You K, Xiao R, et al. An800MHz320mW16-core processor with message-passing and shared-memory inter-core communication mechanisms [C]. In ISSCC2012.
    [125]朱晓静，胡伟武，马可，等．Xmesh:一个mesh-like片上网络拓扑结构[J]．软件学报．2007，18(9)：2194–2204．
    [126]朱晓静．片上网络的结构设计与性能分析[D]．合肥：中国科学技术大学，2008．
    [127]张磊，李华伟，李晓维．用于片上网络的容错通信算法[J]．计算机辅助设计与图形学学报．2007，19(4)：508–514．
    [128]黄琨，马可，曾洪博，等．一种分片式多核处理器的用户级模拟器[J]．软件学报．2008，19(4)：1069–1080．
    [129]付方发，张庆利，王进祥，等．支持多种流量分布的片上网络性能评估技术研究[J]．哈尔滨工业大学学报．2007，39(5)：830–834．
    [130]李磊．片上网络NoC的通信研究[D]．杭州：浙江大学，2007．
    [131]武畅．片上网络体系结构和关键通信技术研究[D]．西安：电子科技大学，2008．
    [132]常政威．网络化MPSoC高能效设计技术研究[D]．西安：电子科技大学，2009．
    [133]荆元利．基于片上网络的系统芯片研究[D]．西安：西北工业大学，2005．
    [134]肖翔，董渭清，文敏华．网环步进码片上网络自适应路由算法设计[J]．西安交通大学学报．2009，43(012)：70–74．
    [135]唐杉．基于片上网络互联的SoC调试技术研究[D]．北京：北京邮电大学，2008．
    [136]赵宏智．2D Mesh片上网络中交换机服务性能影响的研究及其拓扑改进[J]．电子学报．2009，37(2)：294–298．
    [137]杨盛光，李丽，高明伦，等．面向能耗和延时的NoC映射方法[J]．电子学报．2008，36(5)：937–942．
    [138]段新明．面向NoC的无死锁路由算法的研究[D]．天津：南开大学，2007．
    [139]陶海洋．片上网络低能耗和低延迟研究[D]．长沙：湖南大学，2009．
    [140]朱兵．基于片上网络的通信路由方法研究[D]．合肥：合肥工业大学，2009．
    [141]刘祥远．多核SoC片上网络关键技术研究[D]．长沙：国防科学技术大学，2007．
    [142]钱悦．片上网络演算模型及性能分析[D]．长沙：国防科学技术大学，2010．
    [143] Li Z, Zhu C, Shang L, et al. Transaction-aware network-on-chip resource reserva-tion [J]. Computer Architecture Letters.2008,7(2):53–56.
    [144] Fu B, et al. An abacus turn model for time/space-efficient reconfigurable rout-ing [C]. In ISCA2011.
    [145] Lai M, Wang Z, Gao L, et al. A dynamically-allocated virtual channel architecturewith congestion awareness for on-chip routers [C]. In DAC2008.
    [146] ShiW,XuW,RenH,etal.Anovelshared-bufferrouterfornetwork-on-chipbasedon Hierarchical Bit-line Buffer [C]. In ICCD2011.
    [147]国家自然科学基金委员会．http://isis.nsfc.gov.cn/.
    [148] Ma S, Enright Jerger N, Wang Z. DBAR: an efficient routing algorithm to supportmultiple concurrent applications in networks-on-chip [C]. In ISCA2011.
    [149] Ma S, Enright Jerger N, Wang Z, et al. Holistic Routing Algorithm Design to Sup-portWorkloadConsolidationinNoCs[J].Computers,IEEETransactionson.2012,99(PrePrints).
    [150] Ma S, Enright Jerger N, Wang Z. Whole packet forwarding: Efficient design offully adaptive routing algorithms for networks-on-chip [C]. In HPCA.2012.
    [151] Ma S, Enright Jerger N, Wang Z. Supporting Efficient Collective Communicationin NoCs [C]. In HPCA.2012.
    [152] DamarajuS,etal.A22nmIAMulti-CPUandGPUSystem-on-Chip[C].InISSCC2012.
    [153] KimJ,KimH.Routermicroarchitectureandscalabilityofringtopologyinon-chipnetworks [C]. In NoCArc2009.
    [154] Scott S L, et al. The Cray T3E Network: Adaptive Routing in a High Performance3D Torus [C]. In Hot Interconnects1996.
    [155] Adiga N R, et al. Blue Gene/L torus interconnection network [J]. IBM J. Res. Dev.2005,49(2.3):265–276.
    [156] Kumar P, et al. Exploring concentration and channel slicing in on-chip networkrouter [C]. In NOCS2009.
    [157] Michelogiannakis G, Pnevmatikatos D, Katevenis M. Approaching ideal NoC la-tency with pre-configured routes [C]. In NOCS2007.
    [158] ValiantL,BrebnerG.Universalschemesforparallelcommunication[C].InSTOC1981.
    [159] Glass C, Ni L. The Turn Model for Adaptive Routing [C]. In ISCA1992.
    [160] Chiu G-M. The odd-even turn model for adaptive routing [J]. Parallel and Dis-tributed Systems, IEEE Transactions on.2000,11(7):729–738.
    [161] Dally W, Seitz C. Deadlock-Free Message Routing in Multiprocessor Intercon-nection Networks [J]. IEEE Trans. Comput.1987.
    [162] DuatoJ.Anewtheoryofdeadlock-freeadaptiveroutinginwormholenetworks[J].ParallelandDistributedSystems,IEEETransactionson.1993,4(12):1320–1331.
    [163] DallyW,TowlesB.PrinciplesandPracticesofInterconnectionNetworks[M].SanFrancisco, CA, USA: Morgan Kaufmann Publishers Inc.,2003.
    [164] Kermani P, Kleinrock L. Virtual cut-through: a new computer communicationswitching technique [J]. Computer Networks.1979,3:267–286.
    [165] Dally W J, Seitz C L. The Torus Routing Chip [J]. Distributed Computing.1986,1:187–196.
    [166] Ogras U Y, Hu J, Marculescu R. Key research problems in NoC design: a holisticperspective [C]. In CODES+ISSS2005.
    [167] Marculescu R, et al. Outstanding research problems in NoC design: system, mi-croarchitecture, and circuit perspectives [J]. Trans. Comp.-Aided Des. Integ. Cir.Sys.2009,28:3–21.
    [168] Dally W. Virtual-channel flow control [J]. Parallel and Distributed Systems, IEEETransactions on.1992,3(2):194–205.
    [169] Moscibroda T, Mutlu O. A case for bufferless routing in on-chip networks [C]. InISCA.2009.
    [170] Hayenga M, Enright Jerger N, Lipasti M. SCARAB: A single cycle adaptive rout-ing and bufferless network [C]. In MICRO2009.
    [171] Shin K, Daniel S. Analysis and implementation of hybrid switching [C]. In ISCA1995.
    [172] Stunkel C B, et al. The SP2high-performance switch [J]. IBM Syst. J.1995,34:185–204.
    [173] Galles M. Spider: a high-speed network interconnect [J]. Micro, IEEE.1997,17(1):34–39.
    [174] Hirata Y, et al. A variable-pipeline on-chip router optimized to traffic pattern [C].In NoCArc2010.
    [175] Gratz P, Sankaralingam K, Hanson H, et al. Implementation and Evaluation of aDynamically Routed Processor Operand Network [C]. In NOCS2007.
    [176] Choi B, et al. Denovo: Rethinking hardware for disciplined parallelism [C]. InHotPar2010.
    [177] Kelm J, et al. Cohesion: An adaptive hybrid memory model for accelerators [J].Micro, IEEE.2011,31(1):42–55.
    [178] Shah M, et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARCSOC [C]. In ASSCC2007.
    [179] ButlerM,etal.Bulldozer:Anapproachtomultithreadedcomputeperformance[J].Micro, IEEE.2011,31(2):6–15.
    [180] Nickolls J, Dally W. The GPU computing era [J]. Micro, IEEE.2010,30(2):56–69.
    [181] Hansson A, Goossens K, R dulescu A. Avoiding message-dependent deadlock innetwork-based systems on chip [J]. VLSI design.2007,2007.
    [182] Song Y, Pinkston T. A progressive approach to handling message-dependent dead-lock in parallel computer systems [J]. Parallel and Distributed Systems, IEEETransactions on.2003,14(3):259–275.
    [183] Gharachorloo K, et al. Architecture and design of AlphaServer GS320[C]. In AS-PLOS2000.
    [184] Marty M R, Hill M D. Virtual hierarchies to support server consolidation [C]. InISCA2007.
    [185] Neelakantam N, et al. FeS2: A Full-system Execution-driven Simulator forx86[C]. In Poster presented at ASPLOS2008.
    [186] Magnusson P S, et al. Simics: A Full System Simulation Platform [J]. Computer.2002,35:50–58.
    [187] Yourst M. PTLsim: A cycle accurate full system x86-64microarchitectural simu-lator [C]. In ISPASS2007.
    [188] Martin M M K, et al. Multifacet’s general execution-driven multiprocessor simu-lator (GEMS) toolset [J]. SIGARCH Comput. Archit. News.2005,33:92–99.
    [189] Hoskote Y, et al. A5-GHz Mesh Interconnect for a Teraflops Processor [J]. Micro,IEEE.2007,27(5):51–61.
    [190] llitzky D, et al. Architecture of the Scalable Communications Core’s Network onChip [J]. Micro, IEEE.2007,27(5):62–74.
    [191] Bell S, et al. TILE64-Processor: A64-Core SoC with Mesh Interconnect [C]. InISSCC2008.
    [192] Zhuravlev S, Blagodurov S, Fedorova A. Addressing shared resource contentionin multicore processors via scheduling [C]. In ASPLOS2010.
    [193] Mutlu O, Moscibroda T. Parallelism-Aware Batch Scheduling: Enhancing bothPerformance and Fairness of Shared DRAM Systems [C]. In ISCA2008.
    [194] Schwiebert L, Bell R. Performance tuning of adaptive wormhole routing throughselection function choice [J]. J. Parallel Distrib. Comput.2002,62:1121–1141.
    [195] Feng W-C, Shin K G. Impact of selection functions on routing algorithm perfor-mance in multicomputer networks [C]. In ICS1997.
    [196] Martínez J C, et al. On the Influence of the Selection Function on the Performanceof Networks of Workstations [C]. In ISHPC2000.
    [197] Dally W J, Aoki H. Deadlock-Free Adaptive Routing in Multicomputer NetworksUsing Virtual Channels [J]. Parallel and Distributed Systems, IEEE Transactionson.1993,4:466–475.
    [198] DallyW,SeitzC.Deadlock-FreeMessageRoutinginMultiprocessorInterconnec-tion Networks [J]. Computers, IEEE Transactions on.1987, C-36(5):547–553.
    [199] Duato J. A necessary and sufficient condition for deadlock-free adaptive routingin wormhole networks [J]. Parallel and Distributed Systems, IEEE Transactionson.1995,6(10):1055–1067.
    [200] Duato J. A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks [J]. Parallel and Distributed Systems,IEEE Transactions on.1996,7(8):841–854.
    [201] Bienia C, et al. The PARSEC benchmark suite: characterization and architecturalimplications [C]. In PACT2008.
    [202] Boyd-Wickizer S, et al. An analysis of Linux scalability to many cores [C]. InOSDI2010.
    [203] Das R, et al. Performance and power optimization through data compression inNetwork-on-Chip architectures [C]. In HPCA2008.
    [204] SanchezD,MichelogiannakisG,KozyrakisC.Ananalysisofon-chipinterconnec-tion networks for large-scale chip multiprocessors [J]. ACM Trans. Archit. CodeOptim.2010,7(1):4:1–4:28.
    [205] Michelogiannakis G, et al. Evaluating Bufferless Flow Control for On-chip Net-works [C]. In NOCS2010.
    [206] Fleury E, Fraigniaud P. A General Theory for Deadlock Avoidance in Wormhole-Routed Networks [J]. IEEE Trans. Parallel Distrib. Syst.1998,9:626–638.
    [207] LinX,McKinleyP,NiL.Themessageflowmodelforroutinginwormhole-routednetworks [J]. Parallel and Distributed Systems, IEEE Transactions on.1995,6(7):755–760.
    [208] Schwiebert L, Jayasimha D N. A necessary and sufficient condition for deadlock-free wormhole routing [J]. J. Parallel Distrib. Comput.1996,32:103–117.
    [209] Verbeek F, Schmaltz J. On Necessary and Sufficient Conditions for Deadlock-Free Routing in Wormhole Networks [J]. Parallel and Distributed Systems, IEEETransactions on.2011,22(12):2022–2032.
    [210] Mukherjee S, et al. The Alpha 21364 network architecture [C]. In Hot Interconnects 2001.
    [211] Verbeek F, Schmaltz J. A Comment on “A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks” [J]. Parallel and Distributed Systems, IEEE Transactions on. 2011, 22(10):1775–1776.
    [212] Anjan K, Pinkston T. An efficient, fully adaptive deadlock recovery scheme: DISHA [C]. In ISCA 1995.
    [213] Duato J, Pinkston T. A general theory for deadlock-free adaptive routing using a mixed set of resources [J]. Parallel and Distributed Systems, IEEE Transactions on. 2001, 12(12):1219–1235.
    [214] Vaidya A, Sivasubramaniam A, Das C. Impact of virtual channels and adaptive routing on application performance [J]. Parallel and Distributed Systems, IEEE Transactions on. 2001, 12(2):223–237.
    [215] Jin Y, Yum K H, Kim E J. Adaptive data compression for high-performance low-power on-chip networks [C]. In MICRO 2008.
    [216] Tamir Y, Frazier G. High-performance multiqueue buffers for VLSI communication switches [C]. In ISCA 1988.
    [217] Carrion C, et al. A flow control mechanism to avoid message deadlock in k-ary n-cube networks [C]. In HiPC 1997.
    [218] Lenoski D, et al. The directory-based cache coherence protocol for the DASH multiprocessor [C]. In ISCA 1990.
    [219] Laudon J, Lenoski D. The SGI Origin: a ccNUMA highly scalable server [C]. In ISCA 1997.
    [220] Barroso L A, et al. Piranha: a scalable architecture based on single-chip multiprocessing [C]. In ISCA 2000.
    [221] Conway P, Hughes B. The AMD Opteron Northbridge Architecture [J]. Micro, IEEE. 2007, 27(2):10–21.
    [222] Puente V, et al. The adaptive bubble router [J]. J. Parallel Distrib. Comput. 2001, 61(9):1180–1208.
    [223] Chen L, Pinkston T. Personal communication. 2012.
    [224] Duato J, et al. A comparison of router architectures for virtual cut-through and wormhole switching in a NOW environment [J]. J. Parallel Distrib. Comput. 2001, 61(2):224–253.
    [225] Bhuyan L, et al. Approximate Analysis of Single and Multiple Ring Networks [J]. IEEE Trans. Comput. 1989, 38:1027–1040.
    [226] Ainsworth T, Pinkston T. On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus [C]. In NOCS 2007.
    [227] Gottlieb A, et al. The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine [C]. In ISCA 1982.
    [228] Leiserson C, et al. The Network Architecture of the Connection Machine CM-5 [C]. In Journal of Parallel and Distributed Computing. 1992:272–285.
    [229] Cray Research Inc. CRAY T3D System Architecture Overview [C]. 1993.
    [230] Adiga N R, et al. An Overview of the BlueGene/L Supercomputer [C]. In SC 2002.
    [231] Yang X, Liao X, Lu K, et al. The TianHe-1A Supercomputer: Its Hardware and Software [J]. J. Comput. Sci. Technol. 2011, 26(3):344–351.
    [232] Oh J, Prvulovic M, Zajic A. TLSync: support for multiple fast barriers using on-chip transmission lines [C]. In ISCA 2011.
    [233] Martin M, Hill M, Wood D. Token Coherence: decoupling performance and correctness [C]. In ISCA 2003.
    [234] Panda D. Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms [C]. In HPCA 1995.
    [235] Xu H, McKinley P, Ni L. Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers [C]. In ICDCS 1992.
    [236] Samman F, Hollstein T, Glesner M. New Theory for Deadlock-Free Multicast Routing in Wormhole-Switched Virtual-Channelless Networks-on-Chip [J]. Parallel and Distributed Systems, IEEE Transactions on. 2011, 22(4):544–557.
    [237] Duato J, Yalamanchili S, Ni L. Interconnection Networks: An Engineering Approach [M]. 1st ed. Los Alamitos, CA, USA: IEEE Computer Society Press, 1997.
    [238] Bolotin E, et al. The Power of Priority: NoC Based Distributed Cache Coherency [C]. In NOCS 2007.
    [239] Enright Jerger N D, Peh L-S, Lipasti M H. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence [C]. In MICRO 2008.
    [240] Chiang C-M, Ni L M. Multi-address Encoding for Multicast [C]. In 1st International Workshop on Parallel Computer Routing and Communication 1994.
    [241] Muralimanohar N, Balasubramonian R, Jouppi N. CACTI 6.0: A tool to model large caches, HPL-2009-85 [R]. 2009.
    [242] Gonzalez R, Horowitz M. Energy dissipation in general purpose microprocessors [J]. Solid-State Circuits, IEEE Journal of. 1996, 31(9):1277–1284.
    [243] Wang L, et al. Efficient lookahead routing and header compression for multicasting in networks-on-chip [C]. In ANCS 2010.
    [244] Gupta A, Weber W-D, Mowry T. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes [C]. In ICPP 1990.
