Research on Key Techniques for Data Dependence Violation Checking in Thread-Level Speculative Execution

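English title: Research on the Data Dependence Violation Checking of Thread Level Speculative Execution
Author: Lai Xin (赖鑫)
Thesis level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords (translated): Multi-core processor; Parallel programming; Thread-level speculative execution; Data dependence violation checking; Cache coherence; Stochastic network calculus; On-chip optical interconnect
English keywords: Multi-processor; Multi-thread programming; Thread-Level Speculation; Data Dependence Violation Checking; Cache Coherence; Statistical Network Calculus; On-Chip Optical
Degree year: 2012
Advisor: Wang Zhiying (王志英)
Discipline code: 0812
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2012-09-01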

Abstract

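As semiconductor process technology advances, processors are moving toward many-core designs, and the network-on-chip is gradually replacing the bus as the basic infrastructure for inter-core communication. New process technologies have changed the on-chip design paradigm and make it possible to integrate more processor cores on a single chip. However, many-core systems run at low efficiency, and a series of scientific and technical problems remain to be solved. Thread-level speculative execution can substantially improve the efficiency of many-core systems, but it also raises new problems, chiefly data dependence violation checking between speculative threads and the performance evaluation and design of the network-on-chip. This work studies the core theory and design techniques of data dependence violation checking in thread-level speculative execution. It provides a theoretical and technical foundation for improving such checking and has both theoretical significance and practical value. The results are as follows: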
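     1. An optimized ordered linked list implementation for data dependence violation checking is proposed. Based on an analysis of a typical data dependence violation checking mechanism and its runtime characteristics, an improved implementation is proposed for the hardware ordered linked list used in global data dependence violation checking. The improved implementation combines the cache implementation mechanism with the operating principle of dual-port RAM, so that fast lookup and insertion in the linked list are pipelined and parallelized; its regular structure lends itself to VLSI implementation. For global components used in data dependence violation checking, such as the hardware ordered linked list, performance analysis formulas are derived. For different checking and thread invalidation schemes, these formulas give analytical equations relating the restart probability of speculative threads to the memory access frequency, the number of processor cores, and the probability that speculative threads have a data dependence. In addition, the GCRA (Generic Cell Rate Algorithm) equation is used to model the memory accesses of speculative threads and, combined with network calculus theory, upper bounds on the buffer size and delay of the global checking component are derived. Using the performance analysis formulas together with simulation experiments, the optimal storage configuration and implementation of the ordered linked list are determined for different thread dispatch scenarios.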
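     2. A memory coherence technique for thread-level speculative execution on SMP systems is proposed. The technique uses the L1 cache coherence protocol to perform data dependence violation checking and uses the L2 cache to handle the unavoidable cache block replacements caused by thread switching. The coherence protocol extends the MESI protocol and removes the drawbacks of centralized data dependence violation checking in several ways. The protocol adds to the L1 cache a version priority register that stores the speculation degree of the thread, and uses this register to compare speculative data versions. The technique uses a data write token ring to mark the latest modification made to the data by speculative threads in the system and, together with invalidation vector registers that record RAW data dependences between threads, performs distributed data dependence violation checking. If a speculative read miss from the bus has a higher speculation degree, the L1 cache updates the invalidation vector according to the processor core ID after obtaining the bus snooping token ring for the data. Thread invalidation uses a delayed invalidation mechanism to reduce the number of thread invalidations and restarts. In addition, the L1 cache adds speculative execution sub-modes according to the execution state of the speculative thread, which resolves the misplaced data dependence violation checks caused by thread invalidation. Based on the switching and memory access characteristics of speculative threads, a distributed-shared buffer is set up in the L2 cache to hold replaced L1 cache blocks.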
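     3. A method based on stochastic network calculus for analyzing the communication performance of data dependence violation checking packets is proposed. These packets are mainly triggered by cache coherence events. By abstracting the information flows triggered by cache coherence events as MMOO (Markov-Modulated On-Off) flows, the communication performance of the packets is analyzed with and without multicast support in the network-on-chip. Focusing on the propagation of multicast packet flows between adjacent branch nodes, basic stochastic network calculus theory is used to derive two analytical performance models for intermediate routing nodes, namely models relating the buffer upper bound and the end-to-end delay upper bound to the normalized processing capacity and the utilization of the node. A method for analyzing the communication performance of data dependence violation checking packets in the network-on-chip is then proposed, and finally the method is applied in simulation experiments on a traditional electrical network-on-chip.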
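     4. An on-chip optical interconnect network structure that supports efficient data dependence violation checking is proposed. The structure is based on a hybrid link-switched communication network and uses a broadcast bus and an optical token arbitration mechanism to simplify the design of the coherence protocol. Borrowing design ideas from the Corona architecture, it adds serpentine optical waveguide communication rings to the TorusNX topology and adds new optical waveguides to the optical switches, thereby building a cache coherence broadcast bus in the on-chip optical network. Wavelength-division multiplexing is used to improve the communication efficiency and bandwidth of the on-chip optical interconnect. The work focuses on the generation, passing, and regeneration of the optical arbitration tokens for the broadcast bus, and the bus arbitration adds a high-priority fast commit channel for speculative threads. Experimental results show that this on-chip optical interconnect network structure handles data dependence violation checking in thread-level speculative execution well, supports fast commit of speculative threads, makes the checking efficient, and improves the execution performance of non-speculative applications.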
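
Item 1 models the memory accesses of speculative threads with the GCRA. As a rough illustration only, the sketch below implements the standard virtual-scheduling form of the GCRA in Python, treating each memory access as one "cell". The function name, the parameter names, and the sample timestamps are assumptions made for this example and are not taken from the thesis.

```python
def gcra_conforming(arrival_times, increment, limit):
    """Classify each arrival against GCRA(increment, limit).

    arrival_times must be non-decreasing. `increment` is the nominal spacing
    between accesses and `limit` is the tolerated earliness (both hypothetical
    names, in the same time unit as the arrivals).
    """
    tat = None  # theoretical arrival time of the next access
    verdicts = []
    for t in arrival_times:
        if tat is None:
            tat = t  # the first access conforms by definition
        if t < tat - limit:
            # Too early: non-conforming, and the state is left unchanged.
            verdicts.append(False)
        else:
            # Conforming: push the theoretical arrival time forward.
            tat = max(t, tat) + increment
            verdicts.append(True)
    return verdicts


print(gcra_conforming([0, 10, 15, 16, 30], increment=10, limit=5))
# → [True, True, True, False, True]
```

A stream in which every access conforms has a long-run rate of at most one access per `increment` time units, which is the kind of arrival constraint that network calculus needs in order to bound buffer occupancy and delay.
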

English abstract

With the rapid development of very-large-scale integration technologies, many-core systems have become prevalent. Many-core designs based on a Network-on-Chip (NoC) have replaced designs based on a bus communication infrastructure. New process technologies have changed on-chip design methods and enable more cores to be integrated on a single chip. However, the efficiency of many-core systems remains unsatisfactory, and many scientific and engineering issues must be solved to improve it. Thread-Level Speculation (TLS) is a technology that can substantially boost the efficiency of many-core systems, but it also faces many challenges, including data dependence violation checking between speculative threads, NoC performance evaluation, and other design issues. This research on the key theories and design technologies of data dependence violation checking in TLS provides a foundation for such checking and has both theoretical and practical significance. The main contributions are as follows.
     1. An optimized hardware linked list implementation is proposed. By studying the operating mechanism and characteristics of a typical data dependence violation checking component in the SESC simulator, we propose an enhanced implementation of the hardware linked list used for data dependence violation checking. We then propose an analytical model for data dependence violation checking policies based on global components such as the hardware linked list. The model includes an equation relating the probability of thread restart to the memory access frequency, the number of cores in the system, and the probability that speculative threads share a data dependence. We also derive the delay and backlog bounds of the global components used for data dependence violation checking by means of network calculus, using the GCRA (Generic Cell Rate Algorithm) function to model the memory accesses of speculative threads. Through simulation, the optimal storage configuration of the hardware linked list is determined for different speculative thread spawning policies.
     2. A memory coherence technology based on SMP for data dependence violation checking in TLS is proposed. The proposed technology uses the L1 cache coherence protocol to perform data dependence violation checking and the L2 cache to buffer the victim cache blocks caused by thread switching. The cache coherence protocol extends the MESI protocol by adding a version priority register, a write token ring, and an invalidation vector to manage RAW data dependences between speculative threads. Thread invalidation is delayed to reduce the total number of restarts. Furthermore, several speculative execution sub-modes, chosen according to the execution state of the speculative thread, are added to the L1 cache to overcome the misplaced data dependence violation checks caused by thread invalidation. By exploiting the characteristics of thread switching and memory access, the L2 cache provides a distributed-shared region to buffer victim L1 cache blocks.
     3. A theoretical method for evaluating the performance of packet communication on the NoC for data dependence violation checking is proposed. The packets used for data dependence violation checking are all triggered by cache coherence events. We abstract the cache coherence packet flows as MMOO (Markov-Modulated On-Off) flows and evaluate their performance both with and without multicast support in the NoC. Using statistical network calculus, we derive equations for the end-to-end delay and backlog of MMOO flows from the previous branching node to the current node for the two kinds of NoC, with or without broadcast/multicast support. From the simulation results, we found several drawbacks of traditional electronic no.
     4. An on-chip optical technology for data dependence violation checking in TLS is proposed. The proposed technology is based on a hybrid link-switched network and uses an optical broadcast bus and optical tokens to simplify the design of the cache coherence protocol. It mainly solves the problems of generating, transferring, and regenerating the optical arbitration tokens of the broadcast bus. Using the broadcast channel and optical tokens, we designed an optical NoC whose topology is TorusNX. By altering the optical switches, we added two lower-priority snake rings for broadcast and bus arbitration and one higher-priority snake ring for fast thread commit. The snake ring was first used in the Corona architecture. Experimental results show that the proposed technology boosts the performance of non-speculative applications and of the TLS cache coherence implementation.
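
Item 2 describes tracking RAW dependences between speculative threads with a speculation order and an invalidation vector whose squashes are applied late. The following Python sketch is a toy software model of that general idea, not the thesis's MESI-based hardware protocol. The class and method names, the use of thread IDs as the speculation order, and the policy of restarting every thread younger than the first violated one are all assumptions made for illustration.

```python
from collections import defaultdict


class RawViolationDetector:
    """Toy model. Thread IDs double as speculation order: lower = older."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.writers = defaultdict(set)         # addr -> threads holding a version
        self.exposed_reads = defaultdict(dict)  # addr -> {reader: version it consumed}
        self.invalidation_vector = [False] * num_threads

    def load(self, tid, addr):
        if tid in self.writers[addr]:
            return  # the thread reads its own version: not an exposed read
        older = [w for w in self.writers[addr] if w < tid]
        # Remember which version was consumed (-1 means committed memory).
        self.exposed_reads[addr].setdefault(tid, max(older, default=-1))

    def store(self, tid, addr):
        self.writers[addr].add(tid)
        for reader, source in self.exposed_reads[addr].items():
            # A younger reader that consumed a version older than this store
            # has violated a read-after-write dependence.
            if reader > tid and source < tid:
                self.invalidation_vector[reader] = True

    def drain_squashes(self):
        """Apply the delayed invalidations; return the threads to restart."""
        first = next((t for t, bad in enumerate(self.invalidation_vector) if bad), None)
        if first is None:
            return []
        squashed = list(range(first, self.num_threads))
        for t in squashed:
            self.invalidation_vector[t] = False
            for reads in self.exposed_reads.values():
                reads.pop(t, None)
            for versions in self.writers.values():
                versions.discard(t)
        return squashed


det = RawViolationDetector(num_threads=4)
det.load(2, 0x10)    # thread 2 reads before any older thread has written
det.store(1, 0x10)   # older thread 1 writes afterwards: RAW violation
print(det.drain_squashes())  # → [2, 3]
```

Marking the violation at store time but restarting only when `drain_squashes` is called mirrors the delayed-invalidation idea: several violations that hit the same thread are collapsed into one restart.
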

