Research on Resource Sharing Control in Simultaneous Multithreading Processors

Original title: 同时多线程处理器资源共享控制策略研究
Author: 陈红洲 (Chen Hongzhou)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: Simultaneous Multithreading ; Computer Architecture ; Resource Sharing Control ; Fetch Policy ; Dispatch Control ; Resource Partitioning ; Performance
Degree year: 2009
Advisors: 平玲娣 (Ping Lingdi) ; 潘雪增 (Pan Xuezeng)
Discipline code: 081201
Degree-granting institution: Zhejiang University
Thesis submission date: 2009-09-01
Chair of the defense committee: 石教英 (Shi Jiaoying)

Abstract

As VLSI technology continues to advance at an exponential rate, the amount of resources integrated on a processor chip will grow substantially, and using these resources efficiently becomes the key to processor performance. A simultaneous multithreading (SMT) processor exploits both thread-level and instruction-level parallelism by executing instructions from multiple threads concurrently; its fine-grained resource sharing and its ability to hide operation latency yield good performance gains. In an SMT environment, however, threads compete for shared resources more than they share them, and uncontrolled competition leads to clogging, abuse, and waste of those resources. How well resource sharing is controlled determines both processor throughput and fairness among threads. As the performance gap between processor and memory widens, the long latency of off-chip memory accesses causes increasingly severe resource clogging and abuse in SMT processors. Moreover, the resource demands of threads change with program behavior while the threads compete, so a resource sharing control policy that adapts poorly can hardly keep providing an optimized allocation. These problems make the control of resource sharing among threads in SMT processors especially important.
     To address these problems, and building on an in-depth analysis of related work, this dissertation pursues four approaches: preventing instructions dependent on long-latency loads from clogging shared resources, exploiting the concurrency of computation and memory access to hide off-chip memory latency, making resource sharing policies more adaptable to changing program behavior, and keeping control-decision computation off the critical pipeline path. A resource sharing control policy for SMT processors is proposed for each approach, and the effectiveness of each is validated through simulation. The main contributions are as follows.
     (1) To prevent instructions dependent on long-latency loads from clogging shared resources, the long-latency-load-aware dispatch policies DSTALL and DSTALLp are proposed for SMT processors. The policy makes its stall decision at the dispatch stage of the pipeline: based on detected or predicted L2 cache miss information, it decides whether to stop dispatching a thread's instructions to the instruction queue. By preventing already-fetched dependents of a long-latency load from continuing to clog resources after the load is detected, and by shortening the delay before the feedback information is used in the control decision, it reduces the negative impact of long-latency loads on resource sharing in SMT processors.
     (2) To hide the long latency of off-chip memory accesses, the resource partitioning policy ECMC, which exploits the concurrency of compute and memory operations, is proposed for SMT processors. Starting from the essential property that an SMT processor hides long operation latencies through thread-level parallelism, it periodically adjusts the partitioning of shared resources among threads according to each thread's ability to overlap compute and memory operations, and allocates more resources to threads that overlap computation with off-chip memory accesses well. This raises the fraction of execution cycles in which compute and memory operations proceed concurrently and hides the long latency of cache-missing loads more effectively.
     (3) To keep the allocation optimization process from being trapped in locally suboptimal regions, and to strengthen its ability to keep optimizing as program behavior changes, a spatially triggered dissipative resource distribution policy, SDRD, is proposed for SMT processors. In this policy, a self-organizing allocation optimization mechanism works together with chaos triggered in the allocation space; by controlling the similarity of allocation solutions, the policy can escape locally optimal solutions and provide continued optimization across the different phases of program behavior. Although throughput is its only optimization target, the policy improves both throughput and fairness.
     (4) A design model for a non-critical-path resource distributor, NCPRD, is proposed for SMT processors. The model makes the resource distribution module independent of the processor's critical pipeline path, so that the cycles spent computing allocation solutions do not impose a non-negligible performance cost on that path. The asynchronous operating mode of NCPRD is also a useful reference for implicit resource sharing control policies whose decisions take a noticeable number of cycles.
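
The first two contributions can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract gives no algorithmic detail, so every name here (`ThreadState`, `dispatchable_threads`, `partition_entries`, `overlap_cycles`, `min_share`) is hypothetical, the proportional split is an assumed stand-in for whatever rule the ECMC policy actually uses, and the per-thread minimum share is an assumption added to avoid starvation. The sketch only shows the two general ideas: gating dispatch on detected or predicted L2 misses, and periodically giving more shared entries to threads that overlap computation with off-chip accesses.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Per-thread information visible at the dispatch stage (hypothetical fields)."""
    tid: int
    l2_miss_detected: bool = False   # an L2 miss of this thread is outstanding
    l2_miss_predicted: bool = False  # a (hypothetical) predictor expects an L2 miss
    overlap_cycles: int = 0          # cycles with compute and off-chip access in flight


def dispatchable_threads(threads, use_prediction=False):
    """Return ids of threads still allowed to dispatch into the instruction queue.

    A thread is stalled when an L2 miss is detected, or (optionally) predicted.
    """
    allowed = []
    for t in threads:
        stalled = t.l2_miss_detected or (use_prediction and t.l2_miss_predicted)
        if not stalled:
            allowed.append(t.tid)
    return allowed


def partition_entries(total_entries, threads, min_share=4):
    """Split shared entries in proportion to each thread's overlap_cycles.

    Every thread keeps min_share entries (an assumed floor to avoid starvation).
    """
    n = len(threads)
    spare = total_entries - min_share * n
    if n == 0 or spare < 0:
        raise ValueError("not enough entries for the requested minimum share")
    weights = [t.overlap_cycles for t in threads]
    total_weight = sum(weights)
    if total_weight == 0:
        extra = [spare // n] * n
    else:
        extra = [spare * w // total_weight for w in weights]
    # Hand out entries lost to integer rounding, best-overlapping threads first.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    for k in range(spare - sum(extra)):
        extra[order[k % n]] += 1
    return {t.tid: min_share + e for t, e in zip(threads, extra)}
```

In a simulator, `dispatchable_threads` would be consulted every cycle, while `partition_entries` would run once per adjustment interval with counters gathered over the previous interval; the interval length is left open here because the abstract does not state one.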

