多核结构上高效的线程级推测及事务执行模型研究

英文题名：Research on the Efficient Thread-level Speculation and Transactional Execution Model on Multi-core Platform
作者：刘圆
论文级别：博士
学科专业名称：计算机系统结构
中文关键词：多核体系结构 ; 推测多线程 ; 事务存储 ; 程序剖析 ; 动态优化
英文关键词：multi-core processor architecture ; speculative multithreading ; program profile ; dynamic optimization
学位年度：2007
导师：杨寿保 ; 安虹
学科代码：081201
学位授予单位：中国科学技术大学
论文提交日期：2007-09-01

摘要

片上多核作为当今处理器设计的主流技术,需要运行多线程应用才能充分发挥性能。推测多线程方法能够简化并行编程,允许程序员或者编译器在不完全保证正确性的情况下,尝试激进的优化方式来开发和利用更多的程序并行性。实现这种方法的难点在于访存操作的局部缓存,已提出的一些推测多线程方案都使用了非常复杂的缓存机制,不光增加了硬件设计复杂度,也在一定程度上影响了应用开发的效率。实现这种技术的另一个难点是如何有效地减少误推测对并行性能的不确定性影响。为此,本文尝试采用事务存储和动态剖析技术来解决这两大难题,为多核平台寻找一种能够高效地推测并行化应用程序的软硬件协同的解决方案。
     本文围绕基于事务存储的线程级推测技术开展了深入系统的研究,涉及结构模型、编程和执行模型、动态优化方法等方面的内容。主要研究成果包括:(1)本文首先提出了一个基于事务存储的推测多线程体系结构模型SPoTM(Speculatire Parallelization on Transactional Memory)。SPoTM利用事务存储来实现线程间的读写操作隔离,提供了线程乱序执行、顺序提交、冲突检测以及推测失败后回退等功能。(2)本文还为SPoTM结构设计了一个基于循环并行的推测多线程编程模型,提供了实现该编程模型所需的推测线程系统库以及指令集扩展等。SPoTM编程模型实现简单,并行化需要的代码调整很少,对多线程并行程序设计的简化非常明显。(3)本文选取SPEC CPU 2000中的若干典型程序,在为SPoTM结构开发的模拟执行平台fastTM和sim-SPoTM上进行了详细的评测,量化分析了各种硬件机制对推测执行性能的影响,以寻找性价比较好的实现方案。本文还全面分析了在推测执行条件下Cache局部性的变化,并提出和验证了几个改善局部性的方法。(4)针对当前推测多线程优化中普遍使用的离线剖析方式受到培训输入集限制的问题,本文提出并实现了一种在运行时根据在线剖析结果自动变换推测多线程程序的动态优化方法。该方法在运行时执行剖析和优化工作,不需要单独的剖析过程以及通用的测试输入集,同时也适用于那些运行时行为特征呈阶段性变化的程序。实验表明,在指导事务划分和选择并行循环方面,动态优化方法能够达到和离线优化方法相近的效果。
     在设计评测SPoTM结构模型,开发动态软件优化系统的过程中,我们得到了一些关于如何有效利用推测多线程技术的定性结论。首先,为了提升推测执行性能,我们认为更多的努力应当投入到软件优化方面,而不是激进地调整硬件结构和执行机制。其次,推测多线程技术并不能使自动并行完全取代手工并行,这种技术可以作为手工并行的辅助工具来使用。最后,不论是手工并行还是自动并行,一个渐进的并行代码变换过程都是需要的,而在此过程中,剖析指导的优化技术起着非常关键的作用。
Multi-core architecture has become the main stream of processor designs, but to make full use of the parallel computing resources on the multi-core platforms the multi-threads application is desired. The speculative parallel threading technique has been proposed in order to simplify the parallel programming. Its distinct characteristic is to relax the constraints about sequential semantics between threads, which allowed programmers or compiler to attempt the aggressive optimizing ways even though the validity of transformations can't been guaranteed in the static compiling phase.
     It is an issue to buffer memory accesses during implementing speculative multithreading. Current speculative multithreading projects used too complicated buffering mechanism, which increased the complexity of hardware designs and impacted the efficiency of multithreaded application developments. The other problem is how to reduce the indetermination of parallel performance gains from mis-speculation. So this dissertation uses transactional memory and dynamic profiling technique to address the two problems. And the research target is to find an efficient software-hardware associative solution to speculative multithreading for multi-core platforms.
     This dissertation focuses on the implementation of the speculative technique based on transactional memory, which covers architecture model, programming model, threaded execution model, and dynamic optimizing methods. The detailed work includes the following aspects: First, a speculative multithreading architecture based on transactional memory, named as SPoTM (Speculative Parallelization on Transactional Memory), has been proposed. SPoTM isolates the load/store operations contained in different threads through the transactional, and support out-of-order execution, in-order commitment, violation detection and recovery from speculation failure. Second, a simple programming model which targets the loop parallelization has been designed, and the speculative system library and the ISA extension go along with it. It needs very few modifications to parallel sequential programs using this programming model, so this model significantly simplifies the parallel programming. Third, we have developed two simulation tools for the verification and experiments, one of which is fastTM performing the function-level simulation, the other of which is sim-SPoTM, supporting cycle-precious simulation. To evaluate the effect of various software and hardware factors on the speculative execution performance, we attempt some design choice and use several applications in SPEC CPU 2000 benchmark as test cases running on the SPoTM simulation platform. We also consider and analyze the change of Cache locality under the speculative multi-threading environment, and propose a few methods to improve the locality. Finally, an online profile guided dynamic optimization framework has been proposed on the SPoTM platform as the core component of the continuous gradual profile guided software parallel optimizing system for speculative execution. The offline profile way can't guide effectively and accurately the optimization of the program without a representative training input, but in most cases there aren't such training inputs. We attempt to adopt the online profile to extend the usage of profile in speculative optimization. The evaluation shows that the ability of this approach is comparable to the traditional offline implementation on two aspects: identifying the loops suitable to be speculatively parallelized; and performing transactional partition optimization. So we believe that this approach is able to serve as an individual guide to speculatively parallelize the applications when traditional offline profile is unavailable due to the lack of general training inputs.
     We have drawn some conclusions of the speculative parallel threading technique itself during the process of the implementation and evaluation to the SPoTM architecture. First, we think that more efforts should be devoted to the software optimization, not complicated hardware design, because we have found that even the very aggressive hardware mechanism achieved only limited performance gains. Second, it is impractical to improve the performance of most applications through automatic parallelization using speculation, and this speculative multi-threading technique can be regarded as an assistant tool for the sophisticated manual parallelization. Finally, profile technique plays a key role in gradual speculative multi-threading optimization, whether it is the offline way or the online way, but in the future the latter is more and more important because of the requirement of dynamic optimization.

引文

[1] Ananian C. Scott, Asanovic Krste, Kuszmaul Bradley C, et al. 2005. Unbounded Transactional Memory. Proceedings of the 11th International Symposium on High-Performance Computer Architecture. pp. 316-327.

    [2] Asanovic K, Bodik R., Catanzaro B.C., et al. 2006. The Landscape of Parallel Computing Research: A View from Berkeley [R]. Technical Report No. UCB/EECS-2006-183.
    [3] Austin T., Larson E., Ernst D., 2002. SimpleScalar: An Infrastructure System Modeling [J]. IEEE Computer, 35(2):59-67.
    [4] Ball T, Lams J.R. 1996. Efficient Path Profiling [C]. Proceeding of the 29th Annual IEEE/ACM International Symposium on Microarchitecture.
    [5] Barroso A. et al. 2000. Piranha: A Scalable Architecture Based on Single-chip Multiprocessing [C]. Proceedings of the 27th Annual International Symposium on Computer Architecture.
    [6] Calder B., Feller P., Eustace A. 1997. Value Profiling [C]. Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture.
    [7] Carlos Garcia Quinones, Carlos Madriles. 2005. Mitosis compiler: An Infrastructure for Speculative Threading Based on Pre-computation Slices [C]. Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation.
    [8] Chandra Krintz. 2003. Profile-based Optimizations: Coupling On-line and Off-line Profile Information to Improve Program Performance [C]. Proceedings of the International Symposium on Code Generation and Optimization.
    [9] Chen Michael, Olukotun Kunle. 2003. The JRPM System for Dynamically Parallelizing Java Programs [C]. Proceedings of the 30th Annual Symposium on Computer Architecture.
    [10] Chen, T, Lin, J., Dai, X., et al. 2004. Data Dependence Profiling for Speculative Optimization [C]. Proceedings of the 13th International Conference on Compiler Construction (CC), pages 57-72.

    [11] Clabes J. et al., 2004. Design and Implementation of the Power5 Microprocessor [C]. In ISSCC Digest of Technical Papers.

    [12] Consel C, Lawall J. L, Meur A. 2004. A Tour of Tempo: A Program Specializer for the C Language [J]. Sci. Comput. Program. 52,1-3, 341-370.
    [13] Diefendorff K. 1999. Power4 Focuses on Memory Bandwidth [J]. Microprocessor Report.
    [14] Du Zhao-Hui, Lim Chu-Cheow, Li Xiao-Feng, et al. 2004. A Cost-Driven Compilation Framework for Speculative Parallelizing Sequential Program [C]. Proceedings of ACM Conference on Programming Languages, Design, and Implementation.
    [15] Eggers Susan, Emer Joel, Levy Henry, et al. 1997. Simultaneous Multithreading: A Platform for Next-generation Processors [J]. IEEE Micro, (September/October).
    [16] Fisher J.A., Freudenberger S.M. 1992. Predicating Conditional Branch Directions from Previous Runs of a program [C]. Proceedings of the 5th International Conference on Architecture Support for Programming Languages and Operating System.
    [17] Gupta R., Mehofer E., et al. 2002. Profile Guided Compiler Optimizations [M]. The Compiler Design Handbook: Optimizations & Machine Code Generation, Auerbach Publications.
    [18] Grohoski Greg, 1998. Reining in Complexity. IEEE Computer Magazine [J], Vol. 31, Issue 1,41-42.
    [19] Hammond L, Willey M., Olukotun K. 1998. Data Speculation Support for a Chip Multiprocessor [C]. ASPLOS-VIII, 32-33, issue 5.
    [20] Hammond L, Hubbert B., Siu M., et al. 2000. The Stanford Hydra CMP [J]. IEEE MICRO Magazine.
    [21] Hammond L, Wong V, Chen M., 2004. Transactional Memory Coherence and Consistency [C], Proceedings of the 31st Annual International Symposium on Computer Architecture.
    [22] Hammond L, Carlstrom Brian D., Wong Vicky, et al. 2004. Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software [J]. Micro's Top Picks, IEEE Micro.
    [23] Hammond L, Carlstrom B.D., Wong V, et al. 2004. Programming with Transactional Coherence and Consistency (TCC) [C], ASPLOS04.
    [24] Hennessy J.L, Patterson D.A., 2003. Computer Architecture: A Quantitative Approach [M].3rd ed. Morgan Kaufmann Publishers, Inc.
    [25] Herlihy M., Eliot J., Moss B.1992. Transactional Memory: Architectural Support for Lock-free Data Structures [R]. Technical Report, Digital Cambridge Research Lab,Cambridge, Massachusetts.
    [26] Herlihy M., Moss B., 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures [C]. Proceedings of the 20th Annual International Symposium on Computer Architecture.
    [27] Herlihy M., Luchangco V., Moir M., et al. 2003. Software Transactional Memory for Dynamic-Sized Data Structures [C]. Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing.
    [28] Ho R., Mai K., Horowitz M. 2001. The Future of Wires [C]. Proceeding IEEE, Apr. 2001, pp.490-504.
    [29] Hu S., Bhargava R., Kurian L.J., 2003. The Role of Return Value Prediction in Exploiting Speculative Method-Level Parallelism [J]. Journal of Instruction-Level Parallelism, 1-21.
    [30] Huang J., Lilja D.J. 1998. An Efficient Strategy for Developing a Simulator for a Novel Concurrent Multithreaded Processor Architecture [C]. Proceedings of the 6th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.
    [31] Intel Corporation. 2002. Intel Itanium 2 Processor Reference Manual for Software Development and Optimization [R].
    [32] Kistler T., Franz M. 2003. Continuous Program Optimization: A Case Study [J]. ACM Trans. on Programming. Languages and Systems.
    [33] Kozyrakis Christoforos, Patterson David, 1998. A New Direction for Computer Architecture Research [J], IEEE Computer Magazine, 24-32.
    [34] Krishnan V., Torrellas J., 1997. Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading [C], Workshop on Mixing Logic and DRAM: Chips that Compute and Remember.
    [35] Krishnan V, Torrellas J., 1998. Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor [C], International Conference on Supercomputing (ICS).
    [36] Krishnan V., Torrellas J., 1999. A Chip Multiprocessor Architecture with Speculative Multithreading [J], IEEE Transactions on Computers, Special Issue on Multithreaded Architecture.
    [37] Lam Monica, et al. 1994. The SUIF1 Compiler Infrastructure [EB].http://suif.stanford.edu/.
    [38] Li Xiao-Feng, Du Zhao-Hui, Yang Chen, et al. 2004. Speculative Parallel Threading Architecture and Compilation [C]. The 9th Asia-Pacific Computer Systems Architecture Conference.
    [39] Li Xiao-Feng, Yang Chen, Du Zhao-Hui, et al. 2005. Exploiting Thread-Level Speculative Parallelism with Software Value Prediction [C]. The Tenth Asia-Pacific Computer Systems Architecture Conference.
    [40] Li Zhao, Ravi Iyer, Srihari Makineni, et al. 2007. Performance, Area and Bandwidth Implications on Large-scale CMP Cache Design [C]. The Workshop on Chip-Multiprocessor Memory Systems and Interconnects (CMP-MSI) held along with International Symposium on High-Performance Computer Architecture (HPCA-13), Phoenix, Arizona.
    [41] Luk C, Cohn R., Muth R., et al. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation [C]. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.
    [42] Makholm H. 1999. Specializing C - An Introduction to the Principles behind C-mix/ii [R].Technical report, DIKU Copenhagen.
    [43] Marathe V. J., Scott M. L, 2004. A Qualitative Survey of Modern Software Transactional Memory Systems [R]. Technical Report TR 839, Department of Computer Science,University of Rochester.
    [44] Marcuello P., Gonzalez A. 2002. Thread Spawning Schemes for Speculative Multithreaded Architecture [C]. Proceedings of the 8th International Symposium on High-Performance Computer Architecture.
    [45] Marcuello P., Gonzalez A. 1999. Value Prediction for Speculative Multithreaded Architectures [C]. Proceeding of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture.
    [46] Marr D.T., Binns E, Hill D.L et al. 2002. Hyper-threading technology architecture and microarchitecture [J]. Intel Technology Journal.
    [47] Martin M.M.K, Sorin D.J., Beckmann B.M., et al. 2005. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset [EB]. Computer Architecture News (CAN).
    [48] Martinez J.E and Torrellas J., 2002. Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications [C], 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
    [49] McDonald Austen, Chung JaeWoong, Chafi Hassan, et al. 2005. Characterization of TCC on Chip-Multiprocessors [C]. The Fourteenth International Conference on Parallel Architectures and Compilation Techniques.
    [50] Mendelson Avi, Mandelblat Julius, Gochman Simcha, et al, 2006. CMP Implementation in Systems Based on the Intel Core Duo Processor [J], Intel Technology Journal, 10, Issue 02.
    [51] Mudge Travor, 2001. Power: a First-class Architectural Design Constraint [C], In Proceeding of the 7th International Conference on High Performance Computing, 52-58.
    [52] Nathan L.B., Erik GH. and Steven K.R, 2003. Network-Oriented Full-System Simulation using M5 [C]. In the sixth workshop on Computer Architecture Evaluation using Commercial Workloads.
    [53] Nayfeh B.A., Hammond L, Olukotun K., 1996. Evaluation of Design Alternatives for a Multiprocessor Microprocessor [C], Proceedings of the 23rd International Symposium on Computer Architecture.

    [54] Olukotun K, Hammond L, 2005. The Future of Microprocessors [J], QUEUE, 27-34.
    [55] Olukotun K., Hammond L, Willey M. 1999. Improving the Performance of Speculatively Parallel Applications on the Hydra CMP [C], Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece.
    [56] Oplinger J.T., Heine D.L., et al. 1999. In Search of Speculative Thread-Level Parallelism[C].Working Conference on Parallel Architectures and Compilation Techniques, 303 - 313
    [57]Ortego P.M., Sack P., 2005. SESC: SuperESCalar Simulator [EB].http://sesc.sourceforge.net/.
    [58] Part Y.N., Patel S.J., Evers M. et al. 1997. One Billion Transistors, One Uniprocessor [J],One Chip. IEEE Computer.

    [59] Patt Y.N. 1996. First Let's Get the Uniprocessor Right [J]. Microprocessor Report.
    [60] Prabhu M.K., 2005. Parallel Programming Using Thread-level Speculation [D], Stanford University.
    [61] Pugh W, 1992. A Practical Algorithm for Exact Array Dependence Analysis [J].Communciation of the ACM, 35(8):102-114.
    [62] Rajwar Ravi, Herlihy Maurice, Lai Konrad. 2005. Virtualizing Transactional Memory [C].Proceedings of the 32nd Annual International Symposium on Computer Architecture.
    [63] Rajwar Ravi, Bernstein Philip A. 2003. Atomic Transactional Execution in Hardware: A New High-Performance Abstraction for Databases [C]? The 10th International Workshop on High Performance Transaction Systems.
    [64] Rajwar Ravi, Goodman James R. 2003. Transactional Execution: Toward Reliable,High-Performance Multithreading [J]. IEEE Micro, 23(6):117-125.
    [65] Rajwar Ravi, Goodman James R.. 2002. Transactional Lock-Free Execution of Lock-Based Programs [C]. Proceedings of the Tenth Symposium on Architectural Support for Programming Languages and Operating Systems. pp. 5--17.
    [66] Ricardo E. Gonzalez, 1997. Low-power Processor Design [R]. Technical Report:CSL-TR-97-726.
    [67] Sarkar V, Hennessy J., 1986. Partitioning Parallel Programs for Macro-dataflow [C], In Conference Proceedings of the 1986 ACM Conference on Lisp and Functional Programming,192-201.
    [68] Sazeides Y, Smith J.E. 1997. The Predictability of Data Values [C]. Proceeding of the 30th Annual IEEE/ACM International Symposium on Micro-architecture.
    [69] Scott Hamilton, 1999. Taking Moore's Law into the Next Century [J]. IEEE Computer Magazine, Vol.32, No.1, 43-48.
    [70] Shavit N., Touitou D. 1995. Software Transactional Memory [C]. Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing.
    [71] Sean Lie. 2004. Hardware Support for Unbounded Transactional Memory [D]. Masters Thesis, Massachusetts Institute of Technology.
    [72] Serwood Timoth, Sair Suleyman, Calder Brad. 2003. Phase Tracking and Prediction [C].Proceeding of 30~(th) Annual International Symposium on Computer Architecture.
    [73] SPEC. 1999. SPEC CPU 2000 benchmark suite. Standard Performance Evaluation Corporation [EB]. http://www.spec.org/.
    [74] Steffan J.G, Colohan C.B., Mowry T.C., 1997. Architectural Support for Thread-Level Data Speculation [R], Technical Report CMU-CS-97-188, School of Computer Science, Carnegie Mellon University.
    [75] Steffan J.G, Mowry T.C., 1997. The Potential for Thread-Level Data Speculation in Tightly-Coupled Multiprocessors [R], Technical Report CSRI-TR-350, Computer Science Research Institute, University of Toronto.
    [76] Steffan J.G, Colohan C.B., Zhai A., et al., 2000. A Scalable Approach to Thread-Level Speculation [C], Proceedings of the 27th Annual International Symposium on Computer Architecture.
    [77] Steffan J.G., Colohan C.B., Zhai A., et al., 2002. Improving value communication for thread-level speculation[C]. High-Performance Computer Architecture, 2002. Proceedings.Eighth International Symposium, 65-75.
    [78] Steffan J.G, Colohan C.B., Zhai A., et al., 2005, The STAMPede Approach to Thread-Level Speculation[J], ACM Transactions on Computer Systems,23, issue3,253 - 300.
    [79] Triolet R., Irigoin F., Feautrier P. 1986. Direct parallelization of Call Statements [C]. Proceedings os the SIGPLAN '86 Symposium on Compiler Construction.
    [80] Tullsen D.M, Eggers S.J, Levy H.M. 1995. Simultaneous Multithreading: Maximizing On-chip Parallelism [C]. The 22nd Annual International Symposium on Computer Architecture.
    [81] Wu Y, Breternitz M., Quek J., et al. 2004. The Accuracy of Initial Prediction in Two-phase Dynamic Binary Translators [C]. Proceedings of CGO'04.
    [82] Zhai A., Colohan C.B., Steffan J.G, et al., 2002. Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads [C], Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, CA, USA.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700