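Manycore Processor Architecture with Dynamically Reconfigurable Logic Core (逻辑核动态可重构的众核处理器体系结构)

English title: Manycore Processor Architecture with Dynamically Reconfigurable Logic Core
Author: Ren Yongqing (任永青)
Thesis level: Doctoral
Discipline: Computer System Architecture
Keywords (Chinese): 众核处理器 ; 物理核 ; 逻辑核 ; 推测执行能力评估器
Keywords (English): Manycore Processor ; Physical Core ; Logic Core ; Speculative Execution Capability Estimator
Degree year: 2010
Advisors: An Hong (安虹) ; Li Guojie (李国杰)
Discipline code: 081201
Degree-granting institution: University of Science and Technology of China (中国科学技术大学)
Thesis submission date: 2010-05-01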

Abstract

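As semiconductor technology advances and Moore's Law continues to hold, the number of processor cores integrated on a single chip will keep increasing; at the same time, the pursuit of higher performance per watt and performance per unit area makes manycore architecture an inevitable choice for chip design. The abundant computing resources and efficient on-chip communication of manycore processors give throughput-oriented applications a natural performance advantage, but because individual cores become finer grained, the performance of a serial application running on a single core cannot be guaranteed. To address this problem, manycore processor architectures able to construct logic cores have attracted much attention in recent years. The basic idea is to build a coarse-grained logic core from multiple fine-grained processor cores (called physical cores), in the hope of using the abundant computing resources of a manycore design to turn the growing number of cores into performance gains for single-threaded serial applications. Existing work still lacks in-depth study of how such architectures handle communication overhead, how flexibly logic core granularity can be configured, and how applications are mapped.
     This dissertation addresses the efficient execution of serial programs on fine-grained manycore architectures, examining the execution model, microarchitecture design, and dynamic resource control; the work has academic significance and practical value for exploring manycore processor architectures with dynamically reconfigurable logic cores. The main research content and results are as follows.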
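     (1) The reconfiguration overhead of manycore processors able to construct logic cores is studied, and FTPA (Flexible Tiled Processor Architecture), a manycore architecture with dynamically reconfigurable logic cores, is proposed. FTPA adopts a dataflow-like instruction set architecture and, without changing the serial programming model, uses an execution model that combines dataflow-driven execution with thread-level speculation to exploit both the instruction-level and thread-level parallelism of single-threaded programs. To reduce the excessive cost of logic core reconfiguration, FTPA uses the on-chip routing network to divide the resources within a physical core into computing resources, which are easy to reconfigure, and shared resources, which are not, so that logic core granularity can be adjusted asynchronously at two levels and at two frequencies, giving a high degree of flexibility.
     (2) A mechanism is studied for evaluating, at run time, an application's speculative execution capability when a serial program runs under a fine-grained thread-level speculation model. Because parallelism characteristics differ markedly across the execution phases of a serial application, temporal locality can be used to guide dynamic reconfiguration of logic core granularity. The dissertation proposes a quantitative method for evaluating thread-level speculative execution capability based on the concepts of "speculative execution phase" and "speculation depth", and on that basis proposes three estimator designs that use the local history of speculation depth, its global history, and a tournament scheme. With only tens of bits of storage, they can effectively predict how the parallelism of a serial program changes and produce a useful estimate of speculation depth.
     (3) The effectiveness of using the estimator to guide dynamic logic core reconfiguration in FTPA is studied. To handle the communication overhead caused by distributed execution on a manycore architecture, the computing resources, centered on instruction windows and functional units, can form logic cores under two mapping schemes, flat and deep, to suit applications with different parallelism characteristics. The thread-level speculative execution capability estimator is used to guide dynamic logic core reconfiguration in FTPA, and performance and resource utilization are evaluated experimentally in detail for both flat and deep mapping. The results show that, compared with FTPA configurations that use fixed-granularity logic cores, dynamic reconfiguration needs only half of the physical-core computing resources to support fine-grained thread-level speculative execution effectively, with a performance loss of less than 13% and a significant improvement in resource utilization.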
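     The following conclusions can be drawn from this work:
     (1) Logic cores are an effective means of accelerating serial applications on manycore processors, but coupling fine-grained physical core resources requires efficient architectural support, such as the separation of computing and shared resources and the flat and deep mapping schemes proposed here.
     (2) Accelerating serial programs on a manycore processor with a fine-grained thread-level speculation model requires a trade-off between performance and resource utilization. Sound logic core reconfiguration must rest on an accurate understanding of the application's execution characteristics, and the thread-level speculative execution capability estimator is an effective attempt in that direction.
     The techniques used in FTPA, namely the separation of computing and shared resources, flat and deep logic core reconfiguration, and the thread-level speculative execution capability estimator, can be generalized and applied to other manycore architectures.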
As semiconductor technology evolves and Moore's Law continues to hold, the number of processor cores integrated on a single chip keeps increasing. For power and area efficiency, manycore processor architecture is an inescapable choice. With abundant computing resources and a highly efficient on-chip network, a manycore processor suits applications with throughput requirements. Because the cores integrated on a manycore chip become finer grained, however, the performance of a single-threaded application may diminish when it executes on a single core. To address this problem, manycore processors capable of constructing reconfigurable logic cores have recently attracted attention: several cores (called physical cores) are combined into a coarse-grained logic core, with the aim of efficiently translating transistor resources into performance gains for sequential programs. Little research effort has been devoted to communication overhead, logic core flexibility, and application mapping for these manycore architectures.
     Aiming at efficient execution of sequential applications on manycore processors with fine-grained cores, this dissertation studies the execution model, microarchitecture, and resource tuning in depth, and contributes academic value to manycore architectures with dynamically reconfigurable logic cores. The main content and achievements include:
     (1) Proposed the manycore processor FTPA (Flexible Tiled Processor Architecture) with dynamically reconfigurable logic cores. FTPA builds on the dataflow-like EDGE (Explicit Dataflow Graph Execution) instruction set architecture; ILP (instruction-level parallelism) and TLP (thread-level parallelism) are exploited through dataflow execution and fine-grained thread-level speculative execution without affecting the serial programming model. To reduce the overhead of logic core reconfiguration, FTPA separates the computing resources from the shared resources through the on-chip network, so that resources can be tuned at two levels and at two frequencies, which gives much more flexibility.
     (2) Designed an estimator of speculative execution capability to direct dynamic logic core reconfiguration. To achieve reasonable logic core reconfiguration, based on temporal locality and observations of execution phases, three estimators of fine-grained thread-level speculative execution capability are proposed on the concepts of speculative execution phase and depth: the local history, global history, and tournament estimators. Experimental results show that the estimator can accurately predict the trend of concurrency changes across execution phases while consuming only tens of bits of hardware storage.
     (3) Explored the effectiveness of applying the estimator of speculative execution capability to dynamic logic core reconfiguration. For different design constraints and application styles, the computing resources, including instruction windows and functional units, can form logic cores in two ways, flat and deep. The estimator is used to tune the logic core granularity of FTPA in both ways, and experimental results demonstrate that, under the direction of the estimator and compared with a fixed logic core granularity, nearly half of the resources are enough to exploit the concurrency of sequential applications, with less than 13% performance loss, which means a much higher resource utilization ratio.
     Several conclusions are drawn from the work:
     (1) Constructing logic cores is an efficient way to execute sequential programs on manycore processors, but it requires reasonable microarchitecture support, such as the separation of computing and shared resources and the flat and deep models of application mapping.
     (2) A reasonable trade-off between performance and resource utilization must be achieved when sequential programs execute on manycore processors with fine-grained thread-level speculation. For this purpose, an accurate understanding of the concurrency variation during application execution must be obtained, and the estimator of thread-level speculative execution capability is an attractive attempt.
     The schemes in this dissertation, such as the separation of computing and shared resources, logic core construction in flat and deep manners, and the estimator of speculative execution capability, can be extended as general techniques.
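
The abstract names local history, global history, and tournament estimators of speculation depth but does not give their internals. The following is only an illustrative sketch under stated assumptions, not the dissertation's design: speculation depth is taken to be a small integer observed once per speculative execution phase, the "local" component is a last-value predictor, the "global" component is a small table indexed by the previous depth, and a 2-bit chooser arbitrates between them in the style of a tournament branch predictor. All names (`MAX_DEPTH`, `TournamentEstimator`, `cores_for_depth`, and so on) are hypothetical, as is the rule of one physical core per unit of estimated depth.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 7  # assumption: depth is capped so that it fits in 3 bits


def clamp(depth: int) -> int:
    """Keep a depth inside the assumed 0..MAX_DEPTH range."""
    return max(0, min(MAX_DEPTH, depth))


@dataclass
class LastValueEstimator:
    """Stand-in for a 'local history' estimator: repeat the last depth."""
    last: int = 0

    def predict(self) -> int:
        return self.last

    def update(self, actual: int) -> None:
        self.last = clamp(actual)


@dataclass
class HistoryTableEstimator:
    """Stand-in for a 'global history' estimator: a table indexed by the
    previous depth that remembers which depth followed it last time."""
    prev: int = 0
    table: list = field(default_factory=lambda: [0] * (MAX_DEPTH + 1))

    def predict(self) -> int:
        return self.table[self.prev]

    def update(self, actual: int) -> None:
        actual = clamp(actual)
        self.table[self.prev] = actual
        self.prev = actual


@dataclass
class TournamentEstimator:
    """2-bit saturating chooser: 0-1 selects local, 2-3 selects global."""
    local: LastValueEstimator = field(default_factory=LastValueEstimator)
    glob: HistoryTableEstimator = field(default_factory=HistoryTableEstimator)
    chooser: int = 1

    def predict(self) -> int:
        if self.chooser >= 2:
            return self.glob.predict()
        return self.local.predict()

    def update(self, actual: int) -> None:
        actual = clamp(actual)
        local_err = abs(self.local.predict() - actual)
        global_err = abs(self.glob.predict() - actual)
        # Move the chooser toward whichever component was closer.
        if global_err < local_err:
            self.chooser = min(3, self.chooser + 1)
        elif local_err < global_err:
            self.chooser = max(0, self.chooser - 1)
        self.local.update(actual)
        self.glob.update(actual)


def cores_for_depth(depth: int, max_cores: int = 8) -> int:
    """Hypothetical sizing rule: one physical core per unit of depth,
    with at least one core and at most max_cores."""
    return max(1, min(max_cores, depth))


def run(trace):
    """Return the depth predicted before each phase of the trace."""
    est = TournamentEstimator()
    predictions = []
    for actual in trace:
        predictions.append(est.predict())
        est.update(actual)
    return predictions
```

With these assumptions the state is 3 bits for the last value, 3 bits for the previous depth, 24 bits for the table, and 2 bits for the chooser, about 32 bits in total, which is in the "tens of bits" range the abstract mentions. A call such as `cores_for_depth(est.predict())` would then choose the logic core size for the next phase.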

