FPGA-Based Modeling Methods for Chip Multiprocessors
Abstract
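The rise of chip multiprocessors brings both new opportunities and new challenges to computer architecture research. On the one hand, chip multiprocessors shift the source of microprocessor performance gains from exploiting instruction-level parallelism to exploiting thread-level and data-level parallelism. To exploit this parallelism, we must abandon the architecture of traditional single-core systems and redesign the hardware and software of the processor system, including the hardware microarchitecture, programming model, compiler, and runtime system. On the other hand, the software simulators traditionally used in single-core architecture research clearly can no longer meet the needs of this kind of hardware and software research on chip multiprocessor systems. As the number of cores grows, software simulator performance falls proportionally, which makes cycle-accurate simulation of the hardware impractical, and full-system simulation and system software research even more so. For these reasons, multi-core architecture research lacks extensive experimental evaluation and comprehensive, effective guidance, and software simulators have become its bottleneck. New processor simulation tools are therefore key to conducting chip multiprocessor architecture research effectively. The inherent parallelism of FPGAs gives them high simulation performance and high scalability when simulating chip multiprocessors, making them an ideal simulation platform for multi-core architecture research.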
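     This dissertation studies FPGA-based modeling methods for chip multiprocessors. The main research content and results are as follows. (1) We study functional models, performance models, and prototypes of processors, and propose a processor performance model architecture that separates function from timing. The functional partition only carries out the processor's actions, without regard to the hardware structure or the timing of those actions. The timing partition models the processor microarchitecture, controls when processor actions occur, and drives the functional partition to carry them out. Because the functional partition is independent of the microarchitecture, the same functional partition can be reused with different timing partitions and is compatible with various simulation approaches, including software simulation and cross-platform simulation. This architecture allows existing work to be reused effectively and reduces the modeling workload. (2) We study how simulator modules synchronize and, targeting the characteristics of FPGA simulation, propose a port-based performance simulation technique. The technique lets different processor modules simulate different target clock cycles at the same moment, so faster modules need not wait for slower ones, which significantly improves simulation performance. The larger the performance gap between the simulator's modules, the greater the benefit. (3) We propose using hardware-software co-simulation to control FPGA resource usage and to simplify modeling. Simulating a chip multiprocessor requires a large amount of FPGA resources; our software-implemented storage caching mechanism buffers data on the host machine, which effectively regulates FPGA resource usage. Some complex structures are ill-suited to FPGA implementation; their functionality can instead be implemented in software, which simplifies FPGA modeling. In addition, FPGA simulation is hard to debug and slow to compile, so we implement and debug modules in software, which reduces modeling difficulty and shortens compile time. (4) We study time-division multiplexing for multi-core simulation and propose a fine-grained time-multiplexing technique. It splits each module into logic and state, replicates the state once per simulated core, and reuses the logic. Fine-grained time multiplexing uses the rules within each simulator module as the unit of reuse, so that at any moment a single module can simulate several processor cores simultaneously, improving system resource utilization. (5) We analyze the performance bottlenecks of the FPGA-based simulator and propose several simulation performance optimizations, including a mechanism that collects statistics on functional-partition latency at the boundary between the functional and timing partitions, and a mechanism that collects statistics on latency between the modules of the timing partition. (6) Building on this work, we implement the RAMP-Pink simulation platform. RAMP-Pink is a multi-core processor simulation platform that provides unified support for transactional memory and speculative multithreading and uses the Alpha instruction set. We implement a mechanism for creating threads on RAMP-Pink that replaces the PThreads library and can also be used on other multi-core simulation platforms without operating system support, and we design and implement a directory-based MESI cache coherence protocol.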
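     The synchronization idea in item (2) can be illustrated with a minimal software sketch. It is not the RAMP-Pink implementation: the `Port` and `Module` classes, the `host_cost` parameter, and the buffer sizes are all hypothetical names and values chosen for illustration. The sketch assumes that modules are linked by FIFO ports with a fixed latency in target cycles, and that a module may simulate its next target cycle as soon as its input port holds a message and its output port has room. A fast producer can then run several target cycles ahead of a slow consumer while messages still arrive at the correct target cycle.

```python
from collections import deque


class Port:
    """FIFO link between two modules with a fixed latency in target cycles."""

    def __init__(self, latency, capacity):
        assert 1 <= latency <= capacity
        self.latency = latency
        self.capacity = capacity
        # Pre-filling with `latency` empty messages lets the reader simulate
        # its first cycles before the writer has produced anything.
        self.queue = deque([None] * latency)

    def can_read(self):
        return len(self.queue) > 0

    def can_write(self):
        return len(self.queue) < self.capacity


class Module:
    """Simulates one target cycle whenever its ports allow it (hypothetical)."""

    def __init__(self, host_cost, in_port=None, out_port=None):
        self.host_cost = host_cost    # host steps spent per target cycle
        self.in_port = in_port
        self.out_port = out_port
        self.model_cycle = 0          # local target cycle, not a global clock
        self.busy = 0
        self.received = []            # (receive cycle, payload) pairs

    def host_step(self):
        if self.busy > 0:             # still working on the previous cycle
            self.busy -= 1
            return
        if self.in_port is not None and not self.in_port.can_read():
            return                    # input for this cycle is not available yet
        if self.out_port is not None and not self.out_port.can_write():
            return                    # reader is too far behind, so stall
        if self.in_port is not None:
            message = self.in_port.queue.popleft()
            if message is not None:
                self.received.append((self.model_cycle, message))
        if self.out_port is not None:
            self.out_port.queue.append(self.model_cycle)  # payload = send cycle
        self.model_cycle += 1
        self.busy = self.host_cost - 1


def run(host_steps, latency=1, capacity=4):
    port = Port(latency, capacity)
    producer = Module(host_cost=1, out_port=port)   # fast module
    consumer = Module(host_cost=3, in_port=port)    # slow module
    for _ in range(host_steps):
        producer.host_step()
        consumer.host_step()
    return producer, consumer


if __name__ == "__main__":
    producer, consumer = run(12)
    print(producer.model_cycle, consumer.model_cycle)
    print(consumer.received)
```

     In this sketch the producer can lead the consumer by at most `capacity - latency` target cycles, so a deeper port buffer tolerates a larger speed difference between modules.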
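     In studying FPGA-based processor modeling and in designing and implementing the RAMP-Pink system, we reached several conclusions about hardware modeling of multi-core processors. First, the key problem in simulating chip multiprocessors in software is that the serial nature of software cannot keep up with growing core counts; a highly scalable FPGA simulation platform addresses this growth and delivers hardware-level simulation performance. Second, FPGA modeling is far more complex and takes far longer than software modeling; a simulation architecture that separates function from timing, combined with hardware-software co-simulation, effectively reduces the modeling workload and shortens the modeling cycle. Finally, multi-core simulation requires substantial FPGA resources; fine-grained time multiplexing and hardware-software co-simulation make it possible to control FPGA resource usage.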
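     The separation of function and timing can be pictured with a small software sketch. Everything in it is an illustrative assumption rather than the dissertation's design: the two-instruction toy ISA, the `FunctionalModel` and `TimingModel` classes, and the latency tables are hypothetical. The point it shows is only that one functional model can be driven by different timing models, which changes the cycle count but not the architectural result.

```python
class FunctionalModel:
    """Executes instructions; knows nothing about cycles or pipelines."""

    def __init__(self, num_regs=4):
        self.regs = [0] * num_regs

    def execute(self, instr):
        op, rd, a, b = instr
        if op == "addi":              # regs[rd] = regs[a] + immediate b
            self.regs[rd] = self.regs[a] + b
        elif op == "mul":             # regs[rd] = regs[a] * regs[b]
            self.regs[rd] = self.regs[a] * self.regs[b]
        else:
            raise ValueError(f"unknown opcode: {op}")


class TimingModel:
    """Decides when each action happens, then asks the functional model to do it."""

    def __init__(self, latencies):
        self.latencies = latencies    # opcode -> target cycles (assumed values)
        self.cycles = 0

    def run(self, functional, program):
        for instr in program:
            self.cycles += self.latencies[instr[0]]
            functional.execute(instr)
        return self.cycles


def simulate(latencies, program):
    functional = FunctionalModel()
    cycles = TimingModel(latencies).run(functional, program)
    return functional.regs, cycles


# Toy program: r1 = 3, r2 = 5, r3 = r1 * r2
PROGRAM = [("addi", 1, 0, 3), ("addi", 2, 0, 5), ("mul", 3, 1, 2)]

if __name__ == "__main__":
    print(simulate({"addi": 1, "mul": 1}, PROGRAM))   # single-cycle timing
    print(simulate({"addi": 1, "mul": 4}, PROGRAM))   # slower multiplier
```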
     The research and results in this dissertation can guide FPGA-based multi-core processor modeling and its further optimization.
With the development of chip multiprocessors, computer architecture research faces new opportunities and challenges. On the one hand, the source of microprocessor performance gains has shifted from instruction-level parallelism to thread-level and data-level parallelism. To exploit this parallelism, we must break away from traditional software and hardware frameworks and redesign them, including the microarchitecture, programming model, compiler, and runtime system. On the other hand, software simulators, traditionally used in single-core processor architecture research, cannot meet the research demands of multi-core processors. Software simulator performance falls in proportion to the number of processor cores, so such simulators cannot support cycle-accurate simulation, full-system simulation, or system software research. For these reasons, research in multi-core processor architecture lacks extensive experimental evaluation and comprehensive guidance, and the software simulator has become the bottleneck of multi-core architecture research. Adopting a new kind of simulator is therefore key to doing multi-core architecture research effectively. The inherent parallelism of FPGAs gives them better simulation performance and scalability at the hardware level, making them an ideal simulation platform.
     This dissertation focuses on methods for modeling chip multiprocessors. The major research contributions are as follows. (1) Based on a study of functional emulators, performance models, and prototypes, we propose a novel performance model framework in which function and timing are separated. The functional partition is responsible only for correctly simulating the processor's actions, without considering the microarchitecture or timing. The timing partition models the microarchitecture of the processor, determines when processor actions occur, and drives the functional partition to simulate the corresponding actions. Because it is independent of the microarchitecture, one functional partition can be reused with multiple timing partitions, and it is compatible with other simulation styles, including software simulation and cross-platform simulation. This framework reuses off-the-shelf modules and effectively preserves previous modeling work. (2) By studying how modules in a simulator synchronize, we propose a port-based synchronization technique. Port synchronization enables multiple modules in a model to simulate different model cycles at the same time. Fast modules therefore do not need to wait for slow ones, which improves system performance. The larger the speed difference between modules, the greater the benefit of ports. (3) Using software-hardware co-modeling, we propose methods for adjusting FPGA resource occupation and simplifying the modeling process. Because simulating a chip multiprocessor requires vast FPGA resources, we use a software memory buffer technique to store data on the host computer and reduce FPGA resource occupation. Complex structures are very difficult to simulate on an FPGA, so we use software-hardware co-modeling to ease this process. RTL code is complex to debug and time-consuming to compile, so software simulation can be used to reduce modeling complexity and compile time. (4) We investigate time-division multiplexing and propose a fine-grained time-multiplexing technique. We divide a module into two parts, state and logic, duplicating the state for each core and reusing the logic. Fine-grained time multiplexing takes the rule as the unit of reuse and lets one module simulate multiple cores at the same time. It also increases FPGA resource utilization. (5) We analyze the performance bottlenecks and propose several optimization techniques, including latency statistics between the functional and timing partitions and latency statistics between modules within the timing partition. (6) Based on the above research, we implement the RAMP-Pink simulation platform. RAMP-Pink supports both transactional memory and thread-level speculation. We adopt the Alpha ISA and provide a thread creation mechanism that replaces the PThreads library. This mechanism can also be used on other multi-core simulation platforms without OS support. We also design and implement a directory-based MESI cache coherence protocol.
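     The state/logic split in item (4) can be sketched in software as follows. This is an illustrative assumption, not the RAMP-Pink hardware: the `CoreState` fields, the two rules, and the scheduling order are hypothetical, and a "rule" is modeled here simply as a guarded function that fires only when its condition holds. One copy of the logic is shared, the state is replicated per simulated core, and in each host cycle the two rules of the same module work on two different cores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreState:
    """Per-core state; one copy exists for every simulated core."""
    pc: int = 0
    fetched: Optional[int] = None   # hand-off slot between the two rules
    retired: int = 0


def rule_fetch(state):
    """Shared logic: fires only if the hand-off slot is empty."""
    if state.fetched is None:
        state.fetched = state.pc
        state.pc += 4
        return True
    return False


def rule_retire(state):
    """Shared logic: fires only if the hand-off slot holds an instruction."""
    if state.fetched is not None:
        state.fetched = None
        state.retired += 1
        return True
    return False


class MultiplexedModule:
    """One copy of the rules, time-multiplexed over replicated core states."""

    def __init__(self, num_cores):
        assert num_cores >= 2       # two rules need two distinct cores per cycle
        self.states = [CoreState() for _ in range(num_cores)]
        self.host_cycle = 0

    def step(self):
        n = len(self.states)
        fetch_core = self.host_cycle % n          # rule 1 serves this core
        retire_core = (self.host_cycle - 1) % n   # rule 2 serves the previous one
        rule_fetch(self.states[fetch_core])
        rule_retire(self.states[retire_core])
        self.host_cycle += 1


def run(num_cores, host_cycles):
    module = MultiplexedModule(num_cores)
    for _ in range(host_cycles):
        module.step()
    return module.states


if __name__ == "__main__":
    for core_id, state in enumerate(run(4, 8)):
        print(core_id, state)
```

     Multiplexing at the granularity of whole modules would keep the retire logic idle while a core is being fetched; assigning each rule its own core per host cycle keeps both pieces of logic busy, which is the utilization benefit the abstract describes.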
     Through research on FPGA-based processor modeling and the implementation of the RAMP-Pink system, we reached several conclusions about hardware modeling. First, the key problem of software simulators for multi-core processors is that they do not scale as the number of cores increases; a highly scalable FPGA platform addresses the growth in core count and delivers hardware-level performance. Second, FPGA modeling is more complex and much more time-consuming than software modeling; with a function-timing partitioned modeling framework and hardware-software co-modeling, we can effectively reduce modeling effort and shorten the modeling period. Third, modeling multi-core processors requires abundant FPGA resources; fine-grained time multiplexing and software-hardware co-modeling can control the occupation of FPGA resources.
     The research and experimental results in this dissertation can guide FPGA-based simulation of multi-core processors and its further optimization.
