A Dataflow-like Driven Tiled Processor Architecture
Abstract
The development of the traditional program execution model and processor architecture is constrained by problems arising in nanometer-scale technology, such as power dissipation, wire delay, and design complexity. Tiled processor architecture is a potential solution to these challenges. It organizes computation, storage, and interconnect resources into basic tiled architectural units that are relatively simple, distributed, and reusable. A high-performance processor can be composed of many such tiles, connected by an energy-efficient, scalable on-chip network. In a tiled architecture the design of a single tile is far less complicated, and long global wires are avoided, so the plentiful, cheap transistors delivered by Moore's Law can be fully exploited to improve system performance. Tiled architectures are still at an early research stage, however, and several key problems remain to be investigated.
     This dissertation investigates tiled processor architecture from the perspectives of both the program execution model and the architecture itself. The major contributions are as follows. (1) Based on a study of dataflow-like computation models, a novel dataflow-like driven program execution model suited to tiled processor architectures is proposed. In this model, the compiler partitions a sequential program into a series of VLIBs (Very Long Instruction Blocks), each containing tens to hundreds of instructions; a VLIB is the atomic unit of instruction fetch, execution, and commit. Within a VLIB, computation is expressed in dataflow form: the dataflow graph serves as the machine language and exposes instruction-level parallelism to the hardware explicitly, eliminating the hardware cost and complexity of dynamic dependence detection. Across VLIBs, a control-flow representation is retained, which both exploits the data locality in the program and allows further thread-level speculative parallelism. (2) The design space of dataflow-like driven tiled processor architecture is explored, and the key factors affecting its performance are analyzed. First, to improve the utilization of computational resources on the tiled processor, the feasibility of aggressive speculative execution is studied from both the dataflow and control-flow perspectives, and a quantitative criterion for speculation depth is given. Second, to select a suitable interconnect structure for the tiled processor, the impact of several interconnect topologies on processor performance is evaluated. Third, to mitigate the effect of the tiled organization and its multi-hop interconnect on memory accesses, the memory access behavior of a range of applications on the tiled processor is characterized, and a data prefetching scheme for reducing memory access latency is studied.
Fourth, to capture application demands on the architecture more accurately, the behavior of applications under the dataflow-like execution model is analyzed systematically. (3) An optimized design for the tiled processor is investigated, yielding a tile design principle that fully exploits parallelism while effectively reducing communication cost: the computational capacity of a single tile is bounded by the granularity of the potential instruction-level parallelism in applications, while the local communication connectivity within a tile is increased in line with the communication locality of programs, without changing the global communication network. Experiments show that this design satisfies the applications' demand for instruction-level parallelism while effectively reducing dataflow communication latency on the critical path. (4) Based on this scheme, a dataflow-like driven tiled processor architecture called TPA-PI is designed and implemented. TPA-PI adopts the DISC-I instruction set architecture and follows the dataflow-like driven program execution model. Its tile design strikes a good balance among exploiting greater instruction-level parallelism, the limited computational capacity of a single tile, and increasingly severe wire-delay constraints, giving TPA-PI good scalability. (5) On the TPA-PI software simulation environment, the effectiveness of the dataflow-like execution model and of the TPA-PI architectural design is evaluated. The experimental results confirm the performance advantage of the dataflow-like execution model over the control-flow model, the soundness of the tile design principle, and the reasonableness of the optimized TPA-PI architecture.
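The intra-block dataflow execution described in contribution (1) can be illustrated with a minimal sketch. This is not the TPA-PI ISA or the DISC-I encoding; the block representation, operator names, and firing loop below are illustrative assumptions. It only shows the core firing rule: each instruction in a block executes as soon as all of its operands have arrived, so the parallelism is explicit in the graph rather than rediscovered by hardware dependence checking.

```python
# Illustrative sketch of dataflow-style execution inside one instruction
# block (hypothetical representation, not the DISC-I machine language).

def run_block(instrs, inputs):
    """Execute a block expressed as a dataflow graph.

    instrs: dict mapping instruction id -> (operator, [source ids])
    inputs: dict mapping input name -> value (block live-ins)
    Returns the dict of all produced values.
    """
    values = dict(inputs)      # operands produced so far
    waiting = dict(instrs)     # instructions that have not yet fired
    progress = True
    while waiting and progress:
        progress = False
        for iid, (op, srcs) in list(waiting.items()):
            # Dataflow firing rule: fire once every operand is available.
            if all(s in values for s in srcs):
                values[iid] = op(*(values[s] for s in srcs))
                del waiting[iid]
                progress = True
    return values

# A tiny block computing (a+b)*(a-b); 'add' and 'sub' have no mutual
# dependence, so they could fire in parallel on different tiles.
block = {
    "add": (lambda x, y: x + y, ["a", "b"]),
    "sub": (lambda x, y: x - y, ["a", "b"]),
    "mul": (lambda x, y: x * y, ["add", "sub"]),
}
out = run_block(block, {"a": 5, "b": 3})
print(out["mul"])  # (5+3)*(5-3) = 16
```

In a real tiled design the "values" would travel over the operand network between tiles; here a single dict stands in for that traffic.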
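The data prefetching studied in contribution (2) for hiding multi-hop memory latency can be sketched as a simple stride predictor. The table layout, confidence policy, and prefetch-one-ahead degree below are assumptions for illustration, not the scheme evaluated in the dissertation.

```python
# Hedged sketch of a stride-based data prefetcher: per-PC entries track
# the last address and observed stride, and issue a prefetch only after
# the same stride has been seen twice (a minimal confidence filter).

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confident)

    def access(self, pc, addr):
        """Record a demand access; return a prefetch address or None."""
        last, stride, confident = self.table.get(pc, (None, 0, False))
        if last is None:
            self.table[pc] = (addr, 0, False)   # first sighting of this PC
            return None
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride                # stride confirmed: prefetch ahead
        self.table[pc] = (addr, new_stride, False)
        return None

pf = StridePrefetcher()
for a in (100, 108, 116, 124):                  # unit-stride walk, 8-byte elements
    hint = pf.access(pc=0x40, addr=a)
print(hint)  # 132: the next element, predicted once the stride stabilizes
```

On a tiled processor such hints matter more than on a monolithic core, because each demand miss also pays the multi-hop network traversal that the abstract identifies as a key latency source.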
     This work leads to the following conclusions. First, in the design of a dataflow-like driven tiled processor architecture, the program execution model, the granularity of the processor cores, the on-chip interconnect model, and the characteristics of the target applications are all key factors determining performance. Second, combining a dataflow-like driven program execution model with a tiled processor can exploit the abundant on-chip computational resources effectively: dataflow-driven execution exploits instruction-level parallelism, while control-flow locality enables higher-level block- and thread-level parallelism, meeting the needs of applications with different characteristics. The research and experimental results of this dissertation can guide the design and further optimization of tiled processor architectures.
