面向科学计算的PIM体系结构技术研究
摘要
现代高性能计算机系统通常采用处理器和存储器相分离的结构设计,即以处理器为中心、通过层次式Cache和复杂存储互连网络把分离的处理器和存储器相连接。在分离结构设计中,处理器以高速度为设计目标、存储器以高集成度为设计目标,不同的目标追求导致不同的发展趋势,并且随着工艺水平的进步和处理器体系结构的发展,处理器和存储器之间的速度差距越来越大,形成“存储墙”,而在分离结构中,大量的芯片资源则用于缓解日益增大的速度差距。
     PIM技术把处理器和存储器紧密结合在一个芯片上,把分离的结构统一起来,具有高访存带宽、低访存延时、低功耗的优点,成为解决“存储墙”问题的一种有效方法。当前基于PIM技术的研究主要有:PIM微体系结构研究、PIM并行体系结构研究、PIM编程模型研究以及基于PIM的编译优化技术研究等方面,研究的出发点是为了最大限度发挥PIM高带宽、低延迟的结构特性。
     我们的研究重点集中在两个方面:一是PIM微体系结构研究,即探讨PIM中的处理器采用何种结构才能充分发挥PIM高带宽、低延迟的结构特点;另一方面是PIM并行系统相关问题的研究。文章通过对面向科学计算的PIM体系结构研究,提出并设计实现了一种具有向量处理能力的高性能PIM结构——V-PIM结构,探讨了基于V-PIM的并行系统以及面向V-PIM结构的软件优化技术。文章的创新主要体现在以下几点:
     1、提出并设计了具有向量处理能力的V-PIM结构
     V-PIM是基于向量处理逻辑的PIM结构。向量具有成熟的编程模型和强大的数据并行表达能力,PIM具有高性能的存储系统,因此向量和PIM结合比较自然。我们从基于面积的效用(Performance/Area)角度分析了基于寄存器和基于存储器结构的两种向量结构,认为存储器-存储器结构与PIM结构结合在芯片资源利用以及降低功耗等方面具有优势,因此采用了基于存储器-存储器的向量结构形式。文章描述了V-PIM的结构设计,给出了扩展的向量指令集,并通过FPGA进行了验证。
     2、提出面向V-PIM并行系统的V-Parcels通信机制
     并行系统中的通信子系统对系统的计算效率、可扩展性和适用性具有重要影响。为了减少在并行V-PIM系统中执行向量指令时的数据通信开销,我们提出了面向并行V-PIM系统的V-Parcels通信机制。其基本思想是根据向量元素分布规律,动态产生合理的计算-通信模式,在保持性能最大化的前提下,尽可能降低向量指令执行时的通信开销。V-Parcels通信机制可以把计算传递到数据所在节点而无需在节点间频繁地传递数据。
     3、基于V-PIM并行系统的线程判别模型
     V-PIM处理器通常执行时间局部性较差但空间局部性较好的线程,这类线程我们通常称为轻量级线程;而其它适合Host处理器处理的线程通常被称为重量级线程。为了提高系统整体的性能,需要利用软件的方法将这两类线程区别开来。我们设计了一种编译级的两类线程区分算法,依据线程分别在V-PIM和Host处理器上的执行性能来判别线程类型;编译识别出线程类型后,再将其调度到相应的处理器上执行,从而提升整体性能。算法实现简单,对编译器改动较少,运行效果接近实际情况,具有较好的实用性。
     4、提出一种多运算核构成的运算簇和PIM主存相结合的结构——COPE
     提出一种多运算核构成的运算簇和PIM主存相结合的结构——COPE(Composite Organization for Push Execution)。COPE具有主存向执行部件“推”数据进行执行的特点。COPE以存储器为中心,由作为存储器的PIM向运算簇“推”数据,运算簇负责执行,只具有简单的控制功能,通过片上互连网络相连。程序存储在PIM存储器中,由PIM中的处理器静态调度部分程序块在运算簇上执行,所需数据由PIM存储器提供,运算簇以数据驱动方式执行,中间结果通过寄存器通信直接传递到下一个运算簇中,无需把临时值传递到寄存器中,从而避免使用大量既影响性能又不可扩展的硬件机制。
Current high-performance computer systems usually adopt decoupled architectures in which the processor and memory are separated and connected by a hierarchy of caches and complex interconnection networks. In these processor-centric designs, the processor and memory are built with different high-volume semiconductor processes, each highly optimized for the characteristics its product demands, e.g. high switching rates for logic processes and long retention times for DRAM processes. With the progress of semiconductor manufacturing and the fast development of processor architectures, the speed of processors has far outstripped that of memories; the processor-memory gap grows ever larger, giving rise to the "memory wall". Moreover, these processor-centric designs invest a great deal of power and chip area just to bridge the widening gap between processor and main-memory speed.
     PIM (Processor-In-Memory), which merges processor and memory onto a single chip, reunites the two and offers the well-known benefits of high-bandwidth, low-latency communication between processor and memory together with reduced energy consumption. As a result, many systems based on PIM architectures have been proposed. Research mainly focuses on PIM micro-architecture, PIM parallel systems, PIM programming models, and PIM compiler optimization. A goal common to all of these efforts is to exploit the high bandwidth and low latency of the PIM architecture to the fullest.
     Our research on PIM architecture techniques stresses two aspects. One is PIM micro-architecture, that is, finding a processor architecture for the PIM that makes the best of the benefits the PIM organization supplies. The other concerns the PIM parallel system. After studying scientific-computation-oriented PIM architectures, we propose a Vector-based Processor-In-Memory (V-PIM) architecture that couples the characteristics of vector processing with those of PIM, present a parallel system based on V-PIM, and discuss software optimization techniques for it. The primary research and innovative work in this dissertation can be summarized as follows:
     1. Proposing and designing V-PIM, a Vector-based Processor-In-Memory architecture
     V-PIM is a vector-based Processor-In-Memory architecture. Vector architectures have a mature programming model and a powerful ability to express data parallelism explicitly, while PIM supplies a high-performance memory system, so uniting the two is natural. After comparing register-register and memory-memory vector architectures on a performance-per-area (performance/area) utility basis, we found that combining a memory-memory vector organization with PIM is superior to a register-register one, because it consumes less power and makes better use of on-chip resources. We therefore adopt the memory-memory vector organization in our V-PIM design. This dissertation describes the design of the V-PIM architecture, presents the extended vector instruction set, and verifies the architecture on an FPGA-based platform.
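The performance/area comparison above can be sketched as a simple model. The design points below are hypothetical numbers chosen only to illustrate the trade-off the dissertation describes, not measurements from it.

```python
# Illustrative performance/area utility model for comparing vector
# organizations inside a PIM chip. All numbers are hypothetical.

def utility(performance, area_mm2):
    """Area-based utility: achieved performance per unit of chip area."""
    return performance / area_mm2

# Hypothetical design points (relative performance, area in mm^2).
# A register-register design spends area on a large vector register file;
# a memory-memory design streams operands directly from on-chip DRAM.
reg_reg = {"performance": 1.00, "area_mm2": 12.0}  # vector RF + datapath
mem_mem = {"performance": 0.90, "area_mm2": 7.0}   # datapath only

u_rr = utility(**reg_reg)
u_mm = utility(**mem_mem)

# With PIM's high-bandwidth on-chip memory, the memory-memory form can
# trade a little peak performance for much less area, winning on utility.
assert u_mm > u_rr
print(f"reg-reg: {u_rr:.3f}/mm^2, mem-mem: {u_mm:.3f}/mm^2")
```

Under such a model, the memory-memory organization wins whenever the PIM memory system is fast enough that dropping the vector register file costs less performance than it saves in area.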
     2. Proposing the V-Parcels communication mechanism for the V-PIM parallel system
     The communication subsystem strongly affects the computing efficiency, scalability, and applicability of a parallel system. To reduce communication traffic and improve the performance of vector execution, we propose the V-Parcels communication mechanism for the V-PIM parallel system. Its main characteristic is support for transferring vector operations between V-PIM nodes. Based on an analysis of how vector elements are distributed, it dynamically generates V-Parcels communication packages that carry either data or operations, so as to localize computation, minimize communication, and maximize computing performance.
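The core V-Parcels idea, moving an operation to where the data lives instead of gathering the data, can be sketched as below. The node layout, the parcel fields, and the majority-owner policy are all illustrative assumptions, not the dissertation's exact protocol.

```python
# Sketch of a V-Parcels-style decision: given how a vector's elements are
# distributed over V-PIM nodes, either ship the operation to the node
# holding most of the elements, or gather the data locally.
from collections import Counter

def build_parcel(op, element_owner_nodes, requesting_node):
    """element_owner_nodes: the node id owning each vector element, in order."""
    counts = Counter(element_owner_nodes)
    owner, owned = counts.most_common(1)[0]
    if owned >= len(element_owner_nodes) / 2 and owner != requesting_node:
        # Most elements live remotely: push the computation to the data.
        return {"kind": "op-parcel", "op": op, "target": owner}
    # Data is mostly local (or spread thin): gather the elements instead.
    return {"kind": "data-parcel", "op": op, "target": requesting_node}

# The vector mostly resides on node 2, so the operation is sent there.
parcel = build_parcel("vadd", [2, 2, 2, 2, 0, 1], requesting_node=0)
assert parcel["kind"] == "op-parcel" and parcel["target"] == 2
```

The point of such a policy is that an op-parcel is a small fixed-size message, while gathering a long vector scales with its length.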
     3. A compile-time thread classification algorithm for the V-PIM-based architecture
     On the V-PIM-based architecture, a thread with low temporal locality that runs on the V-PIM processor is called a Light-Weight Thread (LWT), while a thread with a low cache-miss rate that runs on the host processor is called a Heavy-Weight Thread (HWT). How threads are classified directly affects system performance, so a suitable classification algorithm is needed. Based on a thread's execution performance on the V-PIM and on the host, we present a compile-time method to distinguish LWTs from HWTs. Once the compiler identifies a thread's type, it can schedule the thread onto the proper processor and thereby accelerate the whole system. The algorithm is simple to implement, requires few compiler changes, and its results approach the real situation.
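The LWT/HWT split can be sketched with a toy cost model: estimate a thread's memory time on the cache-based host and on the V-PIM processor, then run it wherever the estimate is lower. The latency constants below are hypothetical stand-ins for the dissertation's compiler analysis.

```python
# Compile-time sketch of the LWT/HWT classification. The cost model is an
# illustrative assumption: the host pays a big penalty per cache miss,
# while the V-PIM processor sees a flat latency to its on-chip DRAM.

def classify_thread(mem_refs, cache_hit_rate,
                    host_hit_cost=1, host_miss_cost=100, pim_mem_cost=5):
    """Return 'HWT' (schedule on host) or 'LWT' (schedule on V-PIM)."""
    host_time = mem_refs * (cache_hit_rate * host_hit_cost +
                            (1 - cache_hit_rate) * host_miss_cost)
    pim_time = mem_refs * pim_mem_cost
    return "HWT" if host_time <= pim_time else "LWT"

# Good temporal locality -> caches work -> keep the thread on the host.
assert classify_thread(mem_refs=1_000, cache_hit_rate=0.99) == "HWT"
# Streaming access with poor temporal locality -> push it into the PIM.
assert classify_thread(mem_refs=1_000, cache_hit_rate=0.50) == "LWT"
```

In the hypothetical numbers above, the break-even hit rate is the point where the host's average access cost equals the PIM's flat cost; anything below it becomes an LWT.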
     4. Proposing COPE, a composite organization of PIM memory and multiple execution clusters for push execution
     We present COPE (Composite Organization for Push Execution), a new PIM architecture that combines PIM memory and multiple execution clusters on one chip to overcome the power, wire-latency, and memory-wall challenges facing future teraflops chips. In the memory-centric COPE architecture, the PIMs play the role of smart memories and the execution clusters play the role of processing units: data is pushed to the clusters and executed there. The clusters carry only simple control logic and are interconnected by an on-chip operation network. As smart memory, the PIM holds both code and data and statically steers which program blocks execute on which clusters. The clusters follow a data-driven execution model: intermediate results pass directly to the next cluster through register communication, without first being written back to registers, which avoids the large hardware mechanisms that both hurt performance and limit scalability.
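The push-execution dataflow above can be caricatured as a producer-driven pipeline: the PIM memory drives operand delivery, and each cluster fires on arriving data and forwards its result straight downstream. The generator structure here is purely an illustrative analogy, not COPE's hardware design.

```python
# Toy sketch of COPE-style push execution: the PIM memory pushes operand
# blocks to a chain of execution clusters; each cluster fires when input
# arrives and forwards its result directly to the next cluster, never
# spilling intermediates back to memory.

def pim_push(data_blocks):
    """The PIM memory acting as the driver: yields operands to the clusters."""
    yield from data_blocks

def cluster(op, upstream):
    """An execution cluster: data-driven, forwards results downstream."""
    for value in upstream:
        yield op(value)

# PIM pushes raw data; two clusters form a pipeline via direct forwarding.
source = pim_push([1, 2, 3, 4])
stage1 = cluster(lambda x: x * x, source)   # first cluster: square
stage2 = cluster(lambda x: x + 1, stage1)   # second cluster: increment
assert list(stage2) == [2, 5, 10, 17]
```

The analogy to hold onto is that no stage ever "pulls" from a shared register file: values flow forward only, which is what lets COPE drop the centralized hardware structures that scale poorly.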
