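Research and Implementation of a Cache Coherence Protocol for Large-Scale SMP-Based CC-NUMA Systems

English title (as published): Research and Implementation of the Cache Coherence Protocol for the Large Scale System of the SMP-based CC-NUMA Category
Author: Pang Zhengbin (庞征斌)
Thesis level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords (translated): SMP ; CC-NUMA ; cache coherence protocol ; directory structure ; directory cache ; globally shared I/O ; descriptor ; coherent block transfer ; message passing ; shared-memory multiprocessor
English keywords: SMP ; CC-NUMA ; cache coherence protocol ; directory scheme ; directory cache ; distributed shared I/O ; coherent block data transfer ; message passing ; shared memory multiprocessor
Degree year: 2007
Advisor: Zhou Xingming (周兴铭)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2007-03-01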

Abstract

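As demand for high-performance computing grows, ever higher requirements are placed on the architecture and implementation of high-performance computers. Improving programmability, usability, and overall system efficiency has become the design goal of current high-performance computers. Distributed shared-memory multiprocessor systems, with their convenient programming environment and good scalability, have become the mainstream of high-performance computer architecture, and CC-NUMA (Cache Coherent Non-Uniform Memory Access) has become an important architecture for achieving high productivity in high-performance computing.
     Building a large-scale CC-NUMA system is constrained by many factors. Among them, the cache coherence protocol is the key factor limiting system scalability, and it also has a significant effect on system performance. Because cache coherence is complex to implement, most current CC-NUMA systems are small and have limited scalability. Many high-performance computing platforms build clusters out of CC-NUMA computers, but this seriously impairs the programmability of the overall system. It is therefore necessary to design a scalable, simple, and efficient cache coherence protocol for large-scale CC-NUMA systems.
     The main work of this thesis targets the requirements of SCCMP (Scalable Cache Coherence Multi-Processors), a new large-scale CC-NUMA system built from SMP (Symmetric Multi-Processors) nodes. It analyzes the architectural characteristics of SCCMP, designs a scalable, low-complexity, and efficient cache coherence protocol, designs a scalable directory structure, implements and optimizes the directory accesses that are tightly coupled to coherence processing, provides efficient cache-coherent message-passing communication support, and finally verifies the correctness and efficiency of the protocol. The specific work and contributions are as follows: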
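     (1) The hierarchy and structural characteristics of SCCMP were studied, and a scalable, efficient hybrid cache coherence protocol, HYSCC (HYbrid Scalable Cache Coherence), was designed and implemented. HYSCC is realized as a scalable directory protocol that incorporates features of snooping protocols. It effectively supports the coherence requirements of the two different levels inside an SCCMP system, reduces the complexity of protocol design, and keeps the protocol simple and efficient. HYSCC achieves its own efficiency through multiple-virtual-channel network transport, non-blocking concurrent processing, and a reduced set of protocol message types. HYSCC also adds a class of commands and protocol handling dedicated to dirty-data sharing inside an SMP node, which reduces the protocol-processing complexity caused by writing back dirty copies that become shared within a node and greatly simplifies the protocol design inside the SCCMP node controller.
     (2) By analyzing how distributed shared I/O accesses affect cache coherence in SCCMP, the thesis designs and implements, within HYSCC, coherence command types and protocol flows that support accesses with I/O attributes, designs and implements a hardware mechanism for maintaining I/O data coherence, and efficiently realizes concurrent access to globally shared I/O.
     (3) Scalable implementations of directory structures were studied, and Dir_5NB+CCV was designed: a hybrid representation suited to SCCMP that combines limited pointers (Dir_5NB) with a combined coarse vector, CCV (Combined Coarse Vector). This directory structure has the advantages of both pointer and bit-vector representations: it uses the sharing-information format that matches the current degree of sharing, which reasonably reduces directory storage overhead. Through its hybrid, multi-format representation, Dir_5NB+CCV reduces the imprecision of sharing information to some extent, cuts unnecessary invalidation overhead, and lends itself to high-speed hardware implementation.
     (4) To relieve data-access conflicts caused by directory accesses, a dual-bank parallel-access memory structure and a dual directory cache structure were designed to optimize directory access and processing. SCCMP does not use a separate directory memory; the dual-bank parallel-access memory lets data and the corresponding directory entry be accessed in parallel. To relieve the resulting memory-access pressure, a dual directory cache structure matching the dual-bank memory was designed and implemented, introducing a directory cache level that exploits program locality to optimize directory accesses. Experimental results confirm that the dual-bank parallel-access memory and the dual directory cache structure substantially improve performance.
     (5) To support the message-passing programming model efficiently, communication methods that effectively combine shared memory and message passing in SCCMP were studied, and a hierarchical coherent message communication model was proposed. A message-passing communication interface is provided at the SCCMP node controller level, a deadlock-free message communication protocol is implemented, and a hardware-based coherent block transfer mechanism is implemented to support efficient message-passing communication.
     (6) The logic design and protocol verification of the SCCMP node controller were completed on an FPGA implementation. Application tests such as NAS NPB were run on a four-node FPGA prototype, verifying the correctness of HYSCC. The verified SCCMP node controller was then implemented as an ASIC, and performance tests were run on a 64-node ASIC prototype. The results show that NAS NPB and other applications run correctly; that EP, SP, FT, MG, and other applications with high memory-bandwidth demands scale well on the ASIC prototype; and that, in communication tests, peak point-to-point bandwidth exceeds 1.3 GB/s while peak bandwidth in application tests exceeds 1.1 GB/s. The hardware coherent block transfer gives the NPB MPI application tests higher performance.
     (7) The results of this research apply to large-scale CC-NUMA systems based on SMP super-nodes and have been successfully applied in a key engineering project.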
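
The hybrid directory idea in item (3) can be illustrated with a minimal sketch. It is not the thesis's Dir_5NB+CCV format: the exact layout of the combined coarse vector is not given in this record, so the sketch substitutes a plain coarse vector. The node count (64, matching the ASIC prototype), the group size of 4, and all class and method names are assumptions made for illustration. Only the limit of five exact pointers follows from the "5" in Dir_5NB.

```python
from dataclasses import dataclass, field

NUM_NODES = 64     # assumed system size (the prototype in the abstract has 64 nodes)
MAX_POINTERS = 5   # the "5" in Dir_5NB: at most five exact sharer pointers
GROUP_SIZE = 4     # assumed: each coarse-vector bit stands for 4 consecutive nodes


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry: exact pointers first, coarse vector on overflow."""
    pointers: list = field(default_factory=list)  # exact sharer node ids
    coarse: int = 0                               # bit g set => group g may hold a copy
    overflowed: bool = False                      # True once the coarse format is in use

    def add_sharer(self, node: int) -> None:
        if not 0 <= node < NUM_NODES:
            raise ValueError("node id out of range")
        if self.overflowed:
            self.coarse |= 1 << (node // GROUP_SIZE)
            return
        if node in self.pointers:
            return
        if len(self.pointers) < MAX_POINTERS:
            self.pointers.append(node)
            return
        # A sixth distinct sharer arrives: convert all sharers to the coarse format.
        for n in self.pointers + [node]:
            self.coarse |= 1 << (n // GROUP_SIZE)
        self.pointers = []
        self.overflowed = True

    def invalidation_targets(self, writer: int) -> list:
        """Nodes that must receive an invalidation when `writer` wants exclusive access."""
        if not self.overflowed:
            return [n for n in self.pointers if n != writer]
        targets = []
        for group in range(NUM_NODES // GROUP_SIZE):
            if (self.coarse >> group) & 1:
                first = group * GROUP_SIZE
                targets.extend(n for n in range(first, first + GROUP_SIZE) if n != writer)
        return targets
```

With three sharers the entry stays exact, so a write by node 5 invalidates only the other two. Once a sixth sharer is recorded, the entry only knows which groups hold copies, so some nodes that never cached the block also receive invalidations. That imprecision is the overhead the abstract says the hybrid format is meant to limit.
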
With the growing demands of high-performance computing, the architecture and implementation of high-performance computers are becoming more challenging. Programmability, usability, and system performance have become the objectives when designing a high-performance computer system. The distributed shared-memory multiprocessor system, which features easy programming and good scalability, has become the main platform for high-performance computing. As a popular scalable approach, CC-NUMA (Cache Coherent Non-Uniform Memory Access) is becoming an important architecture for high productivity in high-performance computing.
     Many factors affect CC-NUMA system performance, among which the cache coherence protocol is the key to system scalability. Most existing CC-NUMA computers are small and of limited scale because cache coherence is complex to implement. CC-NUMA machines are often clustered to build high-performance computers, which results in poor programmability. It is therefore necessary to design and develop a scalable and efficient cache coherence protocol for large-scale CC-NUMA systems.
     This thesis studies the efficient implementation of the cache coherence protocol for Scalable Cache Coherence Multi-Processors (SCCMP), a large-scale SMP-based CC-NUMA system. The work includes designing an efficient, scalable cache coherence protocol according to the architectural features, designing and implementing a scalable directory scheme, implementing directory access efficiently, effectively supporting cache-coherent message-passing communication, and validating the protocol. The primary innovations of this thesis are summarized as follows:
     (ⅰ) After analyzing the hierarchy and structural features of the SCCMP system, we designed and implemented the efficient HYbrid Scalable Cache Coherence (HYSCC) protocol. By combining the advantages of snooping bus protocols with the characteristics of directory protocols, HYSCC efficiently meets the needs of the different levels of the SCCMP hierarchy and eases its own design and implementation. HYSCC ensures system scalability through our scalable directories. It achieves high efficiency through multiple virtual channels, concurrent non-blocking processing, and a compact set of message types. HYSCC also supports special messages and handling for the case in which dirty data become shared among processors within an SMP node, which reduces the complexity of writing back dirty data and simplifies protocol design within an SMP node.
     (ⅱ) We discussed the impact of distributed shared I/O accesses on cache coherence and provided special messages and coherence-handling procedures to support cache-coherent accesses with I/O attributes. Moreover, we proposed an effective method for processing I/O accesses concurrently and implemented a coherence maintenance mechanism for I/O-attribute data in the SCCMP system.
     (ⅲ) We studied feasible and scalable directory schemes and proposed the Dir_5NB+CCV directory scheme for the SCCMP system. Dir_5NB+CCV combines the modified limited-pointer directory (Dir_5NB) scheme with the combined coarse vector (CCV) directory scheme, retaining the advantages of both pointer and full-map vector representations. Its hybrid representation effectively decreases directory memory overhead and, by exploiting the strengths of both Dir_5NB and CCV, reduces the inaccuracy of sharing information, reduces superfluous invalidations, and suits an efficient hardware implementation.
     (ⅳ) We proposed a dual storage module structure and a dual directory cache (DC) structure to relieve access collisions and improve directory performance. The SCCMP system has no dedicated directory storage; instead, the dual storage module structure allows data and the corresponding directory entry to be accessed concurrently. To relieve the memory access bottleneck, we designed and implemented a dual directory cache structure that corresponds to the dual storage module structure and introduces a cache level for directory accesses. This optimizes directory access by exploiting program locality and relieves memory access pressure. Experiments show that the dual storage module and directory cache structure improve system performance greatly.
     (ⅴ) We studied effective ways to integrate the message-passing communication paradigm with shared memory in the SCCMP system. We proposed a hierarchical coherent communication model, provided a communication interface in the SCCMP node controller, and implemented a deadlock-free communication protocol and a coherent block data transfer mechanism to support multi-domain MPI communication.
     (ⅵ) We designed the SCCMP node controller and implemented an FPGA prototype for validation. The HYSCC protocol was validated on a 4-node FPGA prototype, and an ASIC chip of the SCCMP node controller was fabricated. Experiments were conducted on a 64-node ASIC system. All tested applications, including the NAS NPB benchmarks, produced correct results on the system. Memory-intensive applications such as EP, SP, FT, and MG scaled well. Communication tests showed that the maximum communication bandwidth exceeded 1.3 GB/s and that the maximum communication bandwidth achieved by applications exceeded 1.1 GB/s.
     (ⅶ) Our research results are applicable to large-scale systems of the SMP-based CC-NUMA category and have been successfully used in an important project.
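
The role of the directory cache in item (ⅳ) can be sketched in the same spirit. This is a software model under stated assumptions, not the hardware design from the thesis: the LRU replacement policy, the per-bank capacity, the use of the low block-address bit to select a bank, and the names `DirectoryCache` and `DualDirectoryCache` are all hypothetical. Directory entries are modelled as Python sets shared by reference with the in-memory directory, so the sketch needs no write-back step.

```python
from collections import OrderedDict


class DirectoryCache:
    """Hypothetical LRU cache of directory entries in front of one memory bank."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # block address -> sharer set, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, block: int, memory_directory: dict) -> set:
        if block in self.entries:
            self.entries.move_to_end(block)  # mark as most recently used
            self.hits += 1
            return self.entries[block]
        # Miss: the entry must be fetched from the directory stored in main memory.
        self.misses += 1
        entry = memory_directory.setdefault(block, set())
        self.entries[block] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return entry


class DualDirectoryCache:
    """Two independent directory caches, one per memory bank (assumed bank = low address bit)."""

    def __init__(self, capacity_per_bank: int):
        self.banks = [DirectoryCache(capacity_per_bank), DirectoryCache(capacity_per_bank)]

    def lookup(self, block: int, memory_directory: dict) -> set:
        return self.banks[block & 1].lookup(block, memory_directory)
```

Repeated references to the same few blocks hit in the cache and avoid a directory read from memory, which is the locality effect the abstract relies on. Blocks that map to different banks never evict each other.
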
