Research on Key Technologies of the Memory Architecture of CC-NUMA Systems
Abstract
Distributed shared memory (DSM) systems provide a single, system-wide address space for programming and effectively combine the advantages of traditional shared-memory multiprocessors and distributed-memory systems. Offering both good programmability and high scalability, DSM has become the hardware platform of choice for large-scale parallel high-performance computers. CC-NUMA is an effective mechanism for implementing DSM systems, but maintaining cache coherence efficiently is one of the main difficulties in building CC-NUMA systems: it determines not only the correctness of the system but also, to a great extent, its performance. Current research on cache coherence, both in China and abroad, concentrates on two aspects: the scalability of the directory organization and the efficient implementation of the coherence protocol.
     Because processors in a CC-NUMA system communicate through shared memory, memory access latency, and in particular the latency of remote accesses when the number of processors is very large, strongly affects overall system performance. Increasing memory bandwidth, reducing access latency, and narrowing the gap between remote and local access latency are therefore the keys to making a CC-NUMA system practical and usable.
     To address these problems, this dissertation focuses on building an efficient memory architecture for CC-NUMA systems. It studies the scalability of directory-based cache coherence protocols, optimization techniques for directory protocols, methods for improving memory bandwidth and reducing access latency, and simulation and verification environments for large-scale CC-NUMA systems. The main work and contributions are as follows:
     1. A scalable CC-NUMA architecture model based on SMP nodes, called SCDSM, is proposed, and an efficient, deadlock-free, directory-based cache coherence protocol is implemented on it. Within this protocol, a forced write-back (FWB) method is proposed to resolve the inconsistency between cache state and directory state that arises when a shared read hits a dirty block on the node bus, solving the difficult problem of making the directory protocol and the snooping protocol compatible (an illustrative sketch of the FWB handling follows this abstract). A local memory request direct forwarding (LMRDF) technique is proposed to eliminate the request delay caused by waiting for the bus snoop result in SMP-based CC-NUMA systems; it improves SCDSM performance by 10%-15%.
     2. A Markov model is built for the distribution of shared data in multiprocessor systems, and the sharing patterns of shared data are analyzed. The analysis concludes that the average number of cache copies of a shared datum in CC-NUMA systems is generally small (a simplified version of the argument is sketched after this abstract). This theoretical result provides useful guidance for designing more effective directory organizations.
     3. To address the problem that directory storage overhead limits the scalability of cache coherence protocols, a two-level directory organization based on a directory cache is proposed. It substantially reduces the memory required for directory information and makes the protocol implementation more scalable (see the directory-cache sketch after this abstract). Simulation of the two-level directory model shows that the execution times of the parallel benchmarks are reduced to varying degrees.
     4. The memory wall limits further improvements in system performance, and reducing memory access latency is a major challenge in memory system design. Four memory access scheduling algorithms with different constraint strengths are proposed and analyzed; the analysis shows that the greedy heuristic scheduling algorithm with bank address conflict resolution and starvation avoidance offers the best cost-performance trade-off (see the scheduling sketch after this abstract). A DDR2 memory controller using this algorithm is implemented in hardware.
     5. To simulate and verify the correctness of complex and large-scale systems more effectively, a distributed multi-node simulation and verification platform, CoSim, is proposed. To support the simulation tests and the functional verification of the cache coherence protocol, a CMCV model is proposed. On the CoSim platform, the Verilog implementation of the SCDSM system is verified comprehensively. In addition, a QSCV model, written in the Verilog hardware description language and mimicking the behavior of the Stream Copy kernel, is built to evaluate the LMRDF technique and the memory bandwidth of the SCDSM system (see the traffic-pattern sketch after this abstract).
     In summary, the key techniques and solutions above have all been applied in an engineering project; they have theoretical significance and practical reference value, and they lay a foundation for further research and engineering on efficient memory architectures for CC-NUMA systems.
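To make the forced write-back (FWB) idea in contribution 1 more concrete, here is a minimal sketch of how a node controller might handle a remote read request that hits a dirty block on the local SMP bus. The class, state names and addresses (NodeController, handle_remote_read, and so on) are illustrative assumptions, not the actual SCDSM design.

```python
# Hypothetical sketch of forced write-back (FWB) on a remote read that hits a
# dirty block on the SMP node bus.  States and names are illustrative only.

INVALID, SHARED, DIRTY = "I", "S", "D"

class NodeController:
    def __init__(self):
        self.memory = {}       # block address -> data held in local memory
        self.cache_state = {}  # block address -> (state, data) in a local cache
        self.directory = {}    # block address -> set of sharing node ids

    def handle_remote_read(self, addr, requester):
        state, data = self.cache_state.get(addr, (INVALID, None))
        if state == DIRTY:
            # FWB: force the owning cache to write the block back to memory
            # before replying, so memory, cache state and directory agree.
            self.memory[addr] = data
            self.cache_state[addr] = (SHARED, data)
        # Reply with the (now clean) memory copy and record the new sharer.
        self.directory.setdefault(addr, set()).add(requester)
        return self.memory.get(addr)

ctrl = NodeController()
ctrl.memory[0x40] = 1
ctrl.cache_state[0x40] = (DIRTY, 7)      # local cache holds a newer value
print(ctrl.handle_remote_read(0x40, 3))  # 7: the dirty data is written back first
```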
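The conclusion of contribution 2 can be illustrated with a deliberately simplified chain that is not the dissertation's exact model: assume each access to a shared block is a read with probability r and a write with probability 1-r, that a read adds at most one new cache copy, and that a write invalidates every copy except the writer's. With N the number of reads between two consecutive writes and K the number of copies just before a write,

```latex
P(N = n) \;=\; r^{\,n}\,(1-r), \qquad
E[N] \;=\; \frac{r}{1-r}, \qquad
E[K] \;\le\; 1 + E[N] \;=\; \frac{1}{1-r}.
```

For example, r = 0.8 gives E[K] <= 5; only blocks that are almost never written can accumulate many copies, which is consistent with the abstract's statement that the average copy count of shared data is small.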
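Contribution 3 pairs a small, fast directory cache with a larger backing directory so that only recently referenced blocks occupy fast directory storage. The sketch below shows the lookup path only; the capacity, the LRU eviction policy and the names (DirectoryCache, lookup) are assumptions made for illustration, not the scheme's actual organization.

```python
from collections import OrderedDict

class DirectoryCache:
    """Hypothetical first-level directory: a small LRU cache of sharing vectors."""
    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.entries = OrderedDict()   # block address -> set of sharer node ids
        self.backing = backing         # second-level (full) directory in memory

    def lookup(self, addr):
        if addr in self.entries:                  # first-level hit
            self.entries.move_to_end(addr)
            return self.entries[addr]
        sharers = self.backing.get(addr, set())   # miss: consult the second level
        self.entries[addr] = sharers
        if len(self.entries) > self.capacity:     # evict the LRU entry to level 2
            victim, vec = self.entries.popitem(last=False)
            self.backing[victim] = vec
        return sharers

backing_directory = {0x100: {0, 2}}
dc = DirectoryCache(capacity=2, backing=backing_directory)
print(dc.lookup(0x100))   # {0, 2}, fetched from the second-level directory
print(dc.lookup(0x100))   # served from the directory cache
```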
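The scheduling policy singled out in contribution 4, a greedy choice combined with bank conflict resolution and starvation avoidance, can be approximated as follows. The request format, the starvation threshold and the function name are illustrative assumptions rather than the DDR2 controller's actual design.

```python
STARVATION_LIMIT = 16   # assumed age (in scheduling rounds) before a request is forced out

def schedule(queue, busy_banks):
    """Pick the next memory request from `queue`.

    queue:      list of dicts like {"addr": ..., "bank": ..., "age": ...}
    busy_banks: set of bank ids currently servicing a request
    Returns the chosen request (removed from the queue) or None.
    """
    # Starvation avoidance: a request that has waited too long is issued as soon
    # as its bank is free, ahead of any younger request.
    starving = [r for r in queue
                if r["age"] >= STARVATION_LIMIT and r["bank"] not in busy_banks]
    if starving:
        choice = max(starving, key=lambda r: r["age"])
    else:
        # Greedy choice with bank-conflict resolution: take the oldest request
        # whose bank is idle; requests to busy banks are skipped this round.
        ready = [r for r in queue if r["bank"] not in busy_banks]
        choice = max(ready, key=lambda r: r["age"]) if ready else None
    if choice is not None:
        queue.remove(choice)
    for r in queue:
        r["age"] += 1          # everything left in the queue grows older
    return choice

q = [{"addr": 0x10, "bank": 0, "age": 3},
     {"addr": 0x20, "bank": 0, "age": 1},
     {"addr": 0x30, "bank": 1, "age": 0}]
print(schedule(q, busy_banks={0}))   # bank 0 is busy -> picks the request to bank 1
```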
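Contribution 5 evaluates memory bandwidth with a QSCV model that behaves like the Stream Copy kernel, i.e. one read and one write per array element. A rough software analogue of that traffic pattern, and of the bandwidth figure such a run yields, is sketched below; the array size, element width, cycle count and peak rate are assumed values, not SCDSM measurements.

```python
def stream_copy_traffic(n_elements, bytes_per_element=8):
    """Address trace of a Stream-Copy-like kernel: b[i] = a[i] for each i."""
    a_base, b_base = 0x1000_0000, 0x2000_0000
    trace = []
    for i in range(n_elements):
        trace.append(("read",  a_base + i * bytes_per_element))
        trace.append(("write", b_base + i * bytes_per_element))
    return trace

def bandwidth(trace, cycles, bytes_per_access=8, peak_bytes_per_cycle=16):
    """Achieved bytes per cycle and fraction of an assumed peak rate."""
    moved = len(trace) * bytes_per_access
    achieved = moved / cycles
    return achieved, achieved / peak_bytes_per_cycle

trace = stream_copy_traffic(1024)
print(bandwidth(trace, cycles=2048))   # (8.0, 0.5): half of the assumed peak
```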
