Research on High Performance Cache and Memory System

Title: Research on High Performance Cache and Memory System (高性能存储系统研究)
Author: Huan Dandan (郇丹丹)
Degree level: Doctoral
Discipline: Computer Architecture
Keywords: Godson-2 (龙芯2号); Cache; Memory system; Store miss; Stack; Fast address generation; Prefetch; Page mode control; Adaptive
Degree year: 2006
Advisors: Liu Zhiyong (刘志勇); Hu Weiwu (胡伟武)
Discipline code: 081201
Degree-granting institution: Graduate University of the Chinese Academy of Sciences (Institute of Computing Technology)
Submission date: 2006-03-01

Abstract

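As the gap between memory access speed and processor speed keeps widening, memory access performance has become the bottleneck for improving computer system performance. How to design a high-performance memory system that closes this gap has long been an active research topic in computer architecture.
     This dissertation aims to raise processor IPC and to optimize memory access latency and bandwidth. Based on an analysis of the memory access behavior of SPEC CPU2000 programs running on the Godson-2 processor, it proposes a series of memory system optimization techniques and evaluates and analyzes their performance. The main innovations and contributions are as follows:
     1. By analyzing cache store-miss behavior, a new store-miss handling policy that improves processor bandwidth utilization is proposed: the cache adaptive write allocate policy. The policy collects fully modified cache blocks in the miss queue, handles fully modified blocks with a non-write-allocate policy, and can switch adaptively to a write-allocate policy. Compared with conventional store-miss handling policies, it has a low hardware cost, avoids unnecessary data transfers, reduces cache pollution, and lowers the frequency with which the memory management queues stall. Results show that with the cache adaptive write allocate policy, the bandwidth of the STREAM benchmarks improves by 62.6% on average and the IPC of SPEC CPU2000 programs improves by 5.9% on average.
     2. By analyzing stack access behavior, a stack cache scheme is proposed: the adaptive stack cache organization with fast address calculation. The scheme separates stack accesses from data cache accesses and exploits the characteristics of stack data accesses to raise instruction-level parallelism, reduce data cache pollution, and lower the data cache miss rate. It also uses a fast address calculation policy to shorten the hit time of stack accesses. When a stack overflow occurs, the stack cache can disable itself adaptively to avoid the impact of stack switching on processor performance. A process identifier is added to the stack cache tag, so data need not be written to the lower-level memory on a process switch, which makes the scheme suitable for multi-process environments. Results for SPEC CPU2000 programs show that with this scheme 25.8% of memory access instructions can execute in parallel, the data cache miss rate falls by 9.4% on average, and IPC rises by 6.9% on average.
     3. By analyzing instruction cache and data cache miss behavior, a prefetching policy is proposed: prefetching combined with the state of the miss queue. The policy preserves the order of instruction and data accesses, which helps in extracting prefetch streams, and it separates instruction-stream prefetching from data-stream prefetching so that the two do not replace each other. The timing of prefetch issue takes the state of the miss queue into account to reduce the impact on the processor's normal access requests. A stream filtering mechanism improves prefetch accuracy and lowers the memory bandwidth that prefetching demands. Results show that with this policy the processor's average memory access latency falls by 30% and the IPC of SPEC CPU2000 programs rises by 8.3% on average.
     4. By analyzing the spatial locality of memory access addresses, a memory control policy is proposed: the processor-core-directed memory page mode control policy. Directed by the processor core, the policy dynamically and adaptively adjusts the page mode according to the spatial locality of the program's memory access addresses, combining the advantages of the open page and close page policies.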
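The prefetch gating described in contribution 3 can be illustrated with a minimal Python sketch. It is not the mechanism of the dissertation, whose details the abstract does not give. The `StreamFilter` class, the next-block stream test, the `reserve` parameter and the function `maybe_prefetch` are all hypothetical names and assumptions chosen for illustration. The sketch shows only the two ideas named in the abstract: a filter that suppresses prefetches until a stream is confirmed, and a check of miss queue occupancy before a prefetch is issued.

```python
from collections import deque


class StreamFilter:
    """Confirm a stream only after misses to two consecutive blocks (assumed rule)."""

    def __init__(self, history=8):
        self.recent = deque(maxlen=history)  # recently missed block numbers

    def confirms_stream(self, block):
        hit = (block - 1) in self.recent
        self.recent.append(block)
        return hit


def maybe_prefetch(block, stream_filter, miss_queue_used, miss_queue_size, reserve=2):
    """Return the block number to prefetch, or None if no prefetch is issued."""
    if not stream_filter.confirms_stream(block):
        return None  # filtered out: no evidence of a sequential stream yet
    if miss_queue_size - miss_queue_used <= reserve:
        return None  # queue nearly full: leave room for demand misses
    return block + 1
```

Keeping one `StreamFilter` instance for instruction misses and another for data misses would mirror the separation of instruction-stream and data-stream prefetching that the abstract mentions.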
As the processor-memory performance gap continues to grow, memory access performance has become the major bottleneck for performance improvement in modern microprocessors. Proposing new cache and memory control mechanisms and policies that narrow this gap has become a hot spot of research.
     Based on an investigation of the memory access behavior of SPEC CPU2000 benchmarks running on the Godson-2 processor, this dissertation proposes and evaluates several policies that significantly improve the performance of the cache and memory system. The proposed techniques increase memory access bandwidth and decrease access latency, so that the IPC of the processor increases. The following contributions are presented in this dissertation.
     1. A cache adaptive write allocate policy that significantly improves the bandwidth of the microprocessor is proposed by investigating cache store misses. The policy collects fully modified blocks in the miss queue. Fully modified blocks are written to lower-level memory under a non-write-allocate policy, which can switch adaptively to a write-allocate policy. Compared with other cache store-miss policies, the cache adaptive write allocate policy avoids unnecessary memory traffic, reduces cache pollution, and decreases the rate at which the memory queue is full, without increasing hardware overhead. Experimental results indicate that the policy improves memory bandwidth in the STREAM benchmarks by 62.6% on average. The performance of SPEC CPU2000 benchmarks also improves, with an average IPC speedup of 5.9%.
     2. An adaptive stack cache with fast address generation is proposed by investigating the stack access behavior of programs. It decouples stack references from other data references, improves instruction-level parallelism, reduces data cache pollution, and decreases the data cache miss ratio. The fast address generation scheme proposed here reduces stack access latency. The adaptive stack cache also avoids unnecessary memory traffic. The stack cache can be disabled adaptively when it overflows. It can also be applied to multithreaded operation by adding a thread identifier. Experimental results indicate that about 25.8% of all memory reference instructions in SPEC CPU2000 benchmarks execute in parallel with the adaptive stack cache with fast address generation. Data cache misses are reduced by 9.4% on average, and the average IPC speedup is 6.9%.
     3. A prefetching policy using miss queue information is proposed by investigating instruction cache misses and data cache misses. The prefetching policy increases the efficiency
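The adaptive write allocate idea in contribution 1 can be illustrated with a minimal Python sketch. It is not the Godson-2 implementation. The block size, the `MissQueueEntry` class, the `drain` function and the rule that an incomplete block falls back to write allocate when it leaves the miss queue are all assumptions made for illustration. The sketch shows only why a fully modified block needs no fetch from memory.

```python
BLOCK_SIZE = 32  # bytes per cache block (assumed value)


class MissQueueEntry:
    """One store-miss entry that merges stores to the same cache block."""

    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.data = bytearray(BLOCK_SIZE)
        self.written = [False] * BLOCK_SIZE  # per-byte modified mask

    def merge_store(self, offset, payload):
        """Record a store of `payload` bytes starting at `offset` in the block."""
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.written[offset + i] = True

    def fully_modified(self):
        return all(self.written)


def drain(entry, memory, cache):
    """Retire an entry; return the number of bytes moved on the memory bus."""
    if entry.fully_modified():
        # Non-write-allocate path: every byte is new, so the old block is
        # never fetched and the cache is not polluted.
        memory[entry.block_addr] = bytes(entry.data)
        return BLOCK_SIZE
    # Write-allocate fallback: fetch the old block, merge the new bytes,
    # and keep the merged block in the cache as dirty.
    old = memory.get(entry.block_addr, bytes(BLOCK_SIZE))
    merged = bytearray(old)
    for i, was_written in enumerate(entry.written):
        if was_written:
            merged[i] = entry.data[i]
    cache[entry.block_addr] = merged
    return BLOCK_SIZE  # the fetch; a later write-back costs BLOCK_SIZE more
```

For a store stream that overwrites whole blocks, the first path moves each block over the bus once, whereas write allocate moves it twice (a fetch and a later write-back). This is the kind of unnecessary traffic the abstract says the policy avoids.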
