Research on Compressed Cache Techniques for Performance Optimization
Abstract
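As semiconductor process technology advances, the speed gap between microprocessors and main memory keeps widening, so modern processors place one or more levels of cache on chip to relieve the growing pressure of memory accesses. At the same time, as chip capacity grows, multi-core and multithreaded architectures are becoming the mainstream of processor design. By exploiting thread-level parallelism, these architectures greatly raise the computational throughput of the processor, but they also pose a serious challenge to the memory-access throughput of the memory subsystem. Within a fixed chip area, designers must make a trade-off: either add cores or threads to gain compute capability, or add on-chip cache capacity to improve memory-access capability, so that the two stay balanced and neither becomes the performance bottleneck. Applying compression to the data held in on-chip caches can significantly increase the effective cache capacity, reduce cache misses, ease the tension between compute capability and memory-access pressure, and tilt the trade-off toward compute capability. Compressed caches also have a negative effect on performance, however, because they add decompression latency to the memory-access latency and thus lengthen the cache hit latency. With the goal of improving system performance, this work studies performance optimization techniques for compressed caches in depth from several angles: simplifying the compression encoding to reduce decompression latency, optimizing the compressed cache hierarchy, and improving the compressed cache replacement policy. The main results are as follows: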
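     1. The Frequent Pattern Compression (FPC) algorithm, originally applied to L2 cache compression, is simplified and its decompression flow is decomposed. On this basis, a decompression flow built on the Simple Frequent Pattern Compression (S-FPC) encoding is proposed and designed. It reduces the decompression latency of a compressed L2 cache line by one processor cycle and makes the algorithm applicable to compressing the L1 data cache. The compression effectiveness of the simplified encoding is evaluated through simulation experiments, and the hardware implementation of S-FPC is described in detail.
     2. A compressed cache hierarchy, UCCH (Unified Compressed Cache Hierarchy), based on a unified S-FPC encoding is proposed and designed. UCCH stores data in the L1 data cache and the L2 cache using the same compression encoding, which significantly increases the effective capacity of both on-chip caches. In addition, L1 data cache compression in UCCH is combined with partial cache line prefetching. This captures the main benefit of prefetching, a markedly lower cache miss rate, without the cache pollution and extra memory bandwidth demand that conventional prefetching can cause, and without an additional prefetch buffer. The UCCH design significantly improves compressed cache performance.
     3. A novel LRU-based replacement policy for compressed caches, MLRU-C (Modified LRU Replacement Policy for Compressed Cache), is proposed to improve the replacement behavior of the compressed L2 cache. MLRU-C uses the extra tag resources available in a compressed cache to build a shadow tag mechanism. This mechanism identifies several kinds of erroneous replacement decisions that conventional LRU often makes, records them in a Mistake Record Table (MRT), and then uses the table to correct LRU replacement mistakes promptly. Simulation experiments show that MLRU-C effectively improves the replacement behavior of the compressed L2 cache and reduces its miss rate.
     4. The effect of compressed caches on multithreaded processor performance is studied, and simulation experiments confirm that UCCH improves the performance of multithreaded processors. In a multithreaded processor, several concurrently running threads share the whole on-chip cache hierarchy. This disrupts data locality from the L1 data cache to the L2 cache, raises cache miss rates, and markedly increases the bandwidth pressure on the buses between L1, L2, and main memory. So although multithreading makes the processor less sensitive to memory-access latency, it makes it markedly more sensitive to the capacity and access bandwidth of the cache hierarchy. UCCH significantly increases the effective capacity of the L1 data cache and the L2 cache, and because data moves between L1, L2, and main memory directly in compressed form, it also significantly lowers the bus bandwidth demand between them. UCCH can therefore improve the memory-access performance of SMT processors.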
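The word-level pattern matching that FPC-style encodings rely on can be illustrated with a minimal sketch. The pattern set, the 3-bit tag width, and every name below are assumptions chosen for illustration; the abstract does not specify which patterns S-FPC keeps, so this is not the S-FPC encoding itself.

```python
from __future__ import annotations

PREFIX_BITS = 3  # assumed width of the per-word pattern tag


def fits_signed(word: int, bits: int) -> bool:
    """True if the 32-bit word is the sign extension of a `bits`-bit value."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def classify_word(word: int) -> tuple[str, int]:
    """Return (pattern name, payload bits) for one 32-bit word.

    The patterns are an illustrative subset, not the S-FPC pattern set.
    """
    word &= 0xFFFFFFFF
    if word == 0:
        return "zero", 0
    for name, bits in (("sext4", 4), ("sext8", 8), ("sext16", 16)):
        if fits_signed(word, bits):
            return name, bits
    byte = word & 0xFF
    if word == byte * 0x01010101:  # all four bytes are identical
        return "repeated_byte", 8
    return "uncompressed", 32


def compressed_bits(words: list[int]) -> int:
    """Total encoded size of a cache line: one tag plus payload per word."""
    return sum(PREFIX_BITS + classify_word(w)[1] for w in words)


line = [0, 1, 0xFFFFFFFF, 0x7F, 0x1234, 0x41414141, 0xDEADBEEF, 0]
print(compressed_bits(line), "bits instead of", 32 * len(line))
```

Because each word is classified independently, a hardware decoder can in principle expand the words in parallel once their offsets are known, which is why pattern-based schemes suit latency-critical caches.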
Advances in CMOS technology in recent years have widened the performance gap between processor and memory, so modern processors use one or more levels of on-chip cache to alleviate the ever-increasing pressure of memory accesses. In addition, as chip density increases, chip multiprocessors (MP) and multithreading (MT) are becoming the mainstream architectures of processor design. Both architectures can greatly improve processor performance and throughput by exploiting thread-level as well as instruction-level parallelism, but the growing memory access demand in MP/MT environments challenges the throughput of the memory subsystem. The processor designer must decide the trade-off between cores and caches within a fixed area budget so that neither becomes the performance bottleneck. Compressed cache technology can shift this trade-off and allow a design in which more on-chip area is allocated to processor cores, because on-chip cache compression increases the effective cache size without significantly increasing its area and avoids some misses. Unfortunately, cache compression also has a negative side effect: compressed cache lines have to be decompressed before the processor can use them, so storing compressed lines increases cache hit latency. This paper therefore studies compressed cache technology for performance optimization. Methods such as optimizing the compressed cache hierarchy, simplifying the compression algorithm, and improving the cache replacement policy are proposed to improve compressed cache performance. The main contributions of this paper are as follows:
     1. By simplifying the Frequent Pattern Compression (FPC) algorithm, which is used for L2 cache compression, and dividing the decompression of a compressed cache line into two stages, we propose a novel decompression process for compressed L2 cache lines based on the Simple Frequent Pattern Compression (S-FPC) algorithm. The proposed scheme reduces L2 decompression latency by 1 cycle and supports compression of L1 data cache data. We evaluate the scheme by simulation experiments and describe the hardware implementation of the compression scheme in detail.
     2. We propose a unified compressed cache hierarchy (UCCH) that uses a single compression algorithm, Simple Frequent Pattern Compression (S-FPC), in both the L1 D-cache and the L2 cache. UCCH increases the capacity of the L1 D-cache and the L2 cache without any sacrifice in L1 cache access latency. The layout of compressed data in the L1 data cache of UCCH enables partial cache line prefetching without introducing prefetch buffers or increasing cache pollution and memory traffic. Experiments show that UCCH distinctly improves performance.
     3. We propose a novel modified LRU replacement policy for compressed caches (MLRU-C). MLRU-C uses the extra tags in a compressed cache to construct a shadow tag structure, which is used to identify and record mistaken replacements made by the LRU policy. The mistaken replacements identified by the shadow tag structure are stored in a Mistake Record Table (MRT), and MLRU-C corrects later replacement decisions according to the records in the MRT. Experiments show that MLRU-C evidently decreases the miss rate of the compressed L2 cache.
     4. We propose using compressed cache technology to improve multithreaded processor performance. Because sharing the on-chip cache hierarchy between threads hurts the data locality of the L1 D-cache and the L2 cache, MT technology distinctly increases the cache miss rate and memory traffic, and the demands on cache capacity and on data bus bandwidth between cache levels increase markedly. Because UCCH increases the capacity of the L1 D-cache and the L2 cache and distinctly decreases the miss rate of both, it can alleviate the L1-L2-main memory bandwidth demand and improve the performance of MT processors.
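The shadow tag idea in contribution 3 can be illustrated with a toy model of a single cache set. The abstract does not say how MLRU-C classifies mistakes or how it applies the MRT, so everything below is an assumption made for illustration: the class name `ShadowTagSet`, the per-tag counter used as the mistake table, and the correction rule that spares a previously mis-evicted line once per recorded mistake.

```python
from collections import OrderedDict


class ShadowTagSet:
    """One cache set: LRU data ways plus tag-only shadow entries (toy model)."""

    def __init__(self, ways: int, shadow_ways: int):
        self.ways = ways
        self.shadow_ways = shadow_ways
        self.lines = OrderedDict()   # resident tags, least recently used first
        self.shadow = OrderedDict()  # tags of recently evicted lines, no data
        self.mrt = {}                # hypothetical mistake table: tag -> count

    def _pick_victim(self):
        # Assumed correction rule: walk from the LRU end and spare any line
        # with a recorded mistake, spending one record per spared line.
        for tag in self.lines:
            if self.mrt.get(tag, 0) == 0:
                return tag
            self.mrt[tag] -= 1
        return next(iter(self.lines))  # all lines were spared: plain LRU

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if tag in self.shadow:
            # Evicted and then reused soon after: record an LRU mistake.
            del self.shadow[tag]
            self.mrt[tag] = self.mrt.get(tag, 0) + 1
        if len(self.lines) >= self.ways:
            victim = self._pick_victim()
            del self.lines[victim]
            self.shadow[victim] = None
            if len(self.shadow) > self.shadow_ways:
                self.shadow.popitem(last=False)  # drop the oldest shadow tag
        self.lines[tag] = None
        return False


cache_set = ShadowTagSet(ways=2, shadow_ways=2)
hits = [cache_set.access(t) for t in "ABCADEA"]
print(hits)
```

In this trace the final access to `A` hits, whereas plain two-way LRU would have evicted `A` when `E` arrived. A real design would also need a policy for aging or bounding the table, which this sketch leaves out.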