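Research on L2 Cache Organization and Resource Management Techniques for Chip Multiprocessors

English title: L2 Cache Organization and Management for Chip Multiprocessors
Author: Yan Peixiang (晏沛湘)
Degree level: Doctoral
Discipline: Electronic Science and Technology
Chinese keywords (translated): multi-core processor; hybrid cache organization; Reuse replacement policy; capacity partitioning; capacity sharing; set balancing; reconfigurable cache organization; heterogeneous cache organization
English keywords: Chip Multi-processor; Hybrid Cache; Reuse Replacement; Capacity Partitioning; Capacity Sharing; Set Balancing; Reconfigurable Cache; Heterogeneous Cache
Degree year: 2012
Advisor: Zhang Minxuan (张民选)
Discipline code: 0809
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2012-06-19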

Abstract

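The gap between processor and memory access speeds keeps widening, so organizing and using on-chip cache resources effectively to reduce off-chip memory accesses is critical to processor performance. As multi-core processors become widespread and semiconductor technology advances, chips will integrate more cores, which puts greater pressure on L2 cache design. Current mainstream multi-core processors adopt shared or private L2 caches based on the LRU replacement policy. However, a purely shared or purely private cache design cannot effectively balance capacity against access latency. A shared cache uses resources efficiently, but global wire delay slows accesses; a private cache achieves fast accesses through data replication, but its limited capacity causes more misses. In addition, owing to factors such as set associativity and application behavior, the performance gap between LRU and the theoretically optimal replacement policy keeps growing. To address these problems, this thesis studies the organization and management of L2 cache resources in multi-core processors, proposes a variable-associativity hybrid cache model based on a global replacement policy, studies dynamic capacity partitioning and set balancing mechanisms driven by changing memory access demands, and provides low-power and scalability optimizations. The contributions of the thesis are as follows: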
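     1. A variable-associativity hybrid cache organization for CMPs, CMP-VH, is proposed. CMP-VH divides the L2 cache into an optimized private/shared organization: the tags are private, while the data is partly private and partly shared. CMP-VH performs global replacement based on the reuse information of data blocks and supports inter-core capacity partitioning to adapt to the changing memory access demands of different applications. An 8-core chip multiprocessor platform was built with the Simics simulator. Simulations of the SPLASH parallel workloads show that, for the same total capacity, the average L2 miss rate under CMP-VH is close to that of a conventional shared cache and about 23.37% lower than that of a conventional private cache.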
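     2. A capacity partitioning technique based on dynamic allocation of data entries, VH-PAD, is proposed. VH-PAD allocates resources according to each core's capacity demand and comprises three stages: initialization, repartitioning, and rollback. The initialization stage gives every core the same amount of resources. The repartitioning stage estimates capacity demand from how saturated the currently allocated capacity is and uses that estimate to guide partitioning. The rollback stage decides, based on the currently occupied capacity, whether to undo the operations of the repartitioning stage. VH-PAD adjusts capacity among cores by controlling the dynamic allocation of shared data entries. Experiments with the PARSEC benchmarks on the Simics-based platform show that, for the same total capacity, the average L2 miss rate under VH-PAD is about 41.33% lower than that of a conventional private cache.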
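     3. A probability-controlled capacity partitioning technique, VH-PS, is proposed. VH-PS allocates resources according to each core's resource utilization and uses probabilities to control each core's ability to compete for shared resources, thereby partitioning capacity among cores. VH-PS provides a performance monitoring mechanism that estimates the miss-rate gain each core would obtain from a given amount of additional capacity, and on that basis assigns each core a probability level for using shared resources. Raising the probability level of cores with large miss-rate gains and lowering it for cores with small gains reduces the total miss rate. The probability control in VH-PS can be implemented with pseudo-random numbers or with PSR ratios. Experiments with the PARSEC benchmarks on the Simics-based platform show that, for the same total capacity and relative to a conventional private cache, the average L2 miss rate is about 46.78% lower with the pseudo-random implementation of VH-PS and about 43.05% lower with the PSR-ratio implementation.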
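     4. Set balancing techniques based on tag-set saturation are proposed. Because the private tag arrays in CMP-VH limit the maximum set associativity and the maximum usable capacity, this thesis proposes two tag-set balancing mechanisms, intra-core and inter-core. Replacements in CMP-VH are divided into tag-induced and data-induced replacements, and the number of tag-induced replacements is used to estimate how saturated each set is. Highly saturated sets are allowed to use resources from the corresponding less saturated sets in the same core or in other cores. Experiments with the PARSEC benchmarks on the Simics-based platform show that, for the same total capacity and relative to the baseline CMP-VH, intra-core set balancing lowers the average L2 miss rate by about 11.04% and inter-core set balancing lowers it by about 18.94%.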
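     5. A heterogeneous variable-associativity cache, HV-Way Cache, and a heterogeneous variable-associativity hybrid cache model, CMP-VHR, are proposed. HV-Way Cache uses a heterogeneous tag array to optimize the V-Way Cache, reducing area, power, and other overheads. In addition, to meet the low-power and scalability requirements of future many-core processors, a heterogeneous variable-associativity hybrid cache model is built from heterogeneous tag arrays and a reconfigurable data array, supporting power optimization according to application demands. Experimental results show that HV-Way Cache achieves large reductions in area, power, and other overheads at a small performance cost.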
With the ever-widening processor-memory speed gap, it is essential to organize and utilize on-chip cache resources efficiently, since reducing off-chip memory accesses improves system performance. Chip multiprocessors (CMPs) are now very popular. The number of cores integrated on a single chip increases with advances in semiconductor technology, putting growing pressure on L2 cache design. Most mainstream CMPs adopt a shared or private L2 cache based on the LRU replacement strategy. However, neither a shared nor a private L2 cache can provide both large capacity and fast access. A shared L2 cache maximizes on-chip cache capacity, but its average access latency is heavily influenced by wire delays. A private L2 cache has the advantage of low access latency, but incurs more off-chip accesses than a shared L2 cache. Moreover, owing to set associativity and the diversity of applications, the performance gap between the LRU replacement strategy and the optimal replacement strategy is getting wider. To address these problems, this thesis studies the organization and management of L2 cache resources for CMPs, proposes a CMP-oriented variable-way hybrid cache based on a global replacement strategy, explores dynamic capacity partitioning and set balancing mechanisms based on run-time access demands, and provides schemes for low-power and scalable design. The innovations of this thesis are as follows:
     Firstly, a CMP-oriented Variable-way Hybrid cache (CMP-VH) is proposed. CMP-VH turns the L2 cache into an optimized private/shared organization: the tag array is private, while the data array is partly private and partly shared. Adopting a global replacement strategy based on the reuse counts of cache blocks, CMP-VH provides capacity partitioning mechanisms among cores that adapt to varying cache access demands. An 8-core CMP platform was built with the Simics simulator. Simulations of the SPLASH parallel workloads show that, for the same total capacity, CMP-VH achieves an average L2 miss rate comparable to that of a conventional shared cache organization and reduces the average L2 miss rate by 23.37% compared with a conventional private cache organization.
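As a rough illustration of what a global, reuse-count-based replacement can look like, the sketch below follows the clock-style reuse replacement used by the V-Way cache, which the thesis builds on. The abstract does not describe how CMP-VH stores or updates reuse counts, so the class name `ReuseReplacer`, the saturating counter limit `max_count`, and the sweep order are all assumptions made for illustration.

```python
class ReuseReplacer:
    """Global victim selection over all data entries (hypothetical sketch)."""

    def __init__(self, num_entries, max_count=3):
        self.counts = [0] * num_entries   # one reuse counter per data entry
        self.max_count = max_count        # assumed saturating limit
        self.ptr = 0                      # sweep pointer over the data array

    def on_hit(self, entry):
        # A hit marks the block as reused; the counter saturates.
        self.counts[entry] = min(self.counts[entry] + 1, self.max_count)

    def pick_victim(self):
        # Sweep the whole data array: an entry with a zero counter is the
        # victim; any other entry is decremented and skipped.
        while True:
            entry = self.ptr
            self.ptr = (self.ptr + 1) % len(self.counts)
            if self.counts[entry] == 0:
                return entry
            self.counts[entry] -= 1
```

Because the sweep covers every data entry rather than one set, a frequently reused block in any set outlives blocks that are never touched again, which is the property a global policy is meant to provide.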
     Secondly, a capacity partitioning mechanism based on dynamic allocation of data entries (VH-PAD) is proposed. VH-PAD assigns each core an amount of resources according to its capacity demand, and consists of three stages: initialization, repartitioning, and rollback. In the initialization stage, cache resources are allocated equally among cores. In the repartitioning stage, a new capacity allocation is assigned to each core according to the capacity demand predicted from the utilization of its currently allocated capacity. The rollback stage determines whether to cancel the operations performed in the repartitioning stage. In VH-PAD, capacity partitioning is accomplished by controlling the allocation of shared data resources. Running programs from the PARSEC benchmark suite on a Simics platform, our experiments show that, for the same total capacity, VH-PAD reduces the average L2 miss rate by 41.33% relative to a conventional private cache.
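The three stages can be pictured with a small sketch. The abstract does not give the saturation metric, the thresholds, or the amount of capacity moved per decision, so `Core.saturation`, `HIGH`, `LOW`, `STEP`, and the rollback test below are illustrative assumptions rather than the thesis's actual design.

```python
from dataclasses import dataclass

HIGH = 0.9   # assumed: a core above this saturation wants more capacity
LOW = 0.5    # assumed: a core below this saturation can give capacity up
STEP = 4     # assumed: data entries moved per repartitioning decision


@dataclass
class Core:
    quota: int = 0       # data entries the core is allowed to hold
    occupied: int = 0    # data entries the core currently holds

    def saturation(self) -> float:
        return self.occupied / self.quota if self.quota else 1.0


def initialize(cores, total_entries):
    """Initialization stage: every core gets the same number of entries."""
    share = total_entries // len(cores)
    for core in cores:
        core.quota = share


def repartition(cores):
    """Repartitioning stage: move STEP entries from the least saturated
    core to the most saturated one. Returns (donor, receiver) or None."""
    receiver = max(cores, key=Core.saturation)
    donor = min(cores, key=Core.saturation)
    if receiver.saturation() < HIGH or donor.saturation() > LOW:
        return None
    if donor.quota - STEP < donor.occupied:
        return None   # never cut a quota below what the donor holds
    donor.quota -= STEP
    receiver.quota += STEP
    return donor, receiver


def rollback(move):
    """Rollback stage: undo the move if the receiver's occupancy shows it
    did not grow into the extra entries. Returns True when undone."""
    if move is None:
        return False
    donor, receiver = move
    if receiver.occupied <= receiver.quota - STEP:
        donor.quota += STEP
        receiver.quota -= STEP
        return True
    return False
```

In this sketch the rollback decision looks only at occupied capacity, matching the abstract's description; a real controller would run these stages periodically over fixed intervals.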
     Thirdly, a probabilistically controlled capacity partitioning mechanism (VH-PS) is proposed. VH-PS allocates resources among cores based on capacity utilization, and uses probabilities to control competition for shared resources in order to accomplish capacity partitioning. Providing a monitoring scheme that evaluates the marginal gain from some extra assigned resources, VH-PS assigns each core a corresponding probability level for using the shared resources. The total miss rate is reduced by upgrading the probabilities of cores with large marginal gains and downgrading the probabilities of cores with small marginal gains. Probabilistic control is implemented either with pseudo-random number generators or with a PSR scheme. Running programs from the PARSEC benchmark suite on a Simics platform, our experiments show that, for the same total capacity, VH-PS with pseudo-random number generators reduces the average L2 miss rate by 46.78% relative to a conventional private cache, and VH-PS with a PSR scheme reduces it by 43.05%.
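The pseudo-random variant of this probability control can be sketched as follows. The abstract states only that cores are given different probability levels according to their marginal gains. The level values in `LEVELS`, the one-step upgrade and downgrade rule, and the function names are assumptions for illustration, and the PSR variant is not modelled here.

```python
import random

# Assumed probability levels; the abstract only says that levels exist.
LEVELS = [0.125, 0.25, 0.5, 1.0]


def adjust_levels(levels, gains):
    """Raise the level of the core with the largest estimated miss-rate
    gain and lower the level of the core with the smallest one."""
    best = max(range(len(gains)), key=gains.__getitem__)
    worst = min(range(len(gains)), key=gains.__getitem__)
    new_levels = list(levels)
    if best != worst:
        new_levels[best] = min(new_levels[best] + 1, len(LEVELS) - 1)
        new_levels[worst] = max(new_levels[worst] - 1, 0)
    return new_levels


def may_use_shared(level, rng):
    """Pseudo-random gate: a core at this level may take a shared data
    entry with probability LEVELS[level]."""
    return rng.random() < LEVELS[level]
```

A core at the top level always wins access to shared entries, while a core at the bottom level succeeds on roughly one attempt in eight, so over time the shared capacity drifts toward the cores that benefit most from it.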
     Fourthly, two set balancing mechanisms based on the saturation levels of tag sets are proposed. To relieve the limits that the private tag array places on the maximum set associativity and on the upper bound of available capacity, intra-core and inter-core set balancing mechanisms are proposed. Replacements are classified as tag-induced or data-induced; the number of tag-induced replacements is used to evaluate set saturation levels, and over-saturated sets are allowed to use resources from other sets of the same core, or from the corresponding sets of other cores, that are not over-saturated. Running programs from the PARSEC benchmark suite on a Simics platform, our experiments show that, for the same total capacity and with CMP-VH as the baseline organization, intra-core set balancing reduces the average L2 miss rate by 11.04% and inter-core set balancing reduces it by 18.94%.
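A minimal sketch of the saturation bookkeeping is given below. It assumes that a tag-induced replacement is one forced by a full tag set and a data-induced replacement is one triggered by global data victim selection. The threshold `SATURATED`, the class `TagSetMonitor`, and the partner-selection rules are hypothetical, since the abstract does not specify them.

```python
SATURATED = 8   # assumed threshold on tag-induced replacements


class TagSetMonitor:
    """Counts tag-induced replacements per set for one core's tag array."""

    def __init__(self, num_sets):
        self.counts = [0] * num_sets

    def record_replacement(self, set_index, tag_induced):
        # Only tag-induced replacements are taken as a saturation signal;
        # data-induced ones are ignored.
        if tag_induced:
            self.counts[set_index] += 1

    def is_saturated(self, set_index):
        return self.counts[set_index] >= SATURATED


def intra_core_partner(monitor, set_index):
    """Pick the least saturated other set of the same core as a lender."""
    if not monitor.is_saturated(set_index):
        return None
    others = [i for i in range(len(monitor.counts)) if i != set_index]
    partner = min(others, key=monitor.counts.__getitem__, default=None)
    if partner is None or monitor.is_saturated(partner):
        return None
    return partner


def inter_core_partner(monitors, core, set_index):
    """Pick the other core whose corresponding set is least saturated."""
    if not monitors[core].is_saturated(set_index):
        return None
    others = [c for c in range(len(monitors)) if c != core]
    partner = min(others, key=lambda c: monitors[c].counts[set_index],
                  default=None)
    if partner is None or monitors[partner].is_saturated(set_index):
        return None
    return partner
```

Both helpers return `None` when no balancing should happen, either because the requesting set is not saturated or because every candidate lender is saturated as well.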
     Fifthly, a heterogeneous variable-way cache (HV-Way cache) and a heterogeneous variable-way hybrid cache (CMP-VHR) are proposed. Adopting a heterogeneous tag array to optimize the V-Way cache, the HV-Way cache reduces area and energy overhead. In addition, to meet the low-power and scalability demands of future many-core processors, CMP-VHR, composed of private heterogeneous tag arrays and a configurable data array, is also proposed. CMP-VHR supports power optimization according to application demands. Experimental results show that the HV-Way cache can greatly reduce area and power overhead at the expense of a small performance loss.

引文

[1] Hammond L, Nayfeh B A, Olukotun K. A Single-Chip Multiprocessor [J]. Com-puter.1997,30:79–85.
    [2] IBM. The POWER4Processor Introduction and Tuning Guide.[EB/OL].2001.http://www.redbooks.ibm.com/redbooks/pdfs/sg247041.pdf.
    [3] Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Ap-proach [M].4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,2007.
    [4] MooreGE.CrammingMoreComponentsontoIntegratedCircuits[J].Electronics.1965,38(8):114–117.
    [5] ITRS. International Technology Roadmap for Semiconductors2011EditionExecutive Summary.[EB/OL].2011. http://www.itrs.net/Links/2011ITRS/2011Chapters/2011ExecSum.pdf.
    [6] Olukotun K. Chip Multiprocessor Architecture: Techniques to Improve Through-put and Latency [M].1st ed. Morgan and Claypool Publishers,2007.
    [7]黄国睿，张平，魏广博．多核处理器的关键技术及其发展趋势[J]．计算机工程与设计．2009，10(30)：2414–2418．
    [8] Herlihy M, Moss J E B. Transactional Memory: Architectural Support for Lock-free Data Structures [C]. In Proceedings of the20th Annual International Sympo-sium on Computer Architecture. New York, NY, USA,1993:289–300.
    [9]张磊，韩银和，李晓维．面向集成电路可靠性挑战的多核处理器虚拟化技术[J]．信息技术快报．2010，8(2)：1–8．
    [10] Kapasi U J, Dally W J, Rixner S, et al. The Imagine Stream Processor [C]. In Pro-ceedings of IEEE International Conference on Computer Design: VLSI in Com-puters and Processors.2002:282–288.
    [11] Waingold E, Taylor M, Srikrishna D, et al. Baring It All to Software: Raw Ma-chines [J]. Computer.1997,30(9):86–93.
    [12] Taylor M B, Lee W, Miller J, et al. Evaluation of the Raw Microprocessor: AnExposed-Wire-Delay Architecture for ILP and Streams [J]. SIGARCH Comput.Archit. News.2004,32(2):2–13.
    [13] SankaralingamK,NagarajanR,LiuH,etal.ExploitingILP,TLP,andDLPwiththepolymorphous TRIPS architecture [J]. SIGARCH Comput. Archit. News.2003,31(2):422–433.
    [14] Burger D, Keckler S W, McKinley K S, et al. Scaling to the End of Silicon withEDGE Architectures [J]. Computer.2004,37(7):44–55.
    [15] Kongetira P, Aingaran K, Olukotun K. Niagara: A32-Way Multithreaded SparcProcessor [J]. IEEE Micro.2005,25:21–29.
    [16] Shimpi A L, Clark J, Whitehead R. AMD’s dual core Opteron&Athlon64X2-Server/Desktop Performance Preview.[EB/OL].2005. http://www.anandtech.com/show/1665.
    [17] Wasson S. Intel’s Woodcrest processor previewed.[EB/OL].2006. http://techreport.com/articles.x/10021/1.
    [18] Clark J, Whitehead R. Intel Clovertown: Quad Core for the Masses.[EB/OL].2007. http://www.anandtech.com/show/2201.
    [19] Intel. Intel Core i7processor extreme edition: Product brief.[EB/OL].http://download.intel.com/products/processor/corei7EE/extremeprodbrief:pdf.
    [20] Intel. Intel Xeon Processor7500Series Product Brief.[EB/OL]. http://www.intel.com/Assets/en_US/PDF/prodbrief/323499.pdf.
    [21] AMD. Key architectural features-AMD Phenom II processors.[EB/OL].http://www.amd.com/us/products/desktop/processors/phenomii/Pages/phenom-ii-key-architectural-features.aspx.
    [22] AMD. AMD Opteron6000Series Platform.[EB/OL]. http://www.amd.com/us/products/server/processors/6000-series-platform/Pages/6000-series-platform.aspx.
    [23] Microsystems S. UltraSPARC T2processor brochure.[EB/OL].2007. http://www.sun.com/processors/UltraSPARCT2/brochure.pdf.
    [24] Oracle. SPARC T3-1server.[EB/OL]. http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-1-ds-173098.pdf.
    [25] Kahle J A, Day M N, Hofstee H P, et al. Introduction to the cell multiprocessor [J].IBM J. Res. Dev.2005,49(4/5):589–604.
    [26]中国科学院计算技术研究所，龙芯中科技术有限公司.龙芯3A处理器用户手册.[EB/OL].2011. http://www.loongson.cn/uploadfile/file/shouce/3A_user_manual_P1_V1.7.pdf.
    [27] Bechtosheim A. Technologies for Data-Intensive Computing.[EB/OL].2009.http://www.hpts.ws/session1/bechtolsheim.pdf.
    [28] Lowney G. Why Intel is designing multi-core processors [C]. In Proceedings ofthe eighteenth annual ACM symposium on Parallelism in algorithms and architec-tures. New York, NY, USA,2006:113–113.
    [29] DixitA,HealdR,WoodA.TheImpactofNewTechnologyonSoftErrorRates[C].InProceedingsoftheInternationalConferenceonReliabilityPhysicsSymposium.2011:5B.4.1–5B.4.7.
    [30] Moore S K. Multicore is Bad News for Supercomputers [J]. IEEE Spectr.2008,45(11):15–15.
    [31] Patterson D. The Trouble With Multi-core [J]. IEEE Spectr.2010,47(7):28–32.
    [32] Smith A J. Cache Memories [J]. ACM Comput. Surv.1982,14(3):473–530.
    [33] Peir J-K, Hsu W W, Smith A J. Functional Implementation Techniques for CPUCache Memories [J]. IEEE Trans. Comput.1999,48(2):100–110.
    [34] Hill M D, Smith A J. Experimental evaluation of on-chip microprocessor cachememories [J]. SIGARCH Comput. Archit. News.1984,12(3):158–166.
    [35] ClarkDW.CachePerformanceintheVAX-11/780[J].ACMTrans.Comput.Syst.1983,1(1):24–37.
    [36] Przybylski S, Horowitz M, Hennessy J. Characteristics of performance-optimalmulti-level cache hierarchies [J]. SIGARCH Comput. Archit. News.1989,17(3):114–121.
    [37] Mutlu O, Stark J, Wilkerson C, et al. Runahead Execution: An Effective Alterna-tive to Large Instruction Windows [J]. IEEE Micro.2003,23(6):20–25.
    [38] TullsenDM,EggersSJ,LevyHM.Simultaneousmultithreading:maximizingon-chip parallelism [C]. In Proceedings of the22nd annual international symposiumon Computer architecture. New York, NY, USA,1995:392–403.
    [39] Gokhale M, Holmes B, Iobst K. Processing in Memory: The Terasys MassivelyParallel PIM Array [J]. Computer.1995,28(4):23–31.
    [40] Patterson D, Anderson T, Cardwell N, et al. A Case for Intelligent RAM [J]. IEEEMicro.1997,17(2):34–44.
    [41] Benítez D, Moure J C, Rexachs D I, et al. Adaptive L2cache for chip multipro-cessors [C]. In Proceedings of the2007Conference on Parallel Processing. Berlin,Heidelberg,2008:28–37.
    [42] Sibai F N. On the Performance Benefits of Sharing and Privatizing Second andThird-Level Cache Memories in Homogeneous Multi-core Architectures [J]. Mi-croprocess. Microsyst.2008,32(7):405–412.
    [43] Asaduzzaman A, Sibai F N, Rani M. Impact of Level-2Cache Sharing on the Per-formance and Power Requirements of Homogeneous Multicore Embedded Sys-tems [J]. Microprocess. Microsyst.2009,33(5-6):388–397.
    [44] Lee D C, Crowley P J, Baer J-L, et al. Execution Characteristics of Desktop Ap-plications on Windows NT [C]. In Proceedings of the25th Annual InternationalSymposium on Computer Architecture. Washington, DC, USA,1998:27–38.
    [45] Maynard A M G, Donnelly C M, Olszewski B R. Contrasting Characteristicsand Cache Performance of Technical and Multi-user Commercial Workloads [J].SIGOPS Oper. Syst. Rev.1994,28(5):145–156.
    [46] BarrosoLA,GharachorlooK,NowatzykA,etal.ImpactofChip-LevelIntegrationon Performance of OLTP Workloads [C]. In In The6th International Symposiumon High-Performance Computer Architecture.2000:3–14.
    [47] Huh J, Burger D, Keckler S W. Exploring the Design Space of Future CMPs [C].In Proceedings of the2001International Conference on Parallel Architectures andCompilation Techniques. Washington, DC, USA,2001:199–210.
    [48] Suh G E, Rudolph L, Devadas S. Dynamic Partitioning of Shared Cache Memo-ry [J]. J. Supercomput.2004,28(1):7–26.
    [49] FedorovaA.OperatingSystemSchedulingforChipMultithreadedProcessors[D].Cambridge, MA, USA: Harvard University,2006.
    [50] Fedorova A, Seltzer M, Smith M D. Improving Performance Isolation on ChipMultiprocessorsviaanOperatingSystemScheduler[C].InProceedingsofthe16thInternational Conference on Parallel Architecture and Compilation Techniques.Washington, DC, USA,2007:25–38.
    [51] Settle M W A. An Adaptive Chip Multiprocessor Cache Hierarchy [D]. Boulder,CO, USA: University of Colorado at Boulder,2007.
    [52] Moreto Planas M, Cazorla F, Ramirez A, et al. Explaining Dynamic Cache Parti-tioning Speed Ups [J]. IEEE Comput. Archit. Lett.2007,6(1):–.
    [53] Denning P J. Thrashing: Its Causes and Prevention [C]. In Proceedings of the De-cember9-11,1968, Fall Joint Computer Conference, Part I. New York, NY, USA,1968:915–922.
    [54] Goodman J R. Using Cache Memory to Reduce Processor-Memory Traffic [J].SIGARCH Comput. Archit. News.1983,11(3):124–131.
    [55] Katz R H, Eggers S J, Wood D A, et al. Implementing a Cache Consistency Proto-col [C]. In Proceedings of the12th Annual International Symposium on ComputerArchitecture. Los Alamitos, CA, USA,1985:276–283.
    [56] Tang C K. Cache System Design in the Tightly Coupled Multiprocessor Sys-tem [C]. In Proceedings of the AFIPS’76National Computer Conference andExposition. New York, NY, USA,1976:749–753.
    [57] Censier L M, Feautrier P. A New Solution to Coherence Problems in MulticacheSystems [J]. IEEE Trans. Comput.1978,27(12):1112–1118.
    [58] Martin M M K, Hill M D, Wood D A. Token Coherence: Decoupling Performanceand Correctness [C]. In Proceedings of the30th Annual International Symposiumon Computer Architecture. New York, NY, USA,2003:182–193.
    [59] Papamarcos M S, Patel J H. A Low-overhead Coherence Solution for Multipro-cessors with Private Cache Memories [C]. In Proceedings of the11th Annual In-ternational Symposium on Computer Architecture. New York, NY, USA,1984:348–354.
    [60] Sweazey P, Smith A J. A Class of Compatible Cache Consistency Protocols andTheirSupportbytheIEEEFuturebus[C].InProceedingsofthe13thAnnualInter-national Symposium on Computer Architecture. Los Alamitos, CA, USA,1986:414–423.
    [61] Beckmann B M, Marty M R, Wood D A. ASR: Adaptive Selective Replicationfor CMP Caches [C]. In Proceedings of the39th Annual IEEE/ACM InternationalSymposium on Microarchitecture. Washington, DC, USA,2006:443–454.
    [62] Huh J, Burger D, Keckler S W. Exploring the Design Space of Future CMPs [C].In Proceedings of the2001International Conference on Parallel Architectures andCompilation Techniques.2001:199–210.
    [63] Huh J, Kim C, Shafi H, et al. A NUCA Substrate for Flexible CMP CacheSharing [J]. IEEE Transactions on Parallel and Distributed Systems.2007,18:1028–1040.
    [64] Nayfeh B A, Olukotun K. Exploring the Design Space for a Shared-Cache Mul-tiprocessor [C]. In Proceedings of the21st Annual International Symposium onComputer Architecture. Los Alamitos, CA, USA,1994:166–175.
    [65] Zhang M, Asanovic K. Victim Replication: Maximizing Capacity while HidingWire Delay in Tiled Chip Multiprocessors [J]. SIGARCH Comput. Archit. News.2005,33(2):336–345.
    [66] Chishti Z, Powell M D, Vijaykumar T N. Optimizing Replication, Communica-tion, and Capacity Allocation in CMPs [C]. In Proceedings of the32nd Annual In-ternational Symposium on Computer Architecture. Washington, DC, USA,2005:357–368.
    [67] Chang J, Sohi G S. Cooperative Caching for Chip Multiprocessors [C]. In Pro-ceedings of the33rd Annual International Symposium on Computer Architecture.Washington, DC, USA,2006:264–276.
    [68] Dybdahl H, Stenstrom P. An Adaptive Shared/Private NUCA Cache PartitioningSchemeforChipMultiprocessors[C].InProceedingsofthe2007IEEE13thInter-national Symposium on High Performance Computer Architecture. Washington,DC, USA,2007:2–12.
    [69] Zhao L, Iyer R, Upton M, et al. Towards Hybrid Last Level Caches for Chip-multiprocessors [J]. SIGARCH Comput. Archit. News.2008,36(2):56–63.
    [70]高翔，章隆兵，胡伟武．一种基于容量复用的异构CMP Cache [J]．计算机研究与发展．2008，45(5)：877–885．
    [71] Kim H, Youn S, Kim J. Reusability-aware Cache Memory Sharing for Chip Multi-processors with Private L2Caches [J]. J. Syst. Archit.2009,55(10-12):446–456.
    [72] Liu C, Sivasubramaniam A, Kandemir M. Organizing the Last Line of Defensebefore Hitting the Memory Wall for CMPs [C]. In Proceedings of the10th Inter-national Symposium on High Performance Computer Architecture. Washington,DC, USA,2004:176–185.
    [73] Cho S, Jin L. Managing Distributed, Shared L2Caches through OS-Level PageAllocation [C]. In Proceedings of the39th Annual IEEE/ACM International Sym-posium on Microarchitecture.2006:455–468.
    [74] Beckmann B M, Wood D A. Managing Wire Delay in Large Chip-MultiprocessorCaches [C]. In Proceedings of the37th Annual IEEE/ACM International Sympo-sium on Microarchitecture. Washington, DC, USA,2004:319–330.
    [75] Chishti Z, Powell M D, Vijaykumar T N. Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures [C]. In Proceed-ings of the36th Annual IEEE/ACM International Symposium on Microarchitec-ture. Washington, DC, USA,2003:55–66.
    [76] Hardavellas N, Ferdman M, Falsafi B, et al. Reactive NUCA: Near-Optimal BlockPlacement and Replication in Distributed Caches [C]. In Proceedings of the36thAnnualInternationalSymposiumonComputerArchitecture.NewYork,NY,USA,2009:184–195.
    [77] Qureshi M K. Adaptive Spill-Receive for Robust High-Performance Caching inCMPs [C]. In International Symposium on High-Performance Computer Archi-tecture.2009:45–54.
    [78] Kim C, Burger D, Keckler S W. An adaptive, Non-uniform Cache Structure forWire-delay Dominated on-chip Caches [J].2002,36(5):211–222.
    [79] Kharbutli M, Irwin K, Solihin Y, et al. Using Prime Numbers for Cache Indexingto Eliminate Conflict Misses [C]. In Proceedings of the10th International Sympo-sium on High Performance Computer Architecture. Washington, DC, USA,2004:288–299.
    [80] Seznec A. A New Case for Skewed-Associativity.[EB/OL].1997. http://hal.inria.fr/inria-00073481/en/.
    [81] Hallnor E G, Reinhardt S K. A Fully Associative Software-managed Cache De-sign [J]. SIGARCH Comput. Archit. News.2000,28(2):107–116.
    [82] QureshiMK,Thompson D,Patt YN.TheV-WayCache: DemandBasedAssocia-tivityviaGlobalReplacement[C].InProceedingsofthe32ndAnnualInternationalSymposium on Computer Architecture. Washington, DC, USA,2005:544–555.
    [83] RolánD,FraguelaBB,DoalloR.AdaptiveLinePlacementwiththeSetBalancingCache [C]. In Proceedings of the42nd Annual IEEE/ACM International Sympo-sium on Microarchitecture. New York, NY, USA,2009:529–540.
    [84] Zhan D, Jiang H, Seth S. Exploiting Set-Level Non-Uniformity of Capacity De-mand to Enhance CMP Cooperative Caching [C]. In Proceedings of the24th IEEEInternational Parallel and Distributed Processing Symposium.2010:1–10.
    [85] Albonesi D H. Selective Cache Ways: On-Demand Cache Resource Alloca-tion [C]. In Proceedings of the32th International Symposium on Microarchitec-ture.1999:248–259.
    [86] PisharathJ,ChoudharyA.AnIntegratedApproachtoReducingPowerDissipationin Memory Hierarchies [C]. In Proceedings of the2002International ConferenceonCompilers,Architecture,andSynthesisforEmbeddedSystems.NewYork,NY,USA,2002:88–97.
    [87] Yang S H, Powell M D, Falsafi B, et al. Exploiting Choice in Resizable CacheDesign to Optimize Deep-Submicron Processor Energy-Delay [C]. In Proceedingsof the8th IEEE Symposium on High-Performance Computer Architecture.2002:151–161.
    [88] Benitez D, Moure J C, Rexachs D, et al. A Reconfigurable Cache Memory withHeterogeneous Banks [C]. In Proceedings of the Conference on Design, Automa-tion and Test in Europe.3001Leuven, Belgium, Belgium,2010:825–830.
    [89] Abella J, González A. Heterogeneous Way-Size Cache [C]. In Proceedings of the20th Annual International Conference on Supercomputing. New York, NY, USA,2006:239–248.
    [90] XieYJ,ZhangYH,WangDS.HeterogeneousAssociativeCacheforMultimediaApplications [C]. In Proceedings of the IASTED European Conference: Internetand Multimedia Systems and Applications. Anaheim, CA, USA,2007:100–105.
    [91] Varadarajan K, Nandy S K, Sharda V, et al. Molecular Caches: A Caching Struc-ture for Dynamic Creation of Application-Specific Heterogeneous Cache Region-s [C]. In Proceedings of the39th Annual IEEE/ACM International Symposium onMicroarchitecture. Washington, DC, USA,2006:433–442.
    [92] Candierendonck H, De Bosschere K. XOR-based hash functions [J]. IEEE Trans-actions on Computers.2005,54(7):800–812.
    [93] Cho S-J, Choi U-S, Hwang Y-H, et al. Design of new XOR-based hash functionsfor cache memories [J]. Comput. Math. Appl.2008,55(9):2005–2011.
    [94] Chang J, Sohi G S. Cooperative Cache Partitioning for Chip Multiprocessors [C].In Proceedings of the21st Annual International Conference on Supercomputing.New York, NY, USA,2007:242–252.
    [95] Belady L A. A Study of Replacement Algorithms for a Virtual-Storage Comput-er [J]. IBM Syst. J.1966,5(2):78–101.
    [96] Lee D, Choi J, Kim J H, et al. LRFU: A Spectrum of Policies that Subsumes theLeastRecentlyUsedandLeastFrequentlyUsedPolicies[J].IEEETrans.Comput.2001,50(12):1352–1361.
    [97] Dybdahl H, Stenstr m P, Natvig L. An LRU-based Replacement Algorithm Aug-mented with Frequency of Access in Shared Chip-multiprocessor Caches [J].SIGARCH Comput. Archit. News.2007,35(4):45–52.
    [98] Zhang C, Xue B. Divide-and-Conquer: A Bubble Replacement for Low LevelCaches [C]. In Proceedings of the23rd International Conference on Supercom-puting. New York, NY, USA,2009:80–89.
    [99] Wong W A, Baer J-L. Modified LRU Policies for Improving Second-Level CacheBehavior [C]. Los Alamitos, CA, USA,2000:49.
    [100] Kampe M, Stenstrom P, Dubois M. Self-Correcting LRU Replacement Poli-cies [C]. In Proceedings of the1st Conference on Computing Frontiers.2004:181–191.
    [101] Kharbutli M, Solihin Y. Counter-Based Cache Replacement Algorithms [C]. InProceedings of the2005International Conference on Computer Design. Washing-ton, DC, USA,2005:61–68.
    [102] Qureshi M K, Jaleel A, Patt Y N, et al. Adaptive Insertion Policies for High Per-formance Caching [J]. SIGARCH Comput. Archit. News.2007,35(2):381–391.
    [103] Jaleel A, Hasenplaugh W, Qureshi M, et al. Adaptive Insertion Policies for Man-aging Shared Caches [C]. In Proceedings of the17th International Conference onParallel Architectures and Compilation Techniques. New York, NY, USA,2008:208–219.
    [104] Jiang S, Zhang X. LIRS: an Efficient Low Inter-reference Recency Set Replace-ment Policy to Improve Buffer Cache Performance [J]. SIGMETRICS Perform.Eval. Rev.2002,30(1):31–42.
    [105] Takagi M, Hiraki K. Inter-reference Gap Distribution Replacement: An ImprovedReplacementAlgorithmforSet-associativeCaches[C].InProceedingsofthe18thAnnual International Conference on Supercomputing. New York, NY, USA,2004:20–30.
    [106] Qureshi M K, Lynch D N, Mutlu O, et al. A Case for MLP-Aware Cache Re-placement [C]. In Proceedings of the33rd Annual International Symposium onComputer Architecture. Washington, DC, USA,2006:167–178.
    [107] Subramanian R, Smaragdakis Y, Loh G H. Adaptive Caches: Effective Shaping ofCache Behavior to Workloads [C]. In Proceedings of the39th Annual IEEE/ACMInternational Symposium on Microarchitecture.2006:385–396.
    [108] Stone H S, Turek J, Wolf J L. Optimal Partitioning of Cache Memory [J]. IEEETrans. Comput.1992,41(9):1054–1068.
    [109] Suh G E, Devadas S, Rudolph L. A New Memory Monitoring Scheme forMemory-Aware Scheduling and Partitioning [C]. In Proceedings of the8th Inter-national Symposium on High-Performance Computer Architecture. Washington,DC, USA,2002:117–.
    [110] QureshiMK,PattYN.Utility-BasedCachePartitioning:ALow-Overhead,High-Performance, Runtime Mechanism to Partition Shared Caches [C]. In Proceedingsof the39th Annual IEEE/ACM International Symposium on Microarchitecture.Washington, DC, USA,2006:423–432.
    [111] ThiébautD,StoneHS,WolfJL.ImprovingDiskCacheHit-RatiosThroughCachePartitioning [J]. IEEE Trans. Comput.1992,41(6):665–676.
    [112] Suh G E, Rudolph L, Devadaslaboratory S, et al. Dynamic Cache Partitioning forSimultaneous Multithreading Systems [C]. In Proceedings of the IASTED Inter-national Conference on Parallel and Distributed Computing and Systems.2001:116–127.
    [113] Chiou D T. Extending the Reach of Microprocessors: Column and CuriousCaching [D].[S. l.]: Massachusetts Institute of Technology,1999. AAI0801277.
    [114] Chiou D, Devadas S, Rudolph L, et al. Dynamic Cache Partitioning via Colum-nization [C]. In Proceedings of Design Automation Conference.2000.
    [115] Chiou D, Jain P, Rudolph L, et al. Application-Specific Memory Managemen-t for Embedded Systems Using Software-Controlled Caches [C]. In Proceedingsof the37th Annual Design Automation Conference. New York, NY, USA,2000:416–419.
    [116] Ravindran R, Chu M, Mahlke S. Compiler-Managed Partitioned Data Caches forLow Power [C]. In Proceedings of the2007ACM SIGPLAN/SIGBED conferenceon Languages, compilers, and tools for embedded systems. New York, NY, USA,2007:237–247.
    [117] ZhangX,DwarkadasS,ShenK.TowardsPracticalPageColoring-basedMulticoreCache Management [C]. In Proceedings of the4th ACM European Conference onComputer Systems. New York, NY, USA,2009:89–102.
    [118] Soares L, Tam D, Stumm M. Reducing the Harmful Effects of Last-Level CachePolluterswithanOS-level,Software-onlyPolluteBuffer[C].InProceedingsofthe41st Annual IEEE/ACM International Symposium on Microarchitecture. Wash-ington, DC, USA,2008:258–269.
    [119] Lin J, Lu Q, Ding X, et al. Gaining Insights into Multicore Cache Partitioning:Bridging the Gap between Simulation and Real Systems [C]. In Proceedings ofthe14th International Symposium on High Performance Computer Architecture.2008:367–378.
    [120] Romer T, Lee D, Bershad B N, et al. Dynamic Page Mapping Policies for CacheConflict Resolution on Standard Hardware [C]. In Proceedings of the1st USENIXSymposium on Operating Systems Design and Implementation.1994:255–266.
    [121] Tam D, Azimi R, Soares L, et al. Managing Shared L2Caches on Multicore Sys-tems in Software [C]. In Proceedings of the Workshop on the Interaction betweenOperating Systems and Computer Architecture.2007:26–33.
    [122] KimS,ChandraD,SolihinY.FairCacheSharingandPartitioninginaChipMulti-processorArchitecture[C].InProceedingsofthe13thInternationalConferenceonParallel Architectures and Compilation Techniques. Washington, DC, USA,2004:111–122.
    [123] Jahre M, Natvig L. A Light-Weight Fairness Mechanism for Chip MultiprocessorMemory Systems [C]. In Proceedings of the6th ACM Conference on ComputingFrontiers. New York, NY, USA,2009:1–10.
    [124] Herdrich A, Illikkal R, Iyer R, et al. Rate-based QoS Techniques for Cache/Mem-ory in CMP Platforms [C]. In Proceedings of the23rd International Conference onSupercomputing. New York, NY, USA,2009:479–488.
    [125] Iyer R. CQoS: A Framework for Enabling QoS in Shared Caches of CMP Plat-forms [C]. In Proceedings of the18th Annual International Conference on Super-computing. New York, NY, USA,2004:257–266.
    [126] Iyer R, Zhao L, Guo F, et al. QoS Policies and Architecture for Cache/Memory inCMP Platforms [J]. SIGMETRICS Perform. Eval. Rev.2007,35(1):25–36.
    [127] Guo F, Solihin Y, Zhao L, et al. A Framework for Providing Quality of Service inChipMulti-Processors[C].InProceedingsofthe40thAnnualIEEE/ACMInterna-tional Symposium on Microarchitecture. Washington, DC, USA,2007:343–355.
    [128] Guo F, Kannan H, Zhao L, et al. From Chaos to QoS: Case Studies in CMP Re-source Management [J]. SIGARCH Comput. Archit. News.2007,35(1):21–30.
    [129] Yeh T Y, Reinman G. Fast and Fair: Data-stream Quality of Service [C]. In Pro-ceedings of the2005International Conference on Compilers, Architectures andSynthesis for Embedded Systems. New York, NY, USA,2005:237–248.
    [130] HsuLR,ReinhardtSK,IyerR,etal.Communist,Utilitarian,andCapitalistCachePolicies on CMPs: Caches as a Shared Resource [C]. In Proceedings of the15thInternational Conference on Parallel Architectures and Compilation Techniques.New York, NY, USA,2006:13–22.
    [131] Nesbit K J, Moreto M, Cazorla F J, et al. Multicore Resource Management [J].IEEE Micro.2008,28(3):6–16.
    [132] Moreto M, Cazorla F J, Ramirez A, et al. FlexDCP: a QoS Framework for CMPArchitectures [J]. SIGOPS Oper. Syst. Rev.2009,43(2):86–96.
    [133] Rafique N, Lim W-T, Thottethodi M. Architectural Support for Operating System-drivenCMPCacheManagement[C].InProceedingsofthe15thInternationalCon-ference on Parallel Architectures and Compilation Techniques. New York, NY,USA,2006:2–12.
    [134] Kondo M, Sasaki H, Nakamura H. Improving Fairness, Throughput and Energy-Efficiency on a Chip Multiprocessor through DVFS [J]. SIGARCH Comput. Ar-chit. News.2007,35(1):31–38.
    [135] Yu C, Petrov P. Off-chip Memory Bandwidth Minimization through Cache Parti-tioning for Multi-core Platforms [C]. In Proceedings of the47th Design Automa-tion Conference. New York, NY, USA,2010:132–137.
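    [136] Suo G. Weighted Shared Cache Partitioning for Multithreaded Multiprogrammed Workloads [J]. Chinese Journal of Computers. 2008, 31(11): 1938–1947. (in Chinese)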
    [137] Dybdahl H, Stenström P, Natvig L. A Cache-Partition Aware Replacement Policy for Chip Multiprocessors [C]. In Proceedings of 13th International Conference of High Performance Computing. 2006.
    [138] Gecsei J, Slutz D R, Traiger I L. Evaluation Techniques for Storage Hierarchies [J]. IBM Syst. J. 1970, 9(2): 78–117.
    [139] Ranganathan P, Adve S, Jouppi N P. Reconfigurable Caches and Their Application to Media Processing [C]. In Proceedings of the 27th Annual International Symposium on Computer Architecture. New York, NY, USA, 2000: 214–224.
    [140] Jain P, Devadas S, Engels D, et al. Software-assisted Cache Replacement Mechanisms for Embedded Systems [C]. In Proceedings of the 2001 IEEE/ACM International Conference on Computer-aided Design. Piscataway, NJ, USA, 2001: 119–126.
    [141] Sherwood T, Perelman E, Hamerly G, et al. Automatically Characterizing Large Scale Program Behavior [J]. SIGOPS Oper. Syst. Rev. 2002, 36(5): 45–57.
    [142] Su F, Shi X, Liu G, et al. Comparative Evaluation of Multi-core Cache Occupancy Strategies [J]. Parallel and Distributed Systems, International Conference on. 2007, 1: 1–8.
    [143] Rajan K, Ramaswamy G. Emulating Optimal Replacement with a Shepherd Cache [C]. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA, 2007: 445–454.
    [144] Magnusson P S, Christensson M, Eskilson J, et al. Simics: A Full System Simulation Platform [J]. Computer. 2002, 35: 50–58.
    [145] Muralimanohar N, Balasubramonian R. CACTI 6.0: A Tool to Understand Large Caches [R]. 2009.
    [146] Bienia C, Kumar S, Singh J P, et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications [C]. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. New York, NY, USA, 2008: 72–81.
    [147] Hasegawa A, Kawasaki I, Yamada K, et al. SH3: High Code Density, Low Power [J]. IEEE Micro. 1995, 15(6): 11–19.
    [148] Hardavellas N, Pandis I, Johnson R, et al. Database Servers on Chip Multiprocessors: Limitations and Opportunities [C]. In Proceedings of the Biennial Conference on Innovative Data Systems Research. 2007: 79–87.
    [149] Mudge T. Power: A First-Class Architectural Design Constraint [J]. Computer. 2001, 34(4): 52–58.
    [150] Gepner P, Kowalik M F. Multi-Core Processors: New Way to Achieve High System Performance [C]. In Proceedings of the International Symposium on Parallel Computing in Electrical Engineering. Washington, DC, USA, 2006: 9–13.
    [151] Loghi M, Poncino M, Benini L. Cache Coherence Tradeoffs in Shared-Memory MPSoCs [J]. ACM Trans. Embed. Comput. Syst. 2006, 5(2): 383–407.
    [152] Loghi M, Poncino M. Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions [C]. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1. Washington, DC, USA, 2005: 508–513.
    [153] Ku J C, Ozdemir S, Memik G, et al. Power Density Minimization for Highly-Associative Caches in Embedded Processors [C]. In Proceedings of the 16th ACM Great Lakes Symposium on VLSI. New York, NY, USA, 2006: 100–104.
    [154] Bahar R I, Albera G, Manne S. Power and Performance Tradeoffs Using Various Caching Strategies [C]. In Proceedings of the 1998 International Symposium on Low Power Electronics and Design. New York, NY, USA, 1998: 64–69.
    [155] Bergamaschi R, Han G, Buyuktosunoglu A, et al. Exploring Power Management in Multi-core Systems [C]. In Proceedings of the 2008 Asia and South Pacific Design Automation Conference. Los Alamitos, CA, USA, 2008: 708–713.
    [156] Settle A, Connors D, Gibert E, et al. A Dynamically Reconfigurable Cache for Multithreaded Processors [J]. J. Embedded Comput. 2006, 2(2): 221–233.
    [157] Carvalho M B, Góes L F W, Martins C A P S. Dynamically Reconfigurable Cache Architecture Using Adaptive Block Allocation Policy [C]. In Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC, USA, 2006: 217–217.
    [158] Tao J, Kunze M, Nowak F, et al. Performance Advantage of Reconfigurable Cache Design on Multicore Processor Systems [J]. Int. J. Parallel Program. 2008, 36(3): 347–360.
