On-Chip Cache Resource Management Based on Performance Monitoring Hardware Support

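English title: On-Chip Cache Management with Performance Monitoring Hardware Support
Author: Liu Yu (刘玉)
Degree level: Doctoral
Discipline: Computer Architecture
Keywords (translated from the Chinese): chip multiprocessor ; shared cache management ; performance monitoring ; memory access load balance ; cache control
Keywords (author's English): chip multiprocessor ; shared resources management ; performance monitoring ; memory load balance ; cache controlling
Degree year: 2013
Advisors: An Hong (安虹) ; Li Guojie (李国杰)
Discipline code: 081201
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2013-05-01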

Abstract

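Using on-chip caches efficiently is an important topic in multi-core processor research. Existing on-chip cache management mechanisms are transparent to software: they cannot sense the locality characteristics of a program's data sets at run time, nor can they distinguish the different memory access requests coming from multiple threads. On the one hand, when several threads run on a multi-core processor at the same time, existing cache management policies cannot guarantee the performance of each task, and they cause unpredictable cache contention among the tasks that share the cache; the resulting mutual interference lowers system throughput. On the other hand, because software cannot control how cache space is allocated and management is left entirely to hardware, programs use the cache inefficiently; single-threaded programs in particular cannot exploit the abundant on-chip cache resources of a multi-core processor to speed up execution.
     To address these problems, this dissertation studies how to use hardware performance monitoring units to observe a program's memory access characteristics at run time, and how to use that information both to manage shared-cache contention among concurrently running threads and to allocate cache space for a running single-threaded program. The goals are higher throughput and more stable performance for multi-task systems, and an efficient means of cache control for single-threaded execution. The research content and main contributions are as follows: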
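     (1) We studied performance monitoring mechanisms that can sense a program's memory access characteristics at run time, and proposed LWM, a low-cost memory access performance monitoring scheme built on the performance monitoring unit. LWM exposes a running program's memory access performance information to user level and supplies system-level resource usage information to the cache manager, which reduces the cost of memory access performance monitoring. In the implementation, we added performance event members to each task structure, provided a system call interface for event configuration, and handled the miscounts that arise on counter overflow and during context switches. We also optimized the time-division multiplexing of the performance counters, which improves both the measurement precision of each event when many events are monitored and the utilization of the counters.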
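     (2) We studied contention among multiple tasks for the shared cache, introduced the concept of memory access load, and designed a memory access load balance scheduling algorithm that improves multi-task system throughput and the performance stability of programs. The algorithm is modeled on the computational load balancing scheduler of the operating system and can serve as an extension of the operating system's load balancing facility. Because it is implemented as a user-level load scheduling system, it requires no changes to the operating system kernel. Experimental comparison with other scheduling algorithms shows that it achieves considerable improvements in weighted speedup and overall system throughput, lowers the intensity of contention for the shared cache, and reduces the total number of off-chip memory requests. Thanks to its stable behavior, memory access load balance scheduling also reduces the performance variation between repeated runs of a program, and can support fair and reliable task scheduling in the operating system.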
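     (3) We studied the low cache space utilization of single-threaded programs running on multi-core platforms and proposed VSCP, a new cache control mechanism that raises the cache utilization of single-threaded programs and speeds up their execution. VSCP combines the cache resources of the whole system and gives programmers an explicit cache control interface; the physically distributed cache space is virtualized into a centralized cache under user control. Unlike approaches that parallelize a program to maximize the use of computing resources, VSCP tries to maximize the utilization of cache resources. VSCP keeps a single-threaded program on only one processor core at any given time, which reduces the power consumed by having multiple cores active at once. In addition, when the on-chip cache cannot hold a program's entire working set, VSCP can be used to keep selected data sets with strong locality resident in the cache so that they are not evicted or polluted, which lowers the cache miss rate and ultimately speeds up the program.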
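     Contribution (1) mentions two generic counter-handling problems: miscounts when a hardware counter overflows, and reduced precision when several events share a counter by time multiplexing. The Python sketch below illustrates the usual arithmetic behind both. It is not LWM's implementation; the 48-bit counter width, the function names, and the linear extrapolation by enabled and running time are assumptions made for illustration only.

```python
COUNTER_BITS = 48  # assumed width; real counter widths depend on the processor


def counter_delta(prev_raw, curr_raw, bits=COUNTER_BITS):
    """Events counted between two raw reads of a free-running counter.

    Modular subtraction stays correct when the counter wrapped around
    at most once between the two reads.
    """
    return (curr_raw - prev_raw) % (1 << bits)


def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate the count over the whole interval for an event that was
    scheduled on a hardware counter for only part of that interval.

    Assumes the event rate while the event was off the counter matches
    the rate observed while it was on the counter.
    """
    if time_running == 0:
        return 0.0  # the event was never scheduled; nothing to extrapolate
    return raw_count * time_enabled / time_running


if __name__ == "__main__":
    # The counter wrapped between the two reads: 10 events before, 5 after.
    print(counter_delta((1 << COUNTER_BITS) - 10, 5))
    # The event held a counter for one quarter of the measured interval.
    print(scale_multiplexed(1000, time_enabled=4.0, time_running=1.0))
```

     The extrapolation becomes less trustworthy the shorter the running time is relative to the enabled time, which is one reason a better multiplexing schedule can improve measurement precision.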
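     This research led us to the following important insights:
     (1) Memory access performance is very important both for an individual program and for the system as a whole. As the "memory wall" grows more severe, reducing the cache miss rate is a more effective way to improve single-program and overall system performance than reducing the number of executed instructions.
     (2) Existing cache management policies (including the operating system's task scheduling and the cache replacement policy) cannot perceive cache contention and sharing relationships between threads, which leads to inefficient cache management. Cache resource management must be thread-aware; otherwise it cannot support metrics such as system performance, fairness, and quality of service.
     (3) Managing cache resources on multi-core processors ultimately requires cooperation between software and hardware. This calls for redesigning the interface between the program runtime and the cache manager, including better performance monitoring infrastructure (software and hardware) for observing what happens inside the system at run time, and fine-grained cache resource allocation mechanisms. Solving these problems requires the joint effort of operating system designers, hardware architects, and application developers.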
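     The solutions to the key cache resource management problems proposed in this dissertation were all designed and implemented on real hardware platforms and are therefore comparatively practical. They are also general, and can serve as a reference for cache resource management mechanisms on future processor architectures.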
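     The idea behind memory load balance scheduling in contribution (2) can be pictured with a small user-level sketch. The abstract does not define memory load or give the algorithm, so everything here is an assumption: memory load is taken to be a per-task number such as measured last-level cache misses per thousand instructions, a "domain" is a group of cores sharing one cache, and the placement rule is a plain greedy heuristic (heaviest task first, onto the currently lightest domain).

```python
import heapq


def balance_memory_load(task_loads, num_domains):
    """Greedy placement of tasks onto shared-cache domains.

    task_loads:  dict mapping a task name to its (assumed) memory load.
    num_domains: number of shared-cache domains to spread tasks over.
    Returns (groups, totals): task names per domain and the summed load
    of each domain.
    """
    heap = [(0.0, d) for d in range(num_domains)]  # (current load, domain id)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_domains)]
    totals = [0.0] * num_domains
    # Heaviest tasks first; ties broken by name so the result is deterministic.
    ordered = sorted(task_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, load in ordered:
        total, domain = heapq.heappop(heap)  # least-loaded domain so far
        groups[domain].append(name)
        totals[domain] = total + load
        heapq.heappush(heap, (totals[domain], domain))
    return groups, totals


if __name__ == "__main__":
    loads = {"a": 30.0, "b": 25.0, "c": 5.0, "d": 2.0}  # hypothetical values
    print(balance_memory_load(loads, 2))
```

     A user-level scheduler along these lines would then bind each group to the cores of its domain, for example through the operating system's CPU affinity interface, and repeat the placement periodically as the measured loads change.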
Using on-chip cache resources efficiently is a critical issue in chip multiprocessor research. Transparency to software is a main advantage of hardware caches, but it also means the cache is unaware of a program's memory access behavior and of the different requests issued by multiple threads. On one hand, this causes inter-thread cache interference when multiple threads run on a multi-core system; existing cache management schemes cannot ensure the performance of each program and lead to unpredictable cache contention and poor system throughput. On the other hand, because software cannot control cache space allocation, running programs, especially single-threaded ones, use the cache inefficiently and waste much of the on-chip cache space.
     This dissertation focuses on three aspects of cache resource management: monitoring information about running programs, managing cache contention among multiple threads, and allocating cache space under software control. We implemented a low-cost scheme for monitoring the run-time behavior of programs, improved system throughput and performance stability when running multiple threads, and provided cache control measures for the execution of single-threaded programs. The major research contributions of this dissertation include:
     (1) Based on the performance monitoring units embedded in modern processors, a low-cost performance monitoring tool named LWM is implemented. With LWM, the underlying information about running programs can be accessed at user level. Performance event records are added to each task structure, and a system call interface is provided for event configuration. In addition, performance counter overflows and miscounts across context switches are handled properly. Event monitoring precision and performance counter utilization are improved through an optimized hardware counter multiplexing mechanism.
     (2) We proposed the concept of memory load and designed a memory load balance (MLB) scheduling algorithm to improve system throughput and the performance stability of running programs. Modeled on load balance scheduling in the operating system, the MLB scheduling algorithm is implemented at user level and does not require modifying the operating system kernel; it can therefore serve as an auxiliary facility of the process scheduling mechanism. Compared with other scheduling algorithms, the MLB algorithm performs better in weighted speedup and system throughput, and it substantially reduces the number of off-chip memory requests. More importantly, the MLB algorithm is stable: it reduces the performance deviation between different runs. This makes it a possible basis for a fair and reliable task scheduling algorithm.
     (3) We designed a cache control mechanism named VSCP that improves the caching efficiency of single-threaded programs. VSCP unifies the cache space of the whole system and provides programmers with a cache space allocation interface. Physically distributed caches are virtualized as a single centralized, controllable cache. Instead of parallelizing a single-threaded program to maximize the use of computing resources, VSCP avoids the reprogramming effort and achieves high utilization of cache resources. It also saves power, because the program occupies only one core during any period of time.
     We gained several important insights from this research on cache management:
     (1) With the "memory wall" becoming increasingly severe, memory access performance is very important for both single-program execution and whole-system throughput. Reducing the cache miss rate is becoming more important than reducing the instruction count.
     (2) Existing cache management schemes, including operating system task scheduling and cache replacement policies, cannot obtain information about inter-thread cache contention, which results in inefficient cache management. Cache management schemes should be thread-aware; otherwise, they cannot provide assurances of performance, fairness, and quality of service.
     (3) Software and hardware co-design should be the best choice for solving the cache resource contention problem. We need to design new interfaces between application runtimes and cache management, create better performance monitoring infrastructure (in both hardware and software) that permits better "observability" of what is happening inside the system, and create better mechanisms for fine-grained resource allocation in hardware. Addressing these problems will require an interdisciplinary effort by operating system designers, hardware architects, and application developers.
     The cache management schemes proposed in our work are practical and are implemented on a real system. These solutions are general and can serve as a reference for future system architectures.
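     The abstract reports results in terms of weighted speedup without defining it. A commonly used definition sums, over all co-running programs, the ratio of each program's IPC when sharing the machine to its IPC when running alone. The Python sketch below computes that quantity; whether the dissertation uses exactly this form is an assumption, and the sample IPC values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-program IPC ratios (co-scheduled run / solo run).

    ipc_shared[i]: IPC of program i while co-running with the others.
    ipc_alone[i]:  IPC of program i when it runs by itself.
    With n programs, a value of n means no program was slowed down by sharing.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("both lists must describe the same programs")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    # Two programs, each running at half of its solo IPC when co-scheduled.
    print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
```

     Because each term is normalized by the program's own solo IPC, a scheduler cannot raise this metric simply by favoring programs that have a high IPC to begin with.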

