Using on-chip cache resources efficiently is a critical issue in chip multiprocessor research. Software transparency is a main advantage of hardware cache memory, but it also means the cache is unaware of a program's memory access behavior and of the differing demands of multiple threads. On one hand, this causes inter-thread cache interference when multiple threads run on a multi-core system; existing cache management schemes cannot guarantee the performance of each program, which leads to unpredictable cache contention and poor system throughput. On the other hand, it makes caching inefficient for running programs, especially single-threaded ones: because software cannot control cache space allocation, a large amount of on-chip cache space is wasted.
     This dissertation focuses on three aspects of cache resource management: monitoring information about running programs, managing cache contention among multiple threads, and allocating cache space under software control. We implemented a low-cost scheme for monitoring the runtime behavior of programs; improved system throughput and performance stability when running multiple threads; and provided cache control measures for the execution of single-threaded programs. The major research contributions of this dissertation are:
     (1) Based on the performance monitoring units embedded in modern processors, a low-cost performance monitoring tool named LWM is implemented. With LWM, low-level information about running programs can be accessed at user level. Performance event records are added to each task structure, and a system call interface is provided for configuring events. In addition, performance counter overflows and miscounting are properly handled across context switches. Event monitoring precision and performance counter utilization are improved through an optimized hardware counter multiplexing mechanism.
     (2) We propose the concept of memory load and design a memory load balance (MLB) scheduling algorithm to improve system throughput and the performance stability of running programs. Modeled on load balance scheduling in operating systems, the MLB algorithm is implemented at user level and does not require modifying the operating system kernel; it can therefore serve as an auxiliary facility to the process scheduling mechanism. Compared with other scheduling algorithms, the MLB algorithm achieves better weighted speedup and system throughput and eliminates a large number of off-chip memory requests. More importantly, the MLB algorithm is stable, reducing performance deviation between runs. It makes it possible to implement a task scheduling algorithm that is fair and reliable.
     (3) We design a cache control mechanism named VSCP that improves the caching efficiency of single-threaded programs. VSCP unifies the cache space of the whole system and provides programmers with a cache space allocation interface. Physically distributed caches are virtualized as a single block of centrally controllable cache. Rather than parallelizing a single-threaded program to make full use of computing resources, VSCP avoids the reprogramming effort while achieving high utilization of cache resources. It also saves power, because only a single thread runs in a given period of time.
     Our research on cache management led to several important insights:
     (1) As the "memory wall" problem grows increasingly serious, memory access performance is critical both to the execution of a single program and to whole-system throughput. Reducing the cache miss rate is becoming more important than reducing the instruction count.
     (2) Existing cache management schemes, including operating system task scheduling and cache replacement policies, cannot obtain information about inter-thread cache contention, which results in inefficient cache management. Cache management schemes should be thread-aware; otherwise, they cannot provide guarantees on performance, fairness, or quality of service.
     (3) Software and hardware co-design should be the best approach to solving the cache resource contention problem. We need to design new interfaces between application runtimes and cache management, create better performance monitoring infrastructure (in both hardware and software) that permits better "observability" of what is happening inside the system, and create better mechanisms for fine-grained resource allocation in hardware. Addressing these problems will require an interdisciplinary effort by operating system designers, hardware architects, and application developers.
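     As an illustration of the counter handling mentioned under contribution (1), the sketch below shows two techniques commonly used with hardware performance counters: computing a wrap-tolerant delta between two reads, and scaling a time-multiplexed count up to the full monitoring interval. It is not LWM's implementation; the function names, the 48-bit counter width, and the time fields are assumptions made for the example.

```python
def counter_delta(prev: int, curr: int, width_bits: int = 48) -> int:
    """Events counted between two reads of a free-running counter.

    Tolerates at most one wrap-around between the reads.
    The 48-bit default width is an assumption, not a property of LWM.
    """
    return (curr - prev) % (1 << width_bits)


def scaled_estimate(raw_count: int, time_enabled: int, time_running: int) -> float:
    """Extrapolate a multiplexed count to the whole enabled interval.

    time_enabled: how long the event was requested.
    time_running: how long it actually occupied a hardware counter.
    """
    if time_running == 0:
        return 0.0  # the event never got a counter, so nothing can be estimated
    return raw_count * time_enabled / time_running


if __name__ == "__main__":
    # Counter wrapped between the two reads.
    print(counter_delta((1 << 48) - 10, 5))
    # Event held a counter for a quarter of the interval.
    print(scaled_estimate(1000, 400, 100))
```

     The scaled value is only an estimate: it assumes the event rate while the counter was scheduled is representative of the whole interval, which is why a better multiplexing policy can improve precision.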
     The cache management schemes proposed in this work are practical and have been implemented on a real system. These solutions are general and can serve as a reference for future system architectures.
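     The idea of balancing memory load in contribution (2) can be sketched as follows. The abstract does not define memory load or give the MLB algorithm, so this example makes its own assumptions: each thread's memory load is a single number (for instance, last-level cache misses per second), and threads are placed greedily, heaviest first, on whichever cache domain currently carries the least load. The helper names are hypothetical. The weighted speedup function uses the standard definition, the sum over programs of shared-mode IPC divided by stand-alone IPC.

```python
import heapq


def balance_memory_load(loads: dict[str, float], n_domains: int) -> list[list[str]]:
    """Greedy heaviest-first placement of threads onto cache domains.

    loads maps a thread name to its (assumed) memory load.
    Returns one list of thread names per domain.
    """
    heap = [(0.0, d) for d in range(n_domains)]  # (total load, domain index)
    heapq.heapify(heap)
    groups: list[list[str]] = [[] for _ in range(n_domains)]
    # Heaviest threads first; ties broken by name for a deterministic result.
    for name, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, d = heapq.heappop(heap)  # least-loaded domain so far
        groups[d].append(name)
        heapq.heappush(heap, (total + load, d))
    return groups


def weighted_speedup(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """Sum of per-program IPC when co-running divided by IPC when running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    demo = {"a": 9.0, "b": 7.0, "c": 2.0, "d": 1.0}
    print(balance_memory_load(demo, 2))
    print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
```

     A greedy placement like this does not guarantee an optimal split, but it is cheap enough to rerun periodically at user level as measured loads change.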
