多核系统中的程序性能优化研究

英文题名：Study on Program Performance Optimization in Multi-Core Systems
作者：张琦
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：程序性能优化 ; 多核系统 ; 线程调度 ; 多线程程序可扩放性性能优化
英文关键词：Program performance optimization ; Multi-core system ; Thread scheduling ; Multi-thread program scalability performance optimization
学位年度：2010
导师：许胤龙
学科代码：081202
学位授予单位：中国科学技术大学
论文提交日期：2010-05-01

摘要

多核处理器在一个处理器芯片上集成多个处理器核心,可同时执行多个线程。长期以来,处理器芯片上的晶体管数目不断增加,处理器的设计越来越复杂,但因为功耗和工艺等方面的限制,处理器的时钟频率无法再继续提高。随着处理器厂商纷纷推出各自的多核处理器,多核系统在我们的工作和生活中迅速得到普及,并且每个处理器中的核数目还在不断的增加。多核处理器的普及给应用程序的发展带来了巨大的挑战,多核处理器中每个核的计算能力并没有增强,它是通过组合多个处理核来提供强大的计算能力。传统的串行应用程序无法方便的直接借助处理器核数目的增加提升性能,必须通过并行化或者同时执行多个程序才能充分发挥多核系统的计算能力。
     本文从应用程序性能优化和系统整体性能优化两个角度,研究了多核系统中的程序性能优化方法,并验证其有效性。本文的主要工作和创新点如下:
     1.对于多核系统中的应用程序性能优化,本文分别研究了串行程序性能优化方法,并行程序设计方法和并行程序性能优化方法。通过为程序设计并行算法并实现,可以使程序同时利用多个核的计算能力。通过对并行程序进行优化,可以使程序更充分的发挥多个核的计算能力,其方法包括增加任务数量改善负载均衡,选择最优的线程与处理核之间关联策略,设计无锁机制减少同步开销,消除线程间高速缓存伪共享等等。
     2.本文通过对多个图像特征提取和马尔可夫决策过程求解程序进行性能优化,使这些应用程序在多核系统中的性能获得了较大提升,并验证了所采用的性能优化方法能够有效的提高应用程序在多核系统中的性能。
     3.对于多核系统整体性能的优化,本文研究了多线程之间对共享缓存空间的竞争问题,这种竞争会损害整个系统以及各个程序的性能。本文提出了基于工作集模型分析和预测共享缓存上线程竞争情况的方法,并发现如果同时运行线程的工作集大小之和超出共享缓存容量,或者同时运行线程的时间局部性强度差异较大时,线程受到的干扰就会比较剧烈,性能损失比较严重。
     4.本文提出了一种基于工作集模型的线程调度方法。本方法通过一组监测单元以较小的代价获得线程的工作集大小和时间局部性强度属性,并根据一套线程调度策略,选取合适的线程同时运行,保证线程的工作集数据可以保存在高速缓存之中。实验结果表明,基于工作集模型的线程调度方法较好的缓解了共享缓存上线程间的互相竞争,有效提高了整个系统和各个程序的性能。
The multi-core processor integrates multiple processor cores on a single chip which can run multiple threads simultaneously. Over the years, the number of transistors on a processor chip grows constantly, and the design of processor becomes more and more complex. But for the power and some other aspects, the clock frequency can not increase more. With the processor manufacturers introducing their multi-core processors, multi-core systems become popular in our work and life. And the number of cores in a chip is increasing constantly. The popularity of multi-core processors brings enormous challenges to application program. In a multi-core processor, the computing power of each core is not enhanced. It combines multiple processor cores to provide powerful computing ability. The traditional serial applications can not directly improve performance by the increasing of processor cores. To fully use computing power of multi-core processor, we must run parallel program or multiple programs concurrently on the system.
     This paper studies program performance optimization on multi-core system by two perspectives, application program performance optimization and overall system performance optimization. The major work and innovation of this paper are as follows:
     1. For application program performance optimization in multi-core systems, this paper studies serial program performance optimization methods, parallel program design methods and parallel program performance optimization methods. To use computing power of multiple cores, we need design and implemation parallel algorithm for programs. To use computing power of multiple cores more fully, we need optimize prallel program performance. The parallel program performance optimization methods studied in this paper include improving load balance by increasing tasks number and reducing tasks size, choosing the best affinity policy between threads and processor cores, designing lock-free structure to reduce synchronization overhead, eliminating false sharing of cache between threads and so on.
     2. This paper applies these performance optimization methods to several image future extraction and Markov decision process solving programs. Experiment results show that the performance of these programs is improved much, and this verifies the program performance optimization methods studied in this paper can effectively improve the program performance in multi-core systems.
     3. For overall system performance optimization, this paper studies the threads contention problem on the shared last level cache of multi-core systems, which may reduce performance of each thread and overall system. We proposes a method based on working set model to analysis and estimate the thread contention problem on the multi-core systems. We find that once total working set size of the threads exceed the shared cache space or the difference of temporal locality is great, the interference thread suffer is severe and the performance degradation is serious.
     4. This paper proposes a thread scheduling method based on the working set model. In this method, we design a set of monitoring unit on shared cache, which collect the working set size and temporal locality of threads by low overhead. We also propose an operating system thread scheduling policy, which select appropriate threads running simultaneously to ensure the working sets of threads can be kept in the shared cache. The experimental results show that the thread scheduling method based on working set model remits the threads contention on the shared cache, effectively improves the performance of overall system and each program.

引文

陈国良. 1994.并行算法的设计与分析[M].北京:高等教育出版社.
    陈国良. 1999.并行计算——结构,算法,编程[M].北京:高等教育出版社.
    陈国良,安虹,陈崚,郑启龙,单久龙. 2004.并行算法实践[M].北京:高等教育出版社.
    陈国良,吴俊敏,章锋,章隆兵. 2002.并行计算机体系结构[M].北京:高等教育出版社.
    李辉,杨金才. 2006.多核CPU综述[C]. 2006中国计算机学会体系结构专委会学术年会.计算机科学.成都.
    李瑜,李磊. 1999.基于内容的图像检索的方法研究[J].计算机科学. 1999年08期.
    刘必慰,陈书明,汪东. 2007.先进微处理器体系结构及其发展趋势[J].计算机应用研究. 2007年03期.
    英特尔公司. 2009.英特尔?数学核心函数库10.1——FFT [EB]. http://www.intel.com/cd/software/products/apac/zho/329183.htm.
    章隆兵,吴少刚,蔡飞,胡伟武. 2004. PC机群上共享存储与消息传递的比较[J].软件学报. 2004年06期.
    周茂成. 1999.并行编程模型的现状与未来发展方向[J].电子计算机. 1999年2月:11-17.
    Awasthi M, Sudan K, Balasubramonian R, Carter J. 2009. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches [C]. In Proceedings of International Symposium on High Performance Computer Architecture 2009. 250-261.
    Bailey D, Harris T, Saphir W, Wijngaart R V D, Woo A, Yarrow M. 1995. The nas parallel benchmarks 2.0 [R]. Technical Report NAS-95-020, NASA Ames Research Center.
    Banikazemi M, Poff D, Abali B. 2008. Pam: a novel performance/power aware meta-scheduler for multi-core systems [C]. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Austin, Texas.
    Bellman R E. 1957. Dynamic Programming [M]. Princeton: Princeton University Press.
    Bhulai S. 2002. Markov Decision Processes the control of high-dimensional system [D]. Dissertation. Amsterdam: University press.
    Bryant R E, O’Hallaron D R. 2002. Computer Systems: A Programmer’s Perspective [M]. Prentice Hall, Inc.
    Cao J, Lan Y, Li J, Li Q, Li X, Lin F, Liu X, Luo L, Peng W, Wang D, Wang H, Wang Z, Xiang Z, Yuan J, Zhang B, Zhang J, Zhang L, Zhang X, Zheng W. 2006. Intelligent Multimedia Group of Tsinghua University at TRECVID 2006 [C]. In Proceedings of the TREC Video RetrievalEvaluation Workshop.
    Castleman K R. 1995. Digital Image Processing [M]. Prentice-Hall, Inc.
    Chandra D, Guo F, Kim S, Solihin Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture [C]. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. 340-351.
    Chang J, Sohi G S. 2007. Cooperative cache partitioning for chip multiprocessors [C]. In Proceedings of the 21st annual international conference on Supercomputing. Seattle, Washington:242-252.
    Chen S, Gibbons P B, Kozuch M, Liaskovitis V, Ailamaki Anastassia, Blelloch G E, Falsafi B, Fix L, Hardavellas N, Mowry T C, Wikerson C. 2007. Scheduling threads for constructive cache sharing on CMPs [C]. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. San Diego, California, USA. 105-115.
    Chen Y, Li E, Li J, Zhang Y. 2007. Accelerating Video Feature Extractions in CBVIR on Multi-core Systems [J]. Intel Technology Journal. 2007, Volume 11.
    Daugman J G, 1988. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression [J]. IEEE Transactions on Acoustics, Speech and Signal Processing. Volume 36, Issue 7:1169-1179.
    Denning P J. 1968. The working set model for program behavior [J]. Communications of the ACM. Volume 11, Issue 5:323-333.
    El-Moursy A, Garg R, Albonesi D, Dwarkadas S. 2006. Compatible phase co-scheduling on a cmp of multi-threaded processors [C]. In Proceedings of IEEE International Parallel and Distributed Processing Symposium’06.
    Estrada F, Jepson A, Fleet D. 2004. Local Features Tutorial [R]. Technique report. http://www.cs.toronto.edu/~jepson/csc2503/tutSIFT04.pdf.
    Fedorova A, Seltzer M, Small C, Nussbaum D. 2005. Performance of multithreaded chip multiprocessors and implications for operating system design [C]. In Proceedings of the annual conference on USENIX Annual Technical Conference. Anaheim, CA. 26.
    Fedorova A, Seltzer M, Smith M D. 2007. Improving performance isolation on chip multiprocessors via an operating system scheduler [C]. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. 25-38.
    Gearhart C. 2003. Genetic Programming as Policy Search in Markov Decision Processes [C]. Genetic Algorithms and Genetic Programming at Stanford 2003. Stanford, California, USA. 61-67.
    Gerber R. 2009. More Work-Sharing with OpenMP [EB]. Intel Software Network.http://software.intel.com/en-us/articles/more-work-sharing-with-openmp/.
    Gerber R, Bik A J C, Smith K B, Tian X. 2006. The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms [M]. Second edition. Intel Press.
    Gopal S, Vijaykumar T N, Smith J E, Sohi G S. 1998. Speculative Versioning Cache [C]. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture. 195.
    Guestrin C E, Koller D, Gearhart C, Kanodia N. 2003. Generalizing Plans to New Environments in Relational MDPs [C]. In Proceedings of the 18th International Joint conference on Artificial Intelligence. Acapulco, Mexico:1003-1010.
    Hammond L, Hubbert B A, Siu M, Prabhu M K, Chen M, Olukotun K. 2000. The Stanford hydra CMP [J]. IEEE Micro. Volume 20, Issue 2:71-84.
    Haraljck R M, Shanmugam K, Dinstein I H. 1973. Textural Features for Image Classification [J]. IEEE Transactions on Systems, Man, and Cybernetics. Vol. SMC-3, No. 6:610-621.
    Heymann S, Müller K, Smolic A, Fr?hlich B, Wiegand T. 2007. SIFT Implementation and Optimization for General-Purpose GPU [C]. In Proceedings of International Conference on Computer Graphics, Visualization and Computer Vision. Plzen, Czech Republic.
    Hirschberg D S. 1976. Parallel algorithms for the transitive closure and the connected components problems [C]. In Proceedings of the eighth annual ACM symposium on Theory of computing. Hershey, Pennsylvania, United States. 55-57.
    Horowitz E, Zorat A. 1983. Divide and conquer for parallel processing [J]. IEEE Transactions on Computers. Volume 32, Issue 6:582-585.
    Howard R A. 1960. Dynamic Programming and Markov Processes [M]. Technology Press of Massachusetts Institute of Technology.
    Huang J, Kumar S R, Mitra Mandar, Zhu W J, Zabih R. 1997. Image indexing using color correlograms [C]. In Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition:762.
    Huang J, Kumar S R, Mitra Mandar, Zhu W J, Zabih R. 1999. Spatial color indexing and applications [J]. International Journal of Computer Vision. Volume 35, Issue 3:245-268.
    Hughes C J, Grzeszczuk R, Sifakis E, Kim D, Kumar S, Selle A P, Chhugani J, Holliman M, Chen Y K. 2007. Physical Simulation for Animation and Visual Effects: Parallelization and Characterization for Chip Multiprocessors [J]. ACM SIGARCH Computer Architecture News. Volume 35, Issue 2:220-231.
    IEEE Standards Board. 1990. IEEE Standard Glossary of Image Processing and Pattern Recognition Terminology [M]. New York, NY, USA: Institute of Electrical and ElectronicsEngineers, Inc.
    Intel Corporation. 2002. Intel? Pentium?4 and Intel? Xeon? Processor Optimization Reference Manual [M]. Intel Press. Order number: 248966-007.
    Intel Corporation. 2006. Intel? 64 and IA-32 Architectures Optimization Reference Manual [M]. Intel Press.
    Jiang Y, Shen X, Chen J, Tripathi R. 2008. Analysis and approximation of optimal co-scheduling on chip multiprocessors [C]. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques. Toronto, Ontario, Canada. 220-229.
    Kaelbling L P Littman M L, Moore A W. 1996. Reinforcement Learning: A Survey [J]. Journal of Artificial Intelligence Research. Volume 4:237-285.
    Keckler S W, Dally W J, Maskit D, Carter N P, Chang A, Lee W S. 1998. Exploiting Fine-Grain Thread Level Parallelism on the MIT multi-ALU Processor [J]. ACM SIGARCH Computer Architecture News. Volume 26, Issue 3:306-317.
    Kim S, Chandra D, Solihin Y. 2004. Fair cache sharing and partitioning in a chip multiprocessor architecture [C]. In Proceedings of PACT’04. 111-122.
    Knauerhase R, Brett P, Hohlt B, Li T, Hahn S. 2008. Using os observations to improve performance in multicore systems [J]. IEEE Micro. Volume 28, Issue 3:54-66.
    Kumar V, Grama A, Gupta A, Karypis G. 1994. Introduction to Parallel Computing: Algorithm Design and Analysis [M]. Redwod City: Benjamin Commings/Addison Wesley.
    Ladner R E, Fischer M J. 1980. Parallel prefix computation [J]. Journal of ACM. Volume 27, Issue 4:831-838.
    Lee R, Ding X, Chen F, Lu Q, Zhang X. 2009. MCC-DB: minimizing cache conflicts in multi-core processors for databases [C]. In Proceedings of the VLDB Endowment. Volume 2, Issue 1:373-384.
    Lee T S. 1996. Image representation using 2d gabor wavelets [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence. Volume 18, Issue 10:959-971.
    Lin J, Lu Q, Ding X, Zhang Z, Zhang X, Sadayappan P. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems [C]. In Proceedings of International Symposium on High Performance Computer Architecture. Salt Lake. 367-378.
    Lin J, Lu Q, Ding X, Zhang Z, Zhang X, Sadayappan P. 2009. Enabling software management for multicore caches with a lightweight hardware support [C]. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. Portland, Oregon.
    Littman M L, Dean T L, Kaelbling L P. 1995. On the complexity of solving Markov decisionproblems [C]. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence. San Mateo, CA:394-402.
    Liu C, Sivasubramaniam A, Kandemir M. 2004. Organizing the last line of defense before hitting the memory wall for cmps [C]. In Proceedings of International Symposium on High Performance Computer Architecture 2004. 176-185.
    Lowe D G. 2004. Distinctive Image Features form Scale-Invariant Keypoints [J]. International Journal of Computer Vision. Springer Netherlands. Volume 60, Number 2:91-110.
    Lu Q, Lin J, Ding X, Zhang Z, Zhang X, Sadayappan P. 2009. Soft-olp: Improving hardware cache performance through software-controlled object-level partitioning [C]. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques. 246-257.
    Magnusson P S, Christensson M, Eskilson J, Forsgren D, H?llberg G, H?gberg J, Larsson F, Moestedt A, Werner B.2002. Simics: A full system simulation platform [J]. Computer. Vol. 35, No. 2:50-58.
    Martinez J M. 2004. MPEG-7 overview [R]. Version 10. Palma de Mallorca: ISO/IEC JTC1/SC29/WG11 N6828.
    Manjunath B S, Ma W Y. 1996. Texture Features for Browsing and Retrieval of Image Data [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence. Volume 18, Issue 8:837-842.
    Ma W Y, Zhang H J. 1998. Benchmarking of Image Features for Content-based Retrieval [C]. In Conference Record of the Thirty-Second Asilomar Conference on Signals, Systems & Computers. Pacific Grove, CA. Vol.1:253-257.
    Mao J, Jain A K. 1992. Texture classification and segmentation using multiresolution simultaneous autoregressive models [J]. Pattern Recognition. Volume 25, Issue 2:173-188.
    Marques O, Costa F M, Furht B. 2000. Content-Based Image Search and Retrieval Using Relevance Feedback: the MUSE Project [C]. In Proceedings of the IASTED International Conference on Internet and Multimedia System and Applications. Las Vegas, USA.
    Miao Q, Chen Y, Li J, Zhang Q, Zhang Y, Chen G. 2009. Parallelization and Optimization of a CBVIR System on Multi-Core Architectures [C]. In Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing. 1-8.
    Olukotun K, Nayfeh B A, Hammond L, Wilson K, Chang K. 1996. The case for a single-chip multiprocessor [J]. ACM SIGPLAN Notices. Volume 31, Issue 9:2-11.
    OpenMP Architecture Review Board. 2002. OpenMP C and C++ application program interface [S]. Version 2.0.
    Otterlo M V. 2005. A Survey of Reinforcement Learning in Relational Domains [R]. TechnicalReport TR-CTIT-05-31, University of Twente.
    Picard R W, Kabir T, Liu F. 1993. Real-time Recognition with the entire Brodatz Texture Database [C]. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York:638-639.
    Pisharath J, Liu Y, Liao W K, Choudhary A, Memik G, Parhi J. 2005. Nu-minebench 2.0 [R]. Center for Ultra-Scale Computing and Information Security, Northwestern University. Technical Report CUCIS-2005-08-01.
    Puterman M L. 1994. Markov Decision Processes [M]. Wiley Press.
    Qiao L, Raman V, Reiss F, Haas P J, Lohman G M. 2008. Main-memory scan sharing for multi-core cpus [C]. In Proceedings of the VLDB Endowment. Volume 1, Issue 1:610-621.
    Qureshi M K, Patt Y N. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches [C]. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. 423-432.
    Rothberg E, Singh J P, Gupta A. 1993. Working sets, cache sizes, and node granularity issues for large-scale multiprocessors [C]. In Proceedings of the 20th annual international symposium on Computer architecture. San Diego, California, United States. 14-26.
    Salehi J D, Kurose J F, Towsley D. 1996. The effectiveness of affinity-based scheduling in multiprocessor network protocol processing [J]. IEEE/ACM Transactions on Networking. Volume 4, Issue 4:516-530.
    Settle A, Kihm J, Janiszewski A, Connors D. 2004. Architectural support for enhanced smt job scheduling [C]. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. 63-73.
    Skrypnyk I, Lowe D G. 2004. Scene Modeling, Recognition and Tracking with Invariant Image Features [C]. In Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality:110-119.
    Sinha S N, Frahm J M, Pollefeys M, Genc Y. 2007. Feature Tracking and Matching in Video Using Programmable Graphics Hardware [J]. Machine Vision and Applications. Springer Berlin / Heidelberg.
    Smith J R, Chang S F. 1996. Automated binary texture feature sets for image retrieval [C]. In Proceedings of the Acoustics, Speech, and Signal Processing. Atlanta, GA. Volume 4:2239-2242.
    Snavely A, Tullsen D M, Voelker G. 2002. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor [C]. ACM SIGMETRICS Performance Evaluation Review. Volume 30, Issue 1:66-76.
    Sohi G, Breach S, Vijaykumar T. 1995. Multiscalar processors [C]. In Proceedings of the 22nd annual international symposium on Computer architecture. S. Margherita Ligure, Italy. 414-425.
    Steffan J G, Mowry T. 1998. The potential for Using Thread-level Data Speculation to Facilitate Automatic Parallelization [C]. In Proceedings of the 4th International Symposium on Hight-Performance Computer Architecture. 2.
    Suh G E, Rudolph L, Devadas S. 2004. Dynamic partitioning of shared cache memory [J]. The Journal of Supercomputing. Volume 28, Issue 1:7-26.
    Swain M J, Ballard D H. 1991. Color Indexing [J]. International Journal of Computer Vision. Volume 7, Issue 1:11-32.
    Tam D, Azimi R, Soares L, Stumm M. 2007. Managing shared l2 caches on multicore systems in software [C]. In Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.
    Tam D, Azimi R, Stumm M. 2007. Thread clustering: sharing-aware scheduling on smp-cmp-smt multiprocessors [C]. In Proceedings of the 2nd ACM SiGOPS/EuroSys European Conference on Computer Systems. Lisbon, Portugal. 47-58.
    Todd S. 1978. Algorithms and hardware for a merge sort using multiple processors [J]. IBM Journal of Research and Development. Volume 22, Issue 5:509-517.
    Valiant L G. 1975. Parallelism in comparison problems [J]. SIAM Journal on Computing. Volume 4, Issue 3:348-355.
    Vuyst M D, Kumar R, Tullsen D M. 2006. Exploiting unbalanced thread scheduling for energy and performance on a cmp of smt processors [C]. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2006.
    Watkins C J C H, Dayan P. 1992. Q-learning [J]. Machine Learning Journal. Special Issue on Reinforcement Learning. Volume 8, Number 3-4:279-292.
    Xu K, Georgescu B, Comaniciu D, Meer P. 2000. Performance Analysis in Content-based Retrieval with Textures [C]. In Proceedings of the International Conference on Pattern Recognition. Volume 4:4275.
    Zhang X, Dwarkadas S, Shen K. 2009. Towards practical page coloring-based multicore cache management [C]. In Proceedings of the 4th ACM European conference on Computer systems. Nuremberg, Germany. 89-102.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700