Research on Parallel Program Optimization Techniques for Embedded Multicore Systems
Abstract
Low power consumption has traditionally been the primary objective of embedded system design. However, as computation-intensive embedded applications continue to expand and performance and power requirements keep rising, embedded systems have recently turned toward high-performance embedded computing (HPEC). Facing increasingly complex embedded applications, the chip multiprocessor (CMP) has become an effective solution for high-performance embedded computing: it combines several cores of moderate individual performance to improve energy efficiency, and exploits a high degree of task-level or thread-level parallelism to raise overall performance. In the embedded field, fully exploiting the high performance and low power that CMPs offer poses a great challenge for parallel applications on embedded multicore platforms.
Low power consumption and high performance are core requirements of embedded multicore systems. If on-chip multicore technology cannot be used to parallelize applications effectively, the performance of the applications built on it will suffer, and resources and energy will be wasted. That situation is intolerable in the embedded field, where resources and energy are at a premium. Therefore, designing and implementing high-performance, low-power parallel computing methods for embedded applications is one of the core problems that must be solved before embedded multicore systems can be widely adopted.
For these reasons, this thesis analyzes in depth the performance and power optimization methods used in current high-performance embedded computing, focusing on parallel compiler design and parallel program optimization for embedded multicore platforms. The main contributions and technical innovations are as follows:
First, an OpenMP parallel compilation method for embedded multicore platforms is proposed, and the OpenMP directives are extended on this basis to support parallel optimization. Taking the embedded operating system eCos as a case study, a source-to-source parallel compiler based on the shared-memory parallel programming model OpenMP is designed and implemented for an embedded multicore platform. On top of the compiler, a parallel loop optimization algorithm based on the hierarchical memory structure of embedded multicore processors is proposed, and the OpenMP loop directives are extended with a tiling clause, improving both parallel programming productivity and parallel performance on embedded multicore platforms. Experiments verify the effectiveness and performance of the extended directive on an embedded multicore platform.
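The loop transformation behind such a tiling extension can be illustrated with a minimal C sketch. The generated code below is an illustrative assumption, not the thesis's actual compiler output; the point is that a source-to-source compiler rewrites an N x N loop nest into a tile-by-tile traversal so that each tile fits in a core's local (scratchpad or L1) memory.

```c
#include <stddef.h>

/* Illustrative tile edge; a real compiler would derive it from the
 * target's local memory size. */
#define TILE 32

/* Sketch of what a tiling(TILE) clause on a transpose loop nest might
 * generate: the outer ii/jj loops walk over tiles, the inner loops stay
 * inside one tile for temporal locality. An OpenMP worksharing directive
 * (e.g. "#pragma omp parallel for") would be emitted on the tile loops
 * so whole tiles are distributed across cores. */
void transpose_tiled(const double *a, double *b, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```

Because each tile's footprint is bounded, tiling and OpenMP worksharing compose naturally: parallelism is extracted at the tile level while locality is preserved inside each tile.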
Second, a runtime dynamic optimization framework for parallel applications on embedded multicore systems is proposed. In multithreaded programs limited by bandwidth, data contention, or improper synchronization, increasing the number of threads can markedly degrade performance. This thesis therefore presents a performance analysis model based on parallel program structure: it divides each parallel region into a fully parallel part and critical sections, so that the thread count yielding the best performance can be determined dynamically at runtime. To reduce the performance and energy wasted by load imbalance among threads, a dynamic scheduling method built on the same runtime framework is also proposed. It dynamically selects a scheduling policy for each parallel loop and adjusts the chunk size according to thread load to balance performance. The framework is validated and evaluated on an embedded multicore platform, and the experiments show that it is well suited to embedded multicore systems and improves the performance of parallel applications.
Third, a low-power execution model driven by parallel thread load is proposed. To avoid the energy wasted when parallel applications run with imbalanced loads on embedded multicore platforms, this thesis first analyzes the execution load of parallel threads and, combining it with dynamic voltage and frequency scaling (DVFS), proposes and implements a low-power execution model. A thread frequency control algorithm based on this model is then proposed and implemented, allowing the runtime system to adjust each thread's operating frequency dynamically according to the observed load imbalance, reducing energy consumption without affecting the performance of the parallel program. The model is validated on a simulated embedded multicore platform. The experiments show that the proposed low-power execution model saves an average of 13% of the energy consumed by parallel applications on an embedded multicore platform at a 2.2% performance loss.
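The core of a load-driven frequency control policy can be sketched as follows. The frequency table and the just-in-time heuristic are illustrative assumptions: a thread that is busy for only part of each period before waiting at a barrier can be slowed so it arrives just in time, saving energy without delaying the parallel region.

```c
/* Illustrative discrete DVFS levels (MHz); a real platform exposes its
 * own table. */
static const int kFreqLevelsMHz[] = {200, 400, 600, 800, 1000};
static const int kNumLevels = 5;

/* Pick the lowest level that still lets the thread finish its work
 * before the barrier. busy_time / period is the fraction of the period
 * the thread was actually computing at f_cur_mhz; scaling the frequency
 * by that fraction stretches the work to fill the whole period. */
int next_freq_mhz(int f_cur_mhz, double busy_time, double period) {
    double f_ideal = f_cur_mhz * (busy_time / period); /* just-in-time arrival */
    for (int i = 0; i < kNumLevels; i++)
        if (kFreqLevelsMHz[i] >= f_ideal)
            return kFreqLevelsMHz[i];  /* round up to a real DVFS level */
    return kFreqLevelsMHz[kNumLevels - 1];
}
```

Rounding up to the next available level is what keeps the policy performance-neutral: the lightly loaded thread never becomes the new straggler.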
Fourth, a feedback DVFS method based on energy efficiency is proposed. According to the characteristics of parallel applications, the method considers performance and energy consumption jointly, adopting the energy-delay product (EDP) as its energy-efficiency metric. Through a feedback-controlled DVFS framework, it discovers the best DVFS level for each core during the early phase of a parallel program's execution, reducing energy consumption and improving energy efficiency without affecting program performance. Finally, the feedback DVFS method is validated and evaluated by experiments.
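The selection step of such a feedback scheme fits in a few lines. How the per-level energy and delay samples are obtained is left open (hardware counters or a power model would be needed and are not shown); the function below only demonstrates the EDP comparison that picks a level for one core.

```c
/* Given energy E(l) and execution time D(l) measured at each DVFS level
 * during the program's warm-up phase, return the level minimizing the
 * energy-delay product EDP(l) = E(l) * D(l). EDP rewards saving energy
 * only when it does not cost a proportionally larger slowdown. */
int best_edp_level(const double *energy, const double *delay, int nlevels) {
    int best = 0;
    double best_edp = energy[0] * delay[0];
    for (int l = 1; l < nlevels; l++) {
        double edp = energy[l] * delay[l];
        if (edp < best_edp) {
            best_edp = edp;
            best = l;
        }
    }
    return best;
}
```

Running this comparison per core, once, at the start of execution is what makes the scheme feedback-based but cheap: after the initial probing phase each core simply locks in its winning level.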
    [144] HSU B Y C H. Compiler-directed dynamic voltage and frequency scaling forcpu power and energy reduction[D]. Rutgers, The State University of NewJersey,2003.
    [145]赵荣彩.低功耗多线程编译优化技术[J].软件学报,13(6):1123-1129.
    [146]易会战,陈娟,杨学军等,基于语法树的实时动态电压调节低功耗算法[J].软件学报,2005,16(10):1726-1734.
    [147] Kim W, Gupta M S, Wei G Y, et al. System level analysis of fast, per-coreDVFS using on-chip switching regulators[C]. Proceedings of the14thInternational Symposium on High Performance Computer Architecture, SaltLake City, UT, USA: IEEE Computer Society Press,2008:123-134.
    [148] Liu F, Chaudhary V. Extending OpenMP for heterogeneous chipmultiprocessors[C]. Proceedings of International Conference on ParallelProcessing, Kaohsiung, Taiwan: IEEE Computer Society Press,2003:161-168.
    [149] Jeun W C, Ha S. Effective OpenMP implementation and translation formultiprocessor system-on-chip without using OS[C]. Proceedings of the2007Asia and South Pacific Design Automation Conference, Yokohama: IEEEComputer Society Press,2007:44-49.
    [150] Marongiu A, Benini L. Efficient OpenMP support and extensions for MPSoCswith explicitly managed memory hierarchy[C]. Proceedings of the conferenceon Design, automation and test in Europe, Nice, France: IEEE ComputerSociety Press,2009:809-814.
    [151] Wang P H, Collins J D, Chinya G N, et al. EXOCHI: architecture andprogramming environment for a heterogeneous multi-core multithreadedsystem[J]. ACM SIGPLAN Notices,2007,42(6):156-166.
    [152] Dolbeau R, Bihan S, Bodin F. HMPP: A hybrid multi-core parallelprogramming environment[C]. Workshop on General Purpose Processing onGraphics Processing Units (GPGPU2007).2007:1-5.
    [153] PGI Fortran&C Accelerator Compilers and Programming Model[R/OL].http://www.pgroup.com/lit/pgi_whitepaper_accpre.pdf.
    [154] Bradley C, Gaster B R. Exploiting loop-level parallelism for SIMD arraysusing OpenMP[M]. Springer Berlin Heidelberg,2008:89-100.
    [155] Chapman B, Huang L, Biscondi E, et al. Implementing OpenMP on a highperformance embedded multicore MPSoC[C]. Proceedings of IEEEInternational Symposium on Parallel&Distributed Processing, Rome, Italy:IEEE Computer Society Press,2009:1-8.
    [156] McKee S A. Reflections on the memory wall[C]. Proceedings of the1stconference on Computing frontiers, Ischia, Italy: ACM Press,2004:162-167.
    [157] Przybylski S A. Cache and memory hierarchy design: a performance directedapproach[M]. Morgan Kaufmann,1990.
    [158] Van Der Pas R, Performance H. Memory hierarchy in cache-based systems[J].Sun Blueprints,2002:1-28.
    [159]所光,杨学军.面向多线程多道程序的加权共享Cache划分[J].计算机学报,2008,31(11):1938-1947.
    [160] Liu C, Sivasubramaniam A, Kandemir M. Organizing the last line of defensebefore hitting the memory wall for CMPs[C]. Proceedings of IEE Software,IEEE Computer Society Press,2004:176-185
    [161]杨磊,石磊,张铁军.多核系统中共享cache的动态划分[J].微电子学与计算机,2009,26(5):56-59.
    [162] Song Y, Xu R, Wang C, et al. Improving data locality by array contraction[J].Computers, IEEE Transactions on,2004,53(9):1073-1084.
    [163]刘利,陈彧,乔林,等.利用循环分割和循环展开避免Cache代价[J].Journal of Software,2008,19(9):2228-2242.
    [164] Nikolopoulos D S. Code and data transformations for improving shared cacheperformance on SMT processors[C]. Proceedings of the5th InternationalSymposium High Performance Computing, Tokyo-Odaiba, Japan: SpringerBerlin Heidelberg,2003:54-69.
    [165] Sarkar S, Tullsen D M. Compiler techniques for reducing data cache miss rateon a multithreaded architecture[M]. Springer Berlin Heidelberg,2008:353-368.
    [166] Badia R M, Perez J M, Ayguade E, et al. Impact of the memory hierarchy onshared memory architectures in multicore programming models[C].Proceedings of the17th Euromicro International Conference on Parallel,Distributed and Network-based Processing, Weimar, Germany: IEEEComputer Society Press,2009:437-445.
    [167] Asaduzzaman A, Sibai F N, Rani M. Impact of level-2cache sharing on theperformance and power requirements of homogeneous multicore embeddedsystems[J]. Microprocessors and Microsystems,2009,33(5):388-397.
    [168] Zhang E Z, Jiang Y, Shen X. Does cache sharing on modern CMP matter tothe performance of contemporary multithreaded programs[J]. ACM SigplanNotices,2010,45(5):203-212.
    [169] ANSI X3H5. FORTRAN77Binding of X3H5Model for ParallelProgramming Constructs. ANSI, New York,1992:1-11
    [170]陈永健. OpenMP编译与优化技术研究[D].清华大学.2004.
    [171] Pressel D M. The Scalability of Loop-Level Parallelism[R]. ARMYRESEARCH LAB ABERDEEN PROVING GROUND MD,2001.
    [172] Magnusson P S, Christensson M, Eskilson J, et al. Simics: A full systemsimulation platform[J]. Computer,2002,35(2):50-58.
    [173] Intel. Threading methodology: Principles and practices[T/OL].www.intel.com/cd/ids/developer/asmona/eng/219349.htm,2003.
    [174] van der Pas R, Copty N. The OMPlab on sun systems[J]. Proc. of IWOMP,2005,5.
    [175] Nieplocha J, Márquez A, Feo J, et al. Evaluating the potential ofmultithreaded platforms for irregular scientific computations[C]. Proceedingsof the4th international conference on Computing frontiers, Ischia, Italy: ACMPress,2007:47-58.
    [176] Saini S, Chang J, Hood R, et al. A scalability study of columbia using the nasparallel benchmarks[J]. Journal of Comput. Methods in Sci. and Engr,2006.
    [177] Snavely A, Tullsen D M, Voelker G. Symbiotic jobscheduling with prioritiesfor a simultaneous multithreading processor[J]. ACM SIGMETRICSPerformance Evaluation Review,2002,30(1):66-76.
    [177] McGregor R L, Antonopoulos C D, Nikolopoulos D S. Scheduling algorithmsfor effective thread pairing on hybrid multiprocessors[C]. Proceedings of the19th IEEE International Symposium on Parallel and Distributed Processing,Denver, CO: IEEE Computer Society Press,2005:28a-28a.
    [179] Dhiman G, Kontorinis V, Tullsen D, et al. Dynamic workload characterizationfor power efficient scheduling on CMP systems[C]. Proceedings ofACM/IEEE International Symposium on Low-Power Electronics and Design(ISLPED), Austin, TX, USA: IEEE Computer Society Press,2010:437-442.
    [180] Merkel A, Bellosa F. Memory-aware scheduling for energy efficiency onmulticore processors[C]. Workshop on Power Aware Computing and Systems,HotPower’08, San Diego, CA, USA: ACM Press,2008.
    [181] Meng J, Sheaffer J W, Skadron K. Exploiting inter-thread temporal localityfor chip multithreading[C]. Proceedings of IEEE International Symposium onParallel&Distributed Processing (IPDPS), Atlanta, Georgia, USA: IEEEComputer Society Press,2010:1-12.
    [182] Muralidhara S P, Kandemir M, Raghavan P. Intra-application cachepartitioning[C]. Proceedings of IEEE International Symposium on Parallel&Distributed Processing (IPDPS), Atlanta, Georgia, USA: IEEE ComputerSociety Press,2010:1-12.
    [183] Eyerman S, Eeckhout L. Modeling critical sections in Amdahl's law and itsimplications for multicore design[J]. ACM SIGARCH Computer ArchitectureNews,2010,38(3):362-370.
    [184] Woo S C, Ohara M, Torrie E, et al. The SPLASH-2programs:Characterization and methodological considerations[J]. ACM SIGARCHComputer Architecture News,1995,23(2):24-36.
    [185] DeRose L, Homer B, Johnson D. Detecting application load imbalance onhigh end massively parallel systems[M]. Euro-Par2007Parallel Processing.Springer Berlin Heidelberg,2007:150-159.
    [186] Schloegel K, Karypis G, Kumar V. Parallel multilevel algorithms for multi-constraint graph partitioning[C]. Proceedings of the Euro-Par2000ParallelProcessing, Munich, Germany: Springer Berlin Heidelberg,2000:296-310.
    [187] Boneti C, Gioiosa R, Cazorla F J, et al. Balancing HPC applications throughsmart allocation of resources in MT processors[C]. Proceedings of IEEEInternational Symposium on Parallel and Distributed Processing, Miami,Florida, USA: IEEE Computer Society Press,2008:1-12.
    [188] Boneti C, Gioiosa R, Cazorla F J, et al. A dynamic scheduler for balancingHPC applications[C]. Proceedings of the2008ACM/IEEE conference onSupercomputing, Piscataway, NJ, USA: IEEE Computer Society Press,2008:41.
    [189] Oh J, Hughes C J, Venkataramani G, et al. LIME: a framework for debuggingload imbalance in multi-threaded execution[C]. Proceedings of the33rdInternational Conference on Software Engineering, Waikiki, HI, USA: ACMPress,2011:201-210.
    [190] Jejurikar R, Pereira C, Gupta R. Leakage aware dynamic voltage scaling forreal-time embedded systems[C]. Proceedings of Design AutomationConference, San Diego, CA, USA: ACM Press,2004:275-280.
    [191] Hopfield J J, Tank D W.“Neural” computation of decisions in optimizationproblems[J]. Biological cybernetics,1985,52(3):141-152.
    [192] Martin M M K, Sorin D J, Beckmann B M, et al. Multifacet's generalexecution-driven multiprocessor simulator (GEMS) toolset[J]. ACMSIGARCH Computer Architecture News,2005,33(4):92-99.
    [193] Brooks D, Tiwari V, Martonosi M. Wattch: a framework for architectural-levelpower analysis and optimizations[J]. ACM SIGARCH Computer ArchitectureNews,2000,28(2):83-94.
    [194] Chen J, Dubois M, Stenstrom P. SimWattch and learn[J]. Potentials, IEEE,2009,28(1):17-23.
    [195] Wu Q, Martonosi M, Clark D W, et al. A dynamic compilation framework forcontrolling microprocessor energy and performance[C]. Proceedings of the38th annual IEEE/ACM International Symposium on Microarchitecture,Barcelona, Spain: IEEE Computer Society Press,2005:271-282.
    [196] Kimura H, Sato M, Hotta Y, et al. Emprical study on reducing energy ofparallel programs using slack reclamation by dvfs in a power-scalable highperformance cluster[C]. Proceedings of IEEE International Conference onCluster Computing, Barcelona, Spain: IEEE Computer Society Press,2006:1-10.
    [197] Rajamani K, Hanson H, Rubio J, et al. Application-aware powermanagement[C]. Proceedings of IEEE International Symposium on WorkloadCharacterization, San Jose, California, USA: IEEE Computer Society Press,2006:39-48.
    [198] Dhiman G, Rosing T S. Dynamic voltage frequency scaling for multi-taskingsystems using online learning[C]. Proceedings of the2007internationalsymposium on Low power electronics and design, Portland, OR, USA: ACMPress,2007:207-212.
    [199] Isci C, Buyuktosunoglu A, Cher C Y, et al. An analysis of efficient multi-coreglobal power management policies: Maximizing performance for a givenpower budget[C]. Proceedings of the39th Annual IEEE/ACM InternationalSymposium on Microarchitecture, Orlando, FL, USA: IEEE ComputerSociety Press,2006:347-358.
    [200] Rakvic R, Cai Q, González J, et al. Thread-management techniques tomaximize efficiency in multicore and simultaneous multithreadedmicroprocessors[J]. ACM Transactions on Architecture and CodeOptimization (TACO),2010,7(2):9.
    [201] Donald J, Martonosi M. Techniques for multicore thermal management:Classification and new exploration[J]. ACM SIGARCH ComputerArchitecture News,2006,34(2):78-88.
    [202] Thoziyoor S, Muralimanohar N, Jouppi N P. CACTI5.0[R]. HP Laboratories,Technical Report,2007.
