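Research on Key Techniques for Software Low-Power Optimization of Large-Scale Heterogeneous Parallel Systems (大规模异构并行系统软件低功耗优化关键技术研究)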


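English title: Research on Software Low-Power Optimization for Large-scale Heterogeneous Parallel System
Author: Wang Guibin (王桂彬)
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 异构并行系统 (heterogeneous parallel system) ; GPU ; 低功耗 (low power) ; 优化技术 (optimization techniques)
English keywords: Heterogeneous System ; GPU ; Low power ; Optimization
Degree year: 2011
Advisor: Yang Xuejun (杨学军)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2011-09-01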

Abstract

Green computing is one of the most actively discussed topics in high-performance computing. Lowering system power consumption and improving energy efficiency are essential if supercomputers are to keep scaling to larger systems. Heterogeneous parallel systems have become an important trend in high-performance computer design: compared with traditional homogeneous parallel systems, a heterogeneous system that integrates dedicated accelerators offers higher peak processing speed and higher peak energy efficiency. However, because heterogeneous processing units differ in processing speed and power consumption, existing power optimization methods designed for homogeneous systems cannot be applied efficiently to heterogeneous systems. This thesis studies power optimization for heterogeneous systems. The main work and contributions are:
     1. A power-aware parallel loop scheduling method for heterogeneous systems (Chapter 2)
     Parallel loops are the main optimization target in scientific and engineering programs. Taking OpenMP-like parallel programs as the object of study, the thesis combines parallel loop scheduling on heterogeneous systems with processor dynamic voltage and frequency scaling (DVFS) to reduce system power consumption under a performance constraint. It first establishes a basic model of the power-aware parallel loop scheduling problem for heterogeneous systems. It then derives analytically a lower bound on the energy consumption of parallel loop scheduling on heterogeneous systems, which can be used to evaluate the efficiency of a power optimization method. Finally, it formulates the scheduling problem as a general integer programming problem and gives a method for solving it.
     2. A power-aware multi-section frequency scaling and task partitioning method for heterogeneous systems (Chapter 3)
     A parallel program generally consists of several sequential sections and parallel sections. Depending on whether the parallel sections are executed in parallel by heterogeneous processors, the thesis divides heterogeneous parallel programs into homogeneous computing-section programs and heterogeneous computing-section programs. For homogeneous computing-section programs, it first establishes the relationship between energy consumption and execution time for each computing section, then derives the condition under which a multi-section program reaches minimum energy under a total execution time constraint, and gives an energy-optimal algorithm for selecting the operating frequency of each section. For heterogeneous computing-section programs, it first derives the condition under which heterogeneous parallel processing reaches minimum energy within a single parallel section under a time constraint, and from that establishes the relationship between energy consumption and execution time for each section. Given an execution time constraint for the whole program, the multi-section energy problem is described as a general multivariable extremum problem, and a heuristic algorithm based on steepest descent is given to solve it.
     3. A communication-aware whole-program energy optimization method for heterogeneous systems (Chapter 4)
     In current heterogeneous parallel systems, the main processor and the accelerators are mostly connected through the system bus, so offloading a computation to an accelerator inevitably incurs non-negligible communication overhead. The computation energy and the communication energy introduced by the accelerator should therefore be considered together in order to minimize total system energy. The thesis proposes two methods that optimize whole-system energy under a performance constraint: a static method based on integer linear programming and a dynamic method based on a genetic algorithm. The static method formulates parallel task partitioning and scheduling, together with processor frequency selection, as an integer linear programming problem and gives a method for finding its optimal solution. The dynamic method uses the program's execution history at run time and repeatedly applies a communication-aware task partitioning algorithm and a dynamic frequency scaling algorithm to optimize energy consumption online.
     4. An application-aware peak power management method for heterogeneous systems (Chapter 5)
     As system power consumption keeps growing, power is not only an optimization target but increasingly an important constraint on system design and implementation. For the execution model of multiple concurrent programs on heterogeneous parallel systems, the thesis proposes a hierarchical peak power management strategy that aims to optimize overall system performance within the system power budget. It first abstracts the execution model of current heterogeneous parallel systems and proposes a system power management framework that integrates three levels of power control. At the heterogeneous processing engine level, it proposes an application-aware peak power management method: it derives analytically the condition under which heterogeneous processors reach the best performance under a given power constraint and, based on that result, gives a power-constrained parallel task partitioning algorithm that coordinates task partitioning with DVFS to optimize heterogeneous parallel processing. At the heterogeneous processing group level, it proposes a critical-thread-based power partitioning strategy that allocates power preferentially to threads on the critical path. At the system level, it establishes an energy-efficiency evaluation method for heterogeneous processing groups and uses it as the basis for power partitioning, improving overall system efficiency while maintaining fairness among concurrent applications.
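
The first contribution above (power-aware parallel loop scheduling combined with DVFS under a performance constraint) can be illustrated with a minimal sketch. It is not the thesis's model or solver: the processor parameters, the cubic dynamic-power model, the rule that energy is counted only while a processor is busy, and the names `CPU`, `ACC`, `run_cost`, and `best_schedule` are all assumptions made for illustration, and exhaustive search stands in for the integer programming formulation the abstract mentions.

```python
from itertools import product

# Hypothetical processor parameters, chosen only for illustration
# (not taken from the thesis).
CPU = {"cycles_per_iter": 4.0e6, "freqs_hz": (1.0e9, 1.5e9, 2.0e9),
       "static_w": 10.0, "k": 5.0e-27}
ACC = {"cycles_per_iter": 0.5e6, "freqs_hz": (0.5e9, 0.75e9, 1.0e9),
       "static_w": 30.0, "k": 1.0e-25}


def run_cost(proc, iterations, freq_hz):
    """Return (seconds, joules) for running `iterations` at `freq_hz`."""
    seconds = iterations * proc["cycles_per_iter"] / freq_hz
    # Assumed power model: static power plus dynamic power growing as f^3.
    watts = proc["static_w"] + proc["k"] * freq_hz ** 3
    return seconds, watts * seconds


def best_schedule(total_iters, deadline_s, cpu=CPU, acc=ACC):
    """Search every split and frequency pair; return None if infeasible."""
    best = None
    for n_cpu in range(total_iters + 1):
        n_acc = total_iters - n_cpu
        for f_cpu, f_acc in product(cpu["freqs_hz"], acc["freqs_hz"]):
            t_cpu, e_cpu = run_cost(cpu, n_cpu, f_cpu)
            t_acc, e_acc = run_cost(acc, n_acc, f_acc)
            finish = max(t_cpu, t_acc)  # both processors run concurrently
            if finish > deadline_s:
                continue  # violates the performance constraint
            energy = e_cpu + e_acc  # assumption: energy counted only while busy
            if best is None or energy < best["energy_j"]:
                best = {"n_cpu": n_cpu, "n_acc": n_acc,
                        "f_cpu_hz": f_cpu, "f_acc_hz": f_acc,
                        "time_s": finish, "energy_j": energy}
    return best


if __name__ == "__main__":
    print(best_schedule(total_iters=1000, deadline_s=0.5))
```

Exhaustive search is only practical for a toy instance like this one. With more processors, more frequency levels, or finer-grained iteration chunks, the search space grows quickly, which is one reason to hand the same decision variables to an integer programming solver instead.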

引文

[1] Hennessy J L, Patterson D A. Computer architecture (4th ed.): a quantitative ap-proach [M]. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007.
    [2] Hsu C-h, Feng W-c. A Power-Aware Run-Time System for High-PerformanceComputing [C]. In Proceedings of the 2005 ACM/IEEE conference on Supercom-puting. Washington, DC, USA, 2005.
    [3] Scott D. HW & SW Challenges and Trends to Reach Exascale [C]. In HPCChina’09: High Performance Computing of China. 2009.
    [4]王涛.高性能计算在基础科学中的应用——从上海超级计算中心科学计算应用谈起[J].高性能计算发展与应用. 2009, 29: 3–7.
    [5] Top500 SuperComputer List. http://www.top500.org.
    [6] Panda P, Silpa B, Shrivastava A, et al. Power-efficient System Design [M].Springer, 2010.
    [7] Williams S, Shalf J, Oliker L, et al. The potential of the cell processor for scientificcomputing[C].InProceedingsofthe3rdconferenceonComputingfrontiers.NewYork, NY, USA, 2006: 9–20.
    [8] ClearSpeed Processor. http://www.clearspeed.com/. 2011.
    [9] OwensJD,LuebkeD,GovindarajuN,etal.ASurveyofGeneral-PurposeCompu-tationonGraphicsHardware[J].ComputerGraphicsForum.2007,26(1):80–113.
    [10] Green500 SuperComputer List. http://www.green500.org/lists/2009/11/top/list.php/. Nov 2009.
    [11] Conway S. Multicore Processing: Breaking through the Programming Wall.http://www.scientificcomputing.com/.
    [12] AMD. ATI Stream SDK User Guide V1.4 beta. http://developer.amd.com/gpu/ATIStreamSDK/ATIStreamSDKv1.4Beta/Pages/default.aspx. April 2009.
    [13] NVIDIA. Compute Unified Device Architecture Programming GuideV2.1Beta [J]. 2009.
    [14] Wikipedia. Random-access memory. 2009.9.14. http://en.wikipedia.org/wiki/Random-access_memory#Memory_wall.
    [15] Yi Y, Ping X, Jingfei K, et al. A GPGPU compiler for memory optimization andparallelism management [C]. In Proceedings of the 2010 ACM SIGPLAN confer-ence on Programming language design and implementation. New York, NY, USA,2010: 86–97.
    [16] ByunghyunJ,DanaS,PerhaadM,etal.ExploitingMemoryAccessPatternstoIm-prove Memory Performance in Data-Parallel Architectures [J]. IEEE Transactionson Parallel and Distributed Systems. 2011, 22: 105–118.
    [17] Yelick K. Ten ways to waste a parallel computer [C]. In ISCA’09: Proceedingsof the 36th annual international symposium on Computer architecture. New York,NY, USA, 2009: 1.
    [18] Turek D. The Strategic Future Based on High Performance Computing - the Pushto Exascale [R]. 2009.
    [19] Geist A. Paving the Roadmap to Exascale [J]. SciDACReview. 2010, 16: 52–59.
    [20] Los Alamos National Laboratory. Operational Data to Support and En-able Computer Science Research. http://institute.lanl.gov/data/lanldata.shtml.
    [21] Wu M, Sun X-H, Jin H. Performance under failures of high-end computing [C]. InProceedings of the 2007 ACM/IEEE conference on Supercomputing. New York,NY, USA, 2007: 1～11.
    [22] RaghavendraR,RanganathanP,TalwarV,etal.No”power”struggles:coordinatedmulti-level power management for the data center [C]. In Proceedings of the 13thinternational conference on Architectural support for programming languages andoperating systems. New York, NY, USA, 2008: 48–59.
    [23] Borkar S. Low power design challenges for the decade (invited talk) [C]. In Pro-ceedings of the 2001 Asia and South Pacific Design Automation Conference. NewYork, NY, USA, 2001: 293–296.
    [24] Chandrakasan A P, Brodersen R W. Minimizing power consumption in digital C-MOS circuits [M]. In Proceedings of the IEEE. 1995: 498–523.
    [25] Thompson S, Packan P, Bohr M. MOS Scaling: Transistor Challenges for the 21stCentury [J]. Intel Technology Journal. Q3’98.
    [26] Jung S-O, Kim K-W, Kang S-M. Low-swing clock domino logic incorporatingdual supply and dual threshold voltages [C]. In Proceedings of the 39th annualDesign Automation Conference. New York, NY, USA, 2002: 467–472.
    [27] Amelifard B, Fallah F, Pedram M. Low-power fanout optimization using multiplethreshold voltage inverters [C]. In Proceedings of the 2005 international sympo-sium on Low power electronics and design. New York, NY, USA, 2005: 95–98.
    [28] Calhoun B H, Chandrakasan A. Characterizing and modeling minimum energyoperation for subthreshold circuits [C]. In Proceedings of the 2004 internationalsymposium on Low power electronics and design. New York, NY, USA, 2004:90–95.
    [29] Donno M, Ivaldi A, Benini L, et al. Clock-tree power optimization based on RTLclock-gating [C]. In Proceedings of the 40th annual Design Automation Confer-ence. New York, NY, USA, 2003: 622–627.
    [30] Kapur P, Chandra G, Saraswat K C. Power estimation in global interconnects andits reduction using a novel repeater optimization methodology [C]. In Proceedingsof the 39th annual Design Automation Conference. New York, NY, USA, 2002:461–466.
    [31] Wason V, Banerjee K. A probabilistic framework for power-optimal repeater in-sertion in global interconnects under parameter variations [C]. In Proceedings ofthe 2005 international symposium on Low power electronics and design. New Y-ork, NY, USA, 2005: 131–136.
    [32] Kim N S, Austin T, Blaauw D, et al. Leakage Current: Moore’s Law Meets StaticPower [J]. Computer. 2003, 36: 68–75.
    [33] Kim N S, Blaauw D, Mudge T. Leakage Power Optimization Techniques for UltraDeepSub-MicronMulti-LevelCaches[C].InProceedingsofthe2003IEEE/ACMinternational conference on Computer-aided design. Washington, DC, USA, 2003:627–632.
    [34] Ananthan H, Kim C H, Roy K. Larger-than-vdd forward body bias in sub-0.5VnanoscaleCMOS[C].InProceedingsofthe2004internationalsymposiumonLowpower electronics and design. New York, NY, USA, 2004: 8–13.
    [35] Rao R M, Burns J L, Devgan A, et al. Efficient techniques for gate leakage esti-mation [C]. In Proceedings of the 2003 international symposium on Low powerelectronics and design. New York, NY, USA, 2003: 100–103.
    [36] PiguetOC,PiguetPC,RenaudinM,etal.SpecialSessiononLow-PowerSystemson Chips (SOCs). 2001.
    [37] Powell M D, Schuchman E, Vijaykumar T N. Balancing Resource Utilization toMitigate Power Density in Processor Pipelines [C]. In Proceedings of the 38thannual IEEE/ACM International Symposium on Microarchitecture. Washington,DC, USA, 2005: 294–304.
    [38] Ku J C, Ozdemir S, Memik G, et al. Thermal Management of On-Chip CachesThrough Power Density Minimization [C]. In Proceedings of the 38th annualIEEE/ACMInternationalSymposiumonMicroarchitecture.Washington,DC,US-A, 2005: 283–293.
    [39] International Technology Roadmap for Semiconductors 2005 Edition [J/OL].ITRS. http://public.itrs.net.
    [40] KumarR.HolisticDesignforMulti-coreArchitectures[D].SanDiego,California,USA: University of California, 2006.
    [41]易会战.低功耗技术研究-体系结构和编译优化[D]. [S. l.]:国防科学技术大学.计算机学院, 2006.
    [42] Tiwari V, Malik S, Wolfe A. Compilation Techniques for Low Energy: AnOverview [C]. In In Proceedings of the 1994 Symposium on Low-Power Elec-tronics. 1994: 38–39.
    [43] Mehta H, Owens R M, Irwin M J, et al. Techniques for low energy software [C]. InProceedings of the 1997 international symposium on Low power electronics anddesign. New York, NY, USA, 1997: 72–75.
    [44] Roy K, Johnson M C. Low power design in deep submicron electronic-s [M] // Nebel W, Mermet J. Norwell, MA, USA: Kluwer Academic Publishers,1997: 1997: 433–460.
    [45] Yang H. Power-aware compilation techniques for high performance proces-sors [D]. Newark, DE, USA: University of Delaware, 2004.
    [46] HsuC-H.Compiler-directeddynamicvoltageandfrequencyscalingforcpupowerand energy reduction [D]. New Brunswick, NJ, USA: Rutgers University, 2003.
    [47] Shrivastava A, Issenin I, Dutt N. Compilation techniques for energy reduction inhorizontally partitioned cache architectures [C]. In Proceedings of the 2005 in-ternational conference on Compilers, architectures and synthesis for embeddedsystems. New York, NY, USA, 2005: 90–96.
    [48] Kremer U, Hicks J, Rehg J. A compilation framework for power and energy man-agement on mobile computers [C]. In Proceedings of the 14th international con-ference on Languages and compilers for parallel computing. Berlin, Heidelberg,2003: 115–131.
    [49] Ge R. Theories and Techniques for Efficient High-End Computing [D]. Blacks-burg, Virginia, USA: Virginia Polytechnic Institute and State University, 2007.
    [50] Curtis-Maury M F. Improving the Efficiency of Parallel Applications on Multi-threaded and Multicore Systems [D]. Blacksburg, Virginia, USA: Virginia Poly-technic Institute and State University, 2008.
    [51] Jayaseelan R. Application-specific Thermal Management of Computer System-s [D]. Singapore: National University of Singapore, 2009.
    [52] Coskun A K. Efficient Thermal Management for Multiprocessor Systems [D]. SanDiego, California, USA: University of California, 2009.
    [53] LiJ,MartinezJF.Power-PerformanceImplicationsofThread-levelParallelismonChip Multiprocessors [C]. In Proceedings of the IEEE International Symposiumon Performance Analysis of Systems and Software, 2005. Washington, DC, USA,2005: 124–134.
    [54] Hsu C-H, Kremer U. The design, implementation, and evaluation of a compileralgorithm for CPU energy reduction [C]. In Proceedings of the ACM SIGPLAN2003 conference on Programming language design and implementation. New Y-ork, NY, USA, 2003: 38–48.
    [55] Saputra H, Kandemir M, Vijaykrishnan N, et al. Energy-conscious compilationbasedonvoltagescaling[C].InProceedingsofthejointconferenceonLanguages,compilers and tools for embedded systems: software and compilers for embeddedsystems. New York, NY, USA, 2002: 2–11.
    [56] Mosse D, Aydin H, Childers B, et al. Compiler-Assisted Dynamic Power-AwareScheduling for Real-Time Applications [C]. In Workshop on Compilers and Op-erating Systems for Low Power (COLP). 2000.
    [57] Shin D, Kim J. A profile-based energy-efficient intra-task voltage scheduling al-gorithm for real-time applications [C]. In Proceedings of the 2001 internationalsymposium on Low power electronics and design. New York, NY, USA, 2001:271–274.
    [58] Xie F, Martonosi M, Malik S. Compile-time dynamic voltage scaling settings: op-portunities and limits [C]. In Proceedings of the ACM SIGPLAN 2003 confer-ence on Programming language design and implementation. New York, NY, USA,2003: 49–62.
    [59]赵荣彩.多线程低功耗编译优化技术研究[D]. [S. l.]:中国科学院计算技术研究所, 2002.
    [60]陈娟.低功耗软件优化技术研究[D]. [S. l.]:国防科学技术大学.计算机学院,2007.
    [61] Kadayif I, Kandemir M, Karakoy M. An energy saving strategy based on adap-tive loop parallelization [C]. In DAC’02: Proceedings of the 39th annual DesignAutomation Conference. New York, NY, USA, 2002: 195–200.
    [62] KadayifI,KandemirM,KolcuI.ExploitingProcessorWorkloadHeterogeneityforReducing Energy Consumption in Chip Multiprocessors [J]. Design, Automationand Test in Europe Conference and Exhibition. 2004, 2: 1158–1163 Vol.2.
    [63] Kim W, Gupta M S, yeon Wei G, et al. System level analysis of fast, per-core D-VFSusingon-chipswitchingregulators[C].InInternationalSymposiumonHigh-Performance Computer Architecture. Salt Lake City, UT, 2008: 123–134.
    [64] Woo D H, Lee H-H S. Extending Amdahl’s Law for Energy-Efficient Computingin the Many-Core Era [J]. Computer. 2008, 41: 24–31.
    [65] Li J, Martinez J F. Dynamic power-performance adaptation of parallel computa-tion on chip multiprocessors [J]. High-Performance Computer Architecture, Inter-national Symposium on. 2006, 0: 77–87.
    [66] Curtis-Maury M, Shah A, Blagojevic F, et al. Prediction models for multi-dimensional power-performance optimization on many cores [C]. In Proceedingsofthe17thinternationalconferenceonParallelarchitecturesandcompilationtech-niques. New York, NY, USA, 2008: 250–259.
    [67] Curtis-Maury M, Blagojevic F, Antonopoulos C D, et al. Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes [J]. IEEE Transactionson Parallel and Distributed Systems. 2007, 19: 1396–1410.
    [68] LiK.PerformanceAnalysisofPower-AwareTaskSchedulingAlgorithmsonMul-tiprocessorComputerswithDynamicVoltageandSpeed[J].IEEETransactionsonParallel and Distributed Systems. 2008, 19 (11): 1484–1497.
    [69] LiK.Energyefficientschedulingofparalleltasksonmultiprocessorcomputers[J].The Journal of Supercomputing. 2010: 1–25.
    [70] ChoS,MelhemRG.OntheInterplayofParallelization,ProgramPerformanceandEnergy Consumption [J]. IEEE Transactions on Parallel and Distributed Systems.2010, 21: 342–353.
    [71] Korthikanti V A, Agha G. Analysis of Parallel Algorithms for Energy Conserva-tion in Scalable Multicore Architectures [C]. In Proceedings of the 2009 Interna-tional Conference on Parallel Processing. Washington, DC, USA, 2009: 212–219.
    [72] Korthikanti V A, Agha G. Towards optimizing energy costs of algorithms forshared memory architectures [C]. In Proceedings of the 22nd ACM symposium onParallelism in algorithms and architectures. New York, NY, USA, 2010: 157–165.
    [73] Cebrian J M, Aragon J L, Garcia J M, et al. Efficient microarchitecture policiesfor accurately adapting to power constraints [C]. In Proceedings of the 2009 IEEEInternational Symposium on Parallel&Distributed Processing. Washington, DC,USA, 2009: 1–12.
    [74] Donald J, Martonosi M. Techniques for Multicore Thermal Management: Classi-fication and New Exploration [C]. In Proceedings of the 33rd annual internationalsymposium on Computer Architecture. Washington, DC, USA, 2006: 78–88.
    [75] Simunic T, Benini L, Acquaviva A, et al. Dynamic voltage scaling and powermanagement for portable systems [C]. In Proceedings of the 38th annual DesignAutomation Conference. New York, NY, USA, 2001: 524–529.
    [76] Isci C, Buyuktosunoglu A, Cher C-Y, et al. An Analysis of Efficient Multi-CoreGlobal Power Management Policies: Maximizing Performance for a Given PowerBudget [C]. In Proceedings of the 39th Annual IEEE/ACM International Sympo-sium on Microarchitecture. Washington, DC, USA, 2006: 347–358.
    [77] WangX,ChenM.Cluster-levelfeedbackpowercontrolforperformanceoptimiza-tion [C]. In 17th International Conference on High-Performance Computer Archi-tecture. 2008: 101–110.
    [78] Teodorescu R, Torrellas J. Variation-Aware Application Scheduling and PowerManagement for Chip Multiprocessors [C]. In Proceedings of the 35th Annual In-ternational Symposium on Computer Architecture. Washington, DC, USA, 2008:363–374.
    [79] Meng K, Joseph R, Dick R P, et al. Multi-Optimization Power Management forChipMultiprocessors[C].InPACT’08:Proceedingsofthe15thinternationalcon-ference on Parallel architectures and compilation techniques. New York, NY, US-A, 2008: 177–186.
    [80] Sartori J, Kumar R. Distributed peak power management for many-core architec-tures [C]. In Proceedings of the Conference on Design, Automation and Test inEurope. 3001 Leuven, Belgium, Belgium, 2009: 1556–1559.
    [81] Winter J A, Albonesi D H, Shoemaker C A. Scalable thread scheduling and globalpower management for heterogeneous many-core architectures [C]. In Proceed-ings of the 19th international conference on Parallel architectures and compilationtechniques. New York, NY, USA, 2010: 29–40.
    [82] Li J, Huang W, Lefurgy C, et al. Power shifting in Thrifty Interconnection Net-work. [C]. In International Symposium on High Performance Computer Architec-ture. 2011: 156–167.
    [83] Juan M Cebrián J L A, Kaxiras S. Power Token Balancing: Adapting CMPs toPower Constraints for Parallel Multithreaded Workloads [C]. In IPDPS’11: Pro-ceedings of IEEE International Symposium on Parallel & Distributed Processing.2009: 1–12.
    [84] MaK,LiX,ChenM,etal.Scalablepowercontrolformany-corearchitecturesrun-ningmulti-threadedapplications[C].InProceedingofthe38thannualinternation-al symposium on Computer architecture. New York, NY, USA, 2011: 449–460.
    [85] Felter W, Rajamani K, Keller T, et al. A performance-conserving approach forreducing peak power consumption in server systems [C]. In Proceedings of the19th annual international conference on Supercomputing. New York, NY, USA,2005: 293–302.
    [86] WangX,MaK,WangY.AchievingFairorDifferentiatedCacheSharinginPower-Constrained Chip Multiprocessors [C]. In Proceedings of the 2010 39th Interna-tional Conference on Parallel Processing. Washington, DC, USA, 2010: 1–10.
    [87] Mishra A K, Srikantaiah S, Kandemir M, et al. CPM in CMPs: Coordinated PowerManagementinChip-Multiprocessors[C].InProceedingsofthe2010ACM/IEEEInternational Conference for High Performance Computing, Networking, Storageand Analysis. Washington, DC, USA, 2010: 1–12.
    [88] Chen M, Wang X, Li X. Coordinating Processor and Main Memory for EfficientServerPowerControl[C].InICS’11:Proceedingsofthe25ndannualinternationalconference on Supercomputing. New York, NY, USA, 2011: 130–140.
    [89] Annavaram M, Grochowski E, Shen J. Mitigating Amdahl’s Law through EPIThrottling [C]. In Proceedings of the 32nd annual international symposium onComputer Architecture. Washington, DC, USA, 2005: 298–309.
    [90] GomaaM,PowellMD,VijaykumarTN.Heat-and-run:leveragingSMTandCMPto manage power density through the operating system [C]. In Proceedings of the13thinternationalconferenceonArchitecturalsupportforprogramminglanguagesand operating systems (ASPLOS-XI). 2004: 260–270.
    [91] Wang Y, Ma K, Wang X. Temperature-constrained power control for chip multi-processors with online model estimation [J]. ACM SIGARCH Computer Archi-tecture News. 2009, 37 (3): 314–324.
    [92] Ge Y, Malani P, Qiu Q. Distributed task migration for thermal management inmany-core systems [C]. In DAC’10: Proceedings of the 47th Annual Design Au-tomation Conference. New York, NY, USA, 2010: 579–584.
    [93] Rangan K K, Wei G-Y, Brooks D. Thread motion: fine-grained power manage-ment for multi-core systems [C]. In Proceedings of the 36th annual internationalsymposium on Computer architecture. New York, NY, USA, 2009: 302–313.
    [94] Coskun A K, Rosing T v, Whisnant K A, et al. Static and dynamic temperature-aware scheduling for multiprocessor SoCs [J]. IEEE Transactions on Very LargeScale Integration Systems. 2008, 16: 1127–1140.
    [95] Kumar R, Farkas K I, Jouppi N P, et al. Single-ISA Heterogeneous Multi-Core Ar-chitectures: The Potential for Processor Power Reduction [C]. In Proceedings ofthe36thannualIEEE/ACMInternationalSymposiumonMicroarchitecture.Wash-ington, DC, USA, 2003: 81–92.
    [96] Kumar R, Tullsen D M, Ranganathan P, et al. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance [C]. In Proceedingsofthe31stannualinternationalsymposiumonComputerarchitecture.Washington,DC, USA, 2004: 64–75.
    [97] KumarR,TullsenDM,JouppiNP,etal.HeterogeneousChipMultiprocessors[J].Computer. 2005, 38: 32–38.
    [98] Kumar R, Tullsen D M, Jouppi N P. Core architecture optimization for heteroge-neouschipmultiprocessors[C].InPACT’06:Proceedingsofthe15thinternationalconference on Parallel architectures and compilation techniques. New York, NY,USA, 2006: 23–32.
    [99] Hill M D, Marty M R. Amdahl’s Law in the Multicore Era [J]. Computer. 2008,41: 33–38.
    [100] Fedorova A, Saez J C, Shelepov D, et al. Maximizing power efficiency with asym-metric multicore systems [J]. Communications of the ACM. 2009, 52: 48–57.
    [101] Sun X-H, Chen Y. Reevaluating Amdahl’s law in the multicore era [J]. Journal ofParallel and Distributed Computing. 2010, 70: 183–188.
    [102] Suleman M A, Mutlu O, Qureshi M K, et al. Accelerating critical section exe-cution with asymmetric multi-core architectures [C]. In Proceeding of the 14thinternational conference on Architectural support for programming languages andoperating systems. New York, NY, USA, 2009: 253–264.
    [103] Huang W, Skadron K, Gurumurthi S, et al. Exploring the Thermal Impact onManycore Processor Performance [C]. In Proceedings of the IEEE SemiconductorThermal Measurement, Modeling, and Management Symposium. 2010: 191–197.
    [104] BalakrishnanS,RajwarR,UptonM,etal.TheImpactofPerformanceAsymmetryin Emerging Multicore Architectures [C]. In Proceedings of the 32nd annual in-ternational symposium on Computer Architecture. Washington, DC, USA, 2005:506–517.
    [105] Chung E S, Milder P A, Hoe J C, et al. Single-Chip Heterogeneous Computing:DoestheFutureIncludeCustomLogic,FPGAs,andGPGPUs?[C].InProceedingsof the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitec-ture. Washington, DC, USA, 2010: 225–236.
    [106] Morad T Y, Weiser U C, Kolodny A, et al. Performance, Power Efficiency andScalability of Asymmetric Cluster Chip Multiprocessors [J]. IEEE Computer Ar-chitecture Letters. 2006, 5: 14–17.
    [107] Chen J. Energy-aware application scheduling on a heterogeneous multi-core sys-tem [C]. In 2008 IEEE International Symposium on Workload Characterization.2008: 5–13.
    [108] Goraczko M, Liu J, Lymberopoulos D, et al. Energy-optimal software partition-ing in heterogeneous multiprocessor embedded systems [C]. In Proceedings ofthe 45th annual Design Automation Conference. New York, NY, USA, 2008:191–196.
    [109] Sharifi S, Coskun A K, Rosing T S. Hybrid dynamic energy and thermal man-agement in heterogeneous embedded multiprocessor SoCs [C]. In Proceedings ofthe 2010 Asia and South Pacific Design Automation Conference. Piscataway, NJ,USA, 2010: 873–878.
    [110] Powermizer Technology. http://www.nvidia.com/object/feature_powermizer.html.
    [111] Powerplay Technology. http://www.amd.com/us/products/technologies/ati-power-play/Pages/ati-power-play.aspx.
    [112] HuangS,XiaoS,FengW.Ontheenergyefficiencyofgraphicsprocessingunitsforscientific computing [C]. In Fifth Workshop on High-Performance, Power-AwareComputing (HPPAC’09). Washington, DC, USA, 2009: 1–8.
    [113] CollangeS,DefourD,TisserandA.PowerConsumptionofGPUsfromaSoftwarePerspective [C]. In ICCS’09: Proceedings of the 9th International Conference onComputational Science. Berlin, Heidelberg, 2009: 914–923.
    [114] Jiao Y, Lin H, Balaji P, et al. Power and Performance Characterization of Com-putational Kernels on the GPU [C]. In Proceedings of the 2010 IEEE/ACM Int’lConference on Green Computing and Communications & Int’l Conference on Cy-ber, Physical and Social Computing. Washington, DC, USA, 2010: 221–228.
    [115] RenD,SudaR.PowerEfficientLargeMatricesMultiplicationbyLoadSchedulingon Multi-core and GPU Platform with CUDA [C]. In Proceedings of the 2009InternationalConferenceonComputationalScienceandEngineering-Volume01.Washington, DC, USA, 2009: 424–429.
    [116] Ren D Q, Suda R. Modeling and Estimation for the Power Consumption of MatrixComputation on Multi-core Platform [C]. In Proceedings of the 2009 Internation-al Joint Conference on Computational Sciences and Optimization - Volume 01.Washington, DC, USA, 2009: 42–46.
    [117] Ren D Q. Algorithm level power efficiency optimization for CPU-GPU process-ing element in data intensive SIMD/SPMD computing [J]. Journal of Parallel andDistributed Computing. 2011, 71: 245–253.
    [118] Ren D-Q, Suda R. Global optimization model on power efficiency of GPU andmulticoreprocessingelementforSIMDcomputingwithCUDA[J].ComputerSci-ence - Research and Development: 1–9.
    [119] Ma X, Dong M, Dong L, et al. Statistical Power Consumption Analysis and Mod-eling for GPU-based Computing [C]. In Proceedings of the 2009 Workshop onPower Aware Computing and Systems. 2009.
    [120] Nagasaka H, Maruyama N, Nukada A, et al. Statistical power modeling of GPUkernels using performance counters [C]. In Proceedings of the International Con-ference on Green Computing. Washington, DC, USA, 2010: 115–122.
    [121] Hong S, Kim H. An analytical model for a GPU architecture with memory-levelandthread-levelparallelismawareness[C].InProceedingsofthe36thInternation-al Symposium on Computer Architecture (ISCA). 2009: 152–163.
    [122] Hong S, Kim H. An integrated GPU power and performance model [C]. In ISCA’10: Proceedings of the 37th annual international symposium on Computer archi-tecture. New York, NY, USA, 2010: 280–289.
    [123] Takizawa H, Sato K, Kobayashi H. SPRAT: Runtime processor selection forenergy-aware computing. [C]. In IEEE International Conference on Cluster Com-puting. 2008: 386–393.
    [124] Hamano T, Toshio E, Satoshi M. Power-aware dynamic task scheduling for het-erogeneous accelerated clusters [C]. In Proceedings of the 2009 IEEE Internation-al Symposium on Parallel&Distributed Processing. Washington, DC, USA, 2009:1–8.
    [125] Luk C-K, Hong S, Kim H. Qilin: exploiting parallelism on heterogeneous multi-processors with adaptive mapping [C]. In In the Proceedings of the 42nd AnnualIEEE/ACM International Symposium on Microarchitecture. New York, NY, USA,2009: 45–55.
    [126] Yang C, Wang F, Du Y, et al. Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing [C]. In IEEE International Conference on Cluster Computing. Los Alamitos, CA, USA, 2010: 19–28.
    [127] Chen L, Villa O, Krishnamoorthy S, et al. Dynamic load balancing on single- and multi-GPU systems [C]. In IEEE International Symposium on Parallel and Distributed Processing. 2010: 1–12.
    [128] Chen J, Dong Y, Yang X-j, et al. A Compiler-Directed Energy Saving Strategy for Parallelizing Applications in On-Chip Multiprocessors [C]. In Proceedings of the 4th International Symposium on Parallel and Distributed Computing. Washington, DC, USA, 2005: 147–154.
    [129] Zhu Y, Magklis G, Scott M L, et al. The Energy Impact of Aggressive Loop Fusion [C]. In PACT’04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA, 2004: 153–164.
    [130] Baskaran M M, Bondhugula U, Krishnamoorthy S, et al. A compiler framework for optimization of affine loop nests for GPGPUs [C]. In ICS’08: Proceedings of the 22nd annual international conference on Supercomputing. New York, NY, USA, 2008: 225–234.
    [131] Dong Y, Chen J, Yang X, et al. Energy-Oriented OpenMP Parallel Loop Scheduling [C]. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. Washington, DC, USA, 2008: 162–169.
    [132] Lakshminarayana N B, Kim H. Understanding performance, power and energy behavior in asymmetric multiprocessors [C]. In Proceedings of the 26th International Conference on Computer Design. 2008: 471–477.
    [133] Burd T D, Brodersen R W. Energy efficient CMOS microprocessor design [C]. In Proceedings of the 28th Hawaii International Conference on System Sciences. Washington, DC, USA, 1995: 288.
    [134] Burd T D, Brodersen R W. Design issues for dynamic voltage scaling [C]. In Proceedings of the 2000 international symposium on Low power electronics and design. New York, NY, USA, 2000: 9–14.
    [135] Yang X, Tang T, Wang G, et al. MPtostream: an OpenMP compiler for CPU-GPU heterogeneous parallel systems [J]. SCIENCE CHINA Information Sciences. 2011: 1–11.
    [136] Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor, White Paper. March, 2004.
    [137] Advanced Configuration and Power Interface. http://www.acpi.info/.
    [138] AMD Display Library. http://developer.amd.com/GPU/ADLSDK/Pages/default.aspx.
    [139] SPEC OpenMP Benchmark Suite. http://www.spec.org/omp/.
    [140] Wang G, Tang T, Fang X, et al. Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+ [C]. In International Conference on Parallel and Distributed Systems. Los Alamitos, CA, USA, 2009: 292–299.
    [141] AMD Corporation. Stream Sample Code. 2009. http://developer.amd.com/SAMPLES/STREAMSHOWCASE/Pages/default.aspx.
    [142] Borkar S. Microarchitecture and Design Challenges for Gigascale Integration [C]. In Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA, 2004: 3–3.
    [143] Che S, Boyer M, Meng J, et al. Rodinia: A benchmark suite for heterogeneous computing [C]. In IEEE International Symposium on Workload Characterization. IISWC’09. 2009: 44–54.
    [144] University of Illinois. Parboil Benchmark suite. http://impact.crhc.illinois.edu/parboil.php.
    [145] Eyerman S, Eeckhout L. Modeling critical sections in Amdahl’s law and its implications for multicore design [C]. In Proceedings of the 37th annual international symposium on Computer architecture. New York, NY, USA, 2010: 362–370.
    [146] The OpenCL Specification. http://www.khronos.org/opencl/. Jun 2010.
    [147] Radulescu A, van Gemund A J. A Low-Cost Approach towards Mixed Task and Data Parallel Scheduling [C]. In Proceedings of 2001 International Conference on Parallel Processing (30th ICPP’01). Washington, DC, USA, 2001: 69–76.
    [148] N’Takpe T, Suter F. Critical Path and Area Based Scheduling of Parallel Task Graphs on Heterogeneous Platforms [C]. In Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1. Washington, DC, USA, 2006: 3–10.
    [149] Hunold S, Rauber T, Suter F. Redistribution aware two-step scheduling for mixed-parallel applications [C]. In IEEE International Conference on Cluster Computing. 2008: 50–58.
    [150] Dutot P-F, N’Takpe T, Suter F, et al. Scheduling Parallel Task Graphs on (Almost) Homogeneous Multicluster Platforms [J]. IEEE Transactions on Parallel and Distributed Systems. 2009, 20: 940–952.
    [151] Choi K, Soma R, Pedram M. Fine-Grained Dynamic Voltage and Frequency Scaling for Precise Energy and Performance Trade-Off Based on the Ratio of Off-Chip Access to On-Chip Computation Times [C]. In Proceedings of the conference on Design, automation and test in Europe - Volume 1. Washington, DC, USA, 2004: 4–9.
    [152] Zomaya A Y, Ward C, Macey B. Genetic Scheduling for Parallel Processor Systems: Comparative Studies and Performance Issues [J]. IEEE Transactions on Parallel and Distributed Systems. 1999, 10: 795–812.
    [153] Page A J, Naughton T J. Dynamic Task Scheduling using Genetic Algorithms for Heterogeneous Distributed Computing [C]. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 6 - Volume 07. Washington, DC, USA, 2005: 189–196.
    [154] Zomaya A Y, Teh Y-H. Observations on Using Genetic Algorithms for Dynamic Load-Balancing [J]. IEEE Transactions on Parallel and Distributed Systems. 2001, 12: 899–911.
    [155] Weaver N, Paxson V, Gonzalez J M. The shunt: an FPGA-based accelerator for network intrusion prevention [C]. In Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays. New York, NY, USA, 2007: 199–206.
    [156] Cai Q, González J, Rakvic R, et al. Meeting points: using thread criticality to adapt multicore hardware to parallel regions [C]. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques. New York, NY, USA, 2008: 240–249.
    [157] Bhattacharjee A, Martonosi M. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors [C]. In Proceedings of the 36th annual international symposium on Computer architecture. New York, NY, USA, 2009: 290–301.
    [158] Intel Corporation. Intel Performance Counter Monitor. http://software.intel.com/en-us/.
