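     Centered on how to develop a high-productivity OpenMP programming environment for large-scale parallel systems, this thesis studies key techniques for implementing OpenMP on large-scale distributed shared memory (DSM) systems, OpenMP language extensions for DSM systems, compiler-directed data prefetching, OpenMP checkpoint/restart techniques, and low-power optimization for OpenMP. It makes the following original contributions: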
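     1. For large-scale parallel computer architectures, the OpenMP parallel compiler CCRG OpenMP was designed and implemented. A placement strategy for OpenMP shared data that coordinates compile time and link time is proposed; it not only removes the need to allocate shared memory explicitly on a distributed operating system, but also gives strong support to data locality optimization for checkpointing. The implementation applies many source-level optimizations to improve program performance. For scientific computing and simulation programs, CCRG OpenMP on our SCCMP system performs comparably to an SGI Altix using the latest Intel 9.1 compiler.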
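     3. A compiler-directed two-stage data prefetching algorithm for OpenMP is proposed, which overcomes the inaccurate prefetching caused by the difference between remote and local memory access latencies on DSM systems. A static performance analysis model was built to evaluate the prefetching algorithm. On the SCCMP system, with the two-stage algorithm, the performance of the swim program from SPEC OMP2001 improves by 14% with 32 threads and by 9% with 64 threads.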
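     4. An OpenMP checkpoint/restart mechanism that coordinates the system level and the application level is established, and a blocking OpenMP checkpoint protocol is designed. Based on this mechanism, a CCRG OpenMP checkpoint/restart system was implemented. The system fully supports the OpenMP 2.0 API and offers good scalability and practical value.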
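     5. Power optimization techniques for OpenMP are studied. Three low-power optimization methods and their implementation algorithms are proposed for parallel systems whose nodes support dynamic voltage scaling (DVS). For power optimization based on worst-case execution time, a method for worst-case execution time analysis and DVS of OpenMP programs based on synchronization (barrier) sections is proposed. The method uses the synchronization section as the unit of analysis and voltage adjustment, effectively avoiding the impact of barrier-induced load imbalance on program execution and power consumption. An energy consumption analysis model is built, and simulation shows that the power optimization techniques for OpenMP parallel applications effectively reduce the energy a parallel system consumes when running OpenMP programs.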
High-end computing has shifted its goal from the pure pursuit of high performance to high-productivity systems, which means improving performance, programmability, portability and robustness while reducing the cost of developing, running and maintaining systems. High-productivity computer systems must be supported by high-productivity programming environments. Furthermore, the applications targeting future teraflops and petaflops systems are multidisciplinary and multiscale, and their complexity requires domain experts and software scientists from different disciplines to work together on development, management and maintenance. This kind of collaboration places higher demands on the performance, programmability, portability and fault tolerance of programming environments. With features such as easy programmability, support for incremental design, good maintainability and high portability, OpenMP will be the mainstream parallel programming language in the long run.
     Focusing on the development of a high-productivity OpenMP programming environment for large-scale parallel systems, this thesis systematically investigates key techniques for implementing OpenMP on large-scale distributed shared memory (DSM) systems, DSM-oriented OpenMP extensions, compiler-directed data prefetching, OpenMP checkpoint/restart, OpenMP-oriented low-power optimization, and related OpenMP techniques. The main contributions of the thesis are as follows.
     1. CCRG OpenMP, an OpenMP parallel compiler, has been designed and implemented for large-scale parallel computer systems. We present an OpenMP shared data placement strategy that coordinates compile time and link time. It not only removes the need to allocate shared memory explicitly on a distributed operating system, but also supports data locality optimization for checkpointing. Several source-level optimization techniques are used to improve performance. Experiments show that the performance of CCRG OpenMP on our SCCMP system is comparable to that of an SGI Altix using the Intel 9.1 compiler.
     2. Two OpenMP directives, BARRIER (thread_id) and ALLREDUCTION, are proposed to reduce the overhead of global operations such as barrier and reduction, which grows rapidly as OpenMP parallel programs scale up. Implementation algorithms for the new directives are given. Experiments show that for a real scientific application, Plasma Physics, performance improves by 76% with 64 threads.
     3. A compiler-directed two-stage data prefetching algorithm is proposed to overcome the prefetching inaccuracy caused by the difference between remote and local memory access latencies on DSM systems. The algorithm is evaluated with a static performance analysis model. Experiments on the SCCMP system show that the algorithm improves the performance of swim from SPEC OMP2001 by 14% with 32 threads and by 9% with 64 threads.
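To illustrate why a single prefetch distance cannot suit both local and remote accesses, the following minimal sketch applies the common rule of thumb that the prefetch distance is the access latency divided by the time of one loop iteration, rounded up. It is not the two-stage algorithm of the thesis; the latencies, the loop-body time and all function names are hypothetical values chosen only for illustration.

```python
import math


def prefetch_distance(latency_cycles: int, loop_body_cycles: int) -> int:
    """Number of iterations ahead a prefetch must be issued so that the
    data arrives before it is used (rule of thumb: ceil(latency / body))."""
    if latency_cycles <= 0 or loop_body_cycles <= 0:
        raise ValueError("latency and loop body time must be positive")
    return math.ceil(latency_cycles / loop_body_cycles)


def stall_cycles(latency_cycles: int, loop_body_cycles: int, distance: int) -> int:
    """Latency left uncovered per access when prefetching `distance`
    iterations ahead; zero means the prefetch fully hides the latency."""
    return max(0, latency_cycles - distance * loop_body_cycles)


# Hypothetical machine parameters (assumptions, not measurements).
LOCAL_LATENCY = 200     # cycles for a local memory access
REMOTE_LATENCY = 1200   # cycles for a remote memory access
LOOP_BODY = 50          # cycles per loop iteration

local_d = prefetch_distance(LOCAL_LATENCY, LOOP_BODY)
remote_d = prefetch_distance(REMOTE_LATENCY, LOOP_BODY)

# A distance tuned for local memory leaves most of a remote access exposed.
exposed_remote = stall_cycles(REMOTE_LATENCY, LOOP_BODY, local_d)
```

Under these assumed numbers, a distance tuned for local memory hides only a small part of a remote access, while a distance tuned for remote memory issues local prefetches far earlier than needed. This is the kind of mismatch the abstract describes.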
     4. We present an OpenMP checkpoint/restart mechanism that coordinates the system level and the application level, together with a blocking OpenMP checkpoint protocol. Based on this mechanism, a CCRG OpenMP checkpoint/restart system has been implemented. The system fully supports the OpenMP 2.0 API and offers good scalability and applicability.
     5. Energy optimization techniques based on the OpenMP programming model are studied. Three energy optimization methods and their implementations are presented for parallel systems whose nodes have dynamic voltage scaling (DVS) capabilities. For energy optimization based on worst-case execution time (WCET), we propose barrier-section-based WCET analysis and DVS methods. These methods use the barrier section as the unit of analysis and voltage scaling, which avoids the impact of barrier-induced load imbalance on program execution and energy consumption. An energy consumption analysis model is built, and simulation shows that these techniques effectively reduce the energy that parallel systems consume when running OpenMP programs.
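The idea of scaling voltage per barrier section can be sketched with a simple energy model. The code below is a hedged illustration, not the thesis's algorithm or its analysis model: it assumes continuously adjustable frequencies, supply voltage proportional to frequency (so dynamic energy per cycle scales with the square of the frequency), and ignores static and idle power. The per-thread WCET cycle counts and the function names are hypothetical.

```python
def scaled_frequencies(wcet_cycles, f_max=1.0):
    """Slow each thread so that it reaches the barrier exactly when the
    slowest thread (running at f_max) does."""
    longest = max(wcet_cycles)
    if longest <= 0:
        raise ValueError("at least one thread must have work")
    return [f_max * c / longest for c in wcet_cycles]


def dynamic_energy(cycles, freqs):
    """Relative dynamic energy under the assumption V proportional to f:
    energy per cycle ~ V^2 ~ f^2."""
    return sum(c * f ** 2 for c, f in zip(cycles, freqs))


# Hypothetical WCET (in cycles) of four threads within one barrier section.
wcet_cycles = [100, 50, 80, 100]

freqs = scaled_frequencies(wcet_cycles)
baseline = dynamic_energy(wcet_cycles, [1.0] * len(wcet_cycles))
scaled = dynamic_energy(wcet_cycles, freqs)
saving = 1.0 - scaled / baseline

# The section still ends when the slowest thread arrives at the barrier.
section_time = max(c / f for c, f in zip(wcet_cycles, freqs))
```

In this toy model the threads with less work run slower and use less energy, while the time at which the last thread reaches the barrier stays the same. Real DVS hardware offers only discrete voltage levels and incurs switching overhead, which this sketch leaves out.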
