Research on Key Techniques of High-Productivity OpenMP for Distributed Shared Memory Architectures

English title: High Productivity OpenMP for Distributed Shared Memory Architecture
Author: Huang Chun (黄春)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: High Productivity; OpenMP; OpenMP Extension; Two-Stage Data Prefetch; Checkpoint/Restart; Low-Energy Optimization
Degree year: 2007
Advisor: Yang Xuejun (杨学军)
Discipline code: 081202
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2007-04-01

Abstract

High-end computing has shifted from the single-minded pursuit of high performance toward high productivity: improving system performance, programmability, portability and robustness while reducing the cost of developing, running and maintaining systems. High-productivity computer systems depend on high-productivity programming environments. The applications targeted by future systems at the hundred-teraflops and petaflops scale are multidisciplinary and multiscale, and their complexity requires domain scientists and software experts to design, manage and maintain application programs together. This participation places higher demands on the performance, programmability, portability and fault tolerance of programming environments. OpenMP is easy to program, supports incremental program development, and offers good maintainability and high portability, so it is expected to remain a mainstream parallel programming language for a long time.
     Focusing on how to build a high-productivity OpenMP programming environment for large-scale parallel systems, this thesis studies key techniques for implementing OpenMP on large-scale distributed shared memory (DSM) systems, OpenMP language extensions for DSM systems, compiler-directed data prefetching, checkpoint/restart for OpenMP, and low-power optimization for OpenMP. The main contributions are as follows.
     1. CCRG OpenMP, an OpenMP parallel compiler, was designed and implemented for large-scale parallel computer architectures. A shared-data placement strategy that coordinates compile time and link time is proposed. It removes the need to allocate shared memory explicitly on a distributed operating system and supports data-locality optimization for checkpointing. The implementation applies many source-level optimizations to improve program performance. For scientific computing and simulation programs, the performance of CCRG OpenMP on our SCCMP system is comparable to that of an SGI Altix using the then-latest Intel 9.1 compiler.
     2. Two new OpenMP directives, BARRIER(thread_id) and ALLREDUCTION, are proposed to reduce the overhead of global operations such as barrier synchronization and reduction, which grows rapidly as OpenMP programs scale up. Implementation algorithms for both directives are given. For a real scientific application, the particle-cloud plasma physics code, performance improves by 76% with 64 threads.
     3. A compiler-directed two-stage data prefetching algorithm for OpenMP is proposed. It addresses the inaccurate prefetching that arises on DSM systems because remote and local memory accesses have different latencies. A static performance analysis model is built to evaluate the algorithm. On the SCCMP system, the algorithm improves the performance of swim from SPEC OMP2001 by 14% with 32 threads and by 9% with 64 threads.
     4. A checkpoint/restart mechanism for OpenMP that coordinates the system level and the application level is established, together with a blocking OpenMP checkpoint protocol. A CCRG OpenMP checkpoint/restart system is implemented on top of this mechanism. The system fully supports the OpenMP 2.0 API and has good scalability and practical value.
     5. Power optimization techniques for OpenMP are studied. Three low-power optimization methods and their implementation algorithms are proposed for parallel systems whose nodes support dynamic voltage scaling (DVS). For optimization based on worst-case execution time (WCET), a WCET analysis and DVS method based on synchronization (barrier) sections is proposed. It uses the synchronization section as the unit of analysis and voltage scaling, which effectively avoids the impact on program execution and power consumption of the load imbalance exposed by barrier synchronization. An energy consumption analysis model is built, and simulation shows that these techniques effectively reduce the energy consumed when parallel systems run OpenMP programs.
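
The prefetch inaccuracy mentioned in contribution 3 can be illustrated with the standard software-prefetching rule that a prefetch should be issued ceil(latency / time per iteration) iterations ahead of the use. The sketch below is not the thesis's two-stage algorithm, which the abstract does not describe in detail. The helper `prefetch_distance` is hypothetical, and the latencies and loop-body time are made-up numbers chosen only to show why one fixed distance cannot suit both local and remote accesses on a DSM system.

```python
import math


def prefetch_distance(latency_cycles: int, cycles_per_iteration: int) -> int:
    """Iterations ahead a prefetch must be issued to hide the given latency."""
    if latency_cycles <= 0 or cycles_per_iteration <= 0:
        raise ValueError("latency and iteration time must be positive")
    return math.ceil(latency_cycles / cycles_per_iteration)


# Assumed, illustrative numbers (not measurements from the thesis).
LOCAL_LATENCY = 200    # cycles for a miss served by local memory
REMOTE_LATENCY = 1200  # cycles for a miss served by a remote node
LOOP_BODY = 50         # cycles per loop iteration

local_d = prefetch_distance(LOCAL_LATENCY, LOOP_BODY)
remote_d = prefetch_distance(REMOTE_LATENCY, LOOP_BODY)

print(f"local distance:  {local_d} iterations")
print(f"remote distance: {remote_d} iterations")
```

With these assumed numbers the two distances differ by a factor of six. A prefetch issued at the local distance arrives too late for remote data, while one issued at the remote distance brings local data in so early that it may be evicted before use.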

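
The idea in contribution 5 of scaling voltage per synchronization section can be sketched with a textbook dynamic-energy model. The assumptions here are not taken from the thesis: energy per cycle is proportional to the square of the supply voltage, voltage scales linearly with frequency, static power and idle power at the barrier are ignored, and the per-thread workloads are invented. Under that idealized model, a thread with less work than the slowest thread can run more slowly and still reach the barrier on time.

```python
def section_energy(work_cycles, scale_to_barrier):
    """Relative dynamic energy of one synchronization section.

    work_cycles: cycles each thread executes before the closing barrier.
    scale_to_barrier: if True, each thread is slowed so that it arrives
    exactly when the slowest thread does; if False, all run at full speed.
    """
    if not work_cycles or max(work_cycles) == 0:
        return 0.0
    slowest = max(work_cycles)
    energy = 0.0
    for cycles in work_cycles:
        # Relative frequency (1.0 = maximum); voltage is assumed to track it.
        rel_freq = cycles / slowest if scale_to_barrier else 1.0
        rel_voltage = rel_freq
        energy += cycles * rel_voltage ** 2  # energy per cycle ~ V^2
    return energy


# Invented, imbalanced workload for four threads (in cycles).
work = [1000, 800, 500, 1000]

baseline = section_energy(work, scale_to_barrier=False)
scaled = section_energy(work, scale_to_barrier=True)
print(f"baseline energy: {baseline:.0f}")
print(f"scaled energy:   {scaled:.0f}")
print(f"saving:          {100 * (1 - scaled / baseline):.1f}%")
```

Real processors offer only a few discrete voltage and frequency levels and also consume static power, so the saving on real hardware would differ from what this model reports.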
引文

[1]Harold S.Stone.High-Performance Computer Architecture(3~(rd) Edition).Addison-Wesley Series in Electrical and Computer Engineering.1993.
    [2]周毓麟、沈隆钧,高性能计算的应用与战略地位,中国科学院院刊,1999,3:184-188.
    [3]PETAFLOP.http://www.petaflop.info/.
    [4]PetaFLOPS Enabling Technologies and Applications.http://www.hq.nasa.gov/hpcc/petaflops/.
    [5]ASCI Project.http://www.lanl.gov/projects/asci/asci.html.
    [6]ASCI Project.http://www.sandia.gov/ASCI/.
    [7]ASCI Project.http://www.llnl.gov/asci/.
    [8]ASCI Project.http://www.lanl.gov/asci/
    [9]The ASCI Q System:30 TeraOPS Capability at Los Alamos National Laboratory.http://www.sandia.gov/supercomp/sc2002/flyers/ASCI_Q_rev.pdf.
    [10]The Message Passing Interface(MPI) Standard.http://www-unix.mcs.anl.gov/mpi/index.htm.
    [11]OpenMP Architecture Review Board.OpenMP Application Programming Interface,Version 2.5.http://www.openmp.org,May 2005.
    [12]Timothy G.Mattson and Greg Henry.An Overview of the Intel TFLOPS Supercomputer.Intel Technology Journal,(Q 1):12.1998.
    [13]N.R.Adiga,G.Almasi,G.S.Almasi,Y.Aridor and R.Barik.An Overview of the BlueGene/L Supercomputer.In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing(SC'02).Baltimore,Maryland,USA.2002.
    [14]Jeffrey Skolnick.Putting the Pathway back into Protein Folding.Proceedings of National Academic Science 2005:102(7):2265-2266.
    [15]Mudge T.Power.A First Class Design Constraint for Future Architectures.In Proceeding of the 7~(th) International Conference on High Performance Computing (HiPC 2000),2000.Bangalore,India:Springer,2000.215-224.
    [16]Xizhou.Feng,Rong Ge and Kirk W.Cameron.Power and Energy Profiling of Scientific Applications on Distributed Systems.19~(th) International Parallel and Distributed Processing Symposium(IPDPS 05),Denver,CO.April 2005.
    [17]Alejandro Duran,Roger Ferrer,Juan Jos'e Costa,Marc Gonz'alez,Xavier Martorell,Eduard Ayguad'e and Jes'us Labarta.A Proposal for Error Handling in OpenMP.In Proceeding of 2~(nd) International Workshop of OpenMP (IWOMP'2006).Reims,France.June,2006.
    [18]High Productivity Computing Systems.http://www.highproductivity.org/.
    [19] J. W. Manke and J. Wu. Data-Intensive System Benchmark Suite Analysis and Specification. Atlantic Aerospace Electronics Corp. June 1999.

    [20] http://www.darpa.mil/ipto/research/pacc/index.html.

    [21] http://www.darpa.mil/ipto/programs/pca.

    [22] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 2.0, CRPC-TR92225. January 1997.

    [23] F.Allen, G.Almasi, W.Andreoni at el. Blue Gene: A Vision for Protein Science Using a Petaflop supercomputer[J]. IBM Systems Journal. 2001: 40(2)

    [24] Lorna Smith and Mark Bull. Development of Mixed Mode MPI/OpenMP Applications. In Proceeding of Workshop on OpenMP Applications and Tools (WOMPAT'OO). San Diego, California. July, 2000.

    [25] Kengo Nakajima and Hiroshi Okuda. Parallel Iterative Solvers for Unstructured Grids using an OpenMP/MPI Hybrid Programming Model for the GeoFEM Platform on SMP Cluster Architectures. In Proceeding of International Workshop on OpenMP: Experiences and Implementations (WOMPEI'02).Kyoto, Japan. May, 2002.

    [26] Panagiotis E.Hadjidoukas, Eleftherios D.Polychronopoulos and Theodore S.Papatheodorou. OpenMP for Adaptive Master-Slave Message Passing Applications. In Proceeding of International Workshop on OpenMP:Experiences and Implementations (WOMPEI'03). Tokyo, Japan. October, 2003.

    [27] Kengo Nakajima. OpenMP/MPI Hybrid vs. Flat MPI on the Earth Simulator: Parallel Iterative Solvers for Finite Element Method. In Proceeding of International Workshop on OpenMP: Experiences and Implementations (WOMPEI'03). Tokyo, Japan. October, 2003.

    [28] Gabriele Jost, Haoqiang Jin, Dieter an Mey and Ferhat F.Hatay. Comparing the OpenMP, MPI, and Hybrid Programming Paradigms on an SMP Cluster. In 5~(th) European Workshop on OpenMP (EWOMP'03). Aachen, Germany, September 2003.

    [29] Philippe Kloos, Fabrice Mathey and Philippe Blaise. OpenMP and MPI programming with a CG algorithm. In Proceeding of 2~(nd) European Workshop on OpenMP (EWOMP'00). Roma, Italy. September, 2002.

    [30] Robert.W.Numrich and J.K.Reid. Co-Array Fortran for Parallel Programming.Fortran Forum, 17(2): 1-31, August 1998.

    [31] W.W.Carlson, J.M.Draper, D.E.Culler, K.Yelick and K.W.E.Brooks.Introduction to UPC and Language Specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences. May, 1999.

    [32] K.A.Yelick, L.Semenzato, G.Pike, C.Miyamoto, B.Liblit, A.Krishnamurthy,P.N.Hilfinger, S.L.Graham, D.Gay, P.Colella and A.Aiken. Titanium: A High-Performance Java Dialect. Concurrency: Practice and Experience, 10(11-13), September-November 1998.

    [33] Eric Allen, David Chase, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr. and Sam Tobin-Hochstadt. The Fortress Languages Specification, Version 0.618. Technical Report. Sun Microsystems, Inc. April 2005.

    [34] Cray, Inc. CF90~(?)Co-Array Programming Manual. Technical Report SR-3908 3.1, Cray Computer, August 1998.

    [35] Cray, Inc. Cray XI Server. 2004. http://www.cray.com

    [36] David Callahan, Bradford L.Chamberlain and Hans P.Zima. The Cascade High Productivity Language. In 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2004), IEEE Computer Society 50-62.2004.

    [37] Kemal Ebcioglu, Vijay Saraswat and Vivek Sarkar. X10: an Experimental Language for High Productivity Programming of Scalable Systems. In Proceeding of 2nd Workshop on Productivity and Performance in High-End Computing. San Francisco, USA. February 13, 2005.

    [38] SGI Inc. SGI Products: Servers and Supercomputers: SGI Altix Family.http://www.sgi.com/products/servers/altix/.

    [39] J.Laudon and J.Laudon. The SGI Origin2000: A ccNUMA Highly Scalable Server. Proceeding of the 24~(th) Annual International Symposium on Computer Architecture, pp. 241-251, Denver,USA. 1997.

    [40] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceeding of the 24~(th) Annual International Symposium on Computer. Architecture. 25(2): 241-251. Denver, USA. June, 1997.

    [41] Columbia Supercomputer. http://www.nas.nasa.gov/Resources/Systems/

    [42] J.Bircsak, P.Craig, R.Crowell, Z.Cvetanovic, J.Harris, C.Nelson and C.Oner.Extending OpenMP for NUMA Machines. In Proceeding of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'OO). Dallas, USA. Scientific Programming, 8(3): 163-181.November 2000.

    [43] Dimitrios S.Nikolopoulos and Theodore S.Papatheodorou. Is Data Distribution Necessary in OpenMP? In Proceeding of the IEEE/ACM Supercomputing'2000:High Performance Networking and Computing Conference (SC'2000), Dallas,Texas, USA. November, 2000.

    [44] Dimitrios S.Nikolopoulos. Quantifying and Resolving Remote Memory Access Contention on Hardware DSM Multiprocessors. In Proceeding of the 16~(th) IEEE International Parallel and Distributed Processing Symposium (IPDPS'2002), Fort Lauderdale, Florida, USA. April, 2002.

    [45] Zhenying Liu, Barbara M.Chapman, Tien-hsiung Weng and Oscar Hernandez. Improving the Performance of OpenMP by Array Privatization.In Proceeding of Workshop on OpenMP Applications and Tools(WOMPAT 2002):244-259,Heidelberg,Germany.August,2002.
    [46]OpenMP Architecture Review Board.OpenMP Application Programming Interface,Version 1.0.October1997.http://www.openmp.org.
    [47]周虎成、黄春、赵克佳,编译器指导的OpenMP Fontran程序数据分布,南京大学学报(自然科学),第41卷第5期,2005.9.
    [48]V.Schuster and D.Miles.Distributed OpenMP:Extensions to OpenMP for SMP Clusters.In Proceeding of the Workshop on OpenMP Applications and Tools (WOMPAT'00).San Diego,California.July,2000.
    [49]T.S.Abdelrahman and T.N.Wong.Compiler Support for Data Distribution on NUMA Multiprocessors.Journal of Supercomputing,12(4):349-371,October 1998.
    [50]Dimitrios S.Nikolopoulos and Constantine D.Polychronopoulos.Adaptive Scheduling under Memory Pressure on Multiprogrammed SMPs.In Proceeding of the 16~(th) International Parallel and Distributed Processing Symposium(IPDPS 2002).Fort Lauderdale,FL,USA.April,2002.
    [51]Robert L.McGregor,Christos D.Antonopoulos and Dimitrios S.Nikolopoulos.Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors.In Proceeding of the 19~(th) International Parallel and Distributed Processing Symposium(IPDPS 2005).Denver,California,USA.April,2005.
    [52]Dimitrios S.Nikolopoulos and Constantine D.Adaptive Scheduling under Memory Pressure on Multiprogrammed Cluster.CCGRID 2002:22-29.
    [53]Dimitrios S.Nikolopoulos.Quantifying and Resolving Remote Memory Access Contention on Hardware DSM Multiprocessors.In Proceeding of the 16~(th)International Parallel and Distributed Processing Symposium(IPDPS 2002).Fort Lauderdale,FL,USA.April,2002.
    [54]Aurlien Bouteiller,Thomas Herault,Graud Krawezik,Pierre Lemarinier and Franck Cappello.MPICH-V:A Multiprotocol Fault Tolerant MPI.International Journal of High Performance Computing and Applications,20:77-90,Spring 2006.
    [55]M.Litzkow,T.Tannenbaum,J.Basney and M.Livny.Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System.Technical Report 1346,University of Wisconsin-Madison,1997.
    [56]G.Stellner.CoCheck:Checkpointing and Process Migration for MPI.In Proceedings of the 10~(th) International Parallel Processing Symposium(IPPS '96),Honolulu,Hawaii,1996.
    [57]Aurelien Bouteiller,Franck Cappello,Thomas Herault,Geraud Krawezik,Pierre Lemarinier and Frederic Magniette. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes Based on Pessimistic Sender Based Message Logging. In IEEE/ACM Supercomputing Conference (SC'03), November 2003.
    [58] Huang Chun and YANG Xue-Jun, Performance Analysis and Improvement of OpenMP on Software Distributed Shared Memory Systems. In Proceeding of 1st European Workshop on OpenMP(EWOMP'03), Aachen, Germany. September 2003.
    [59] George Bosilca, Aurlien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fdak,Ccile Germain, Thomas Hrault, Pierre Lemarinier, Oleg Lodygensky, Frdric Magniette, Vincent Nri and Anton Selikhov. Mpich-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Proceeding of IEEE/ACM Supercomputing Conference (SC'02), November 2002.
    [60] Greg Bronevetsky. Portable Checkpointing for Parallel Applications. PhD Thesis, January, 2007.

    [61] http://theory.lcs.mit.edU/classes/6.972/TMC%20Corp.html.
    [62] IBM XL Fortran Advanced Edition VI 0.1.1 for Linux.http://www-306.ibm.com/software/awdtools/fortran
    [63] G.Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10~(th) International Parallel Processing Symposium (IPPS '96),Honolulu, Hawaii, 1996.
    [64] J.M.Squyres and A.Lumsdaine. A Component Architecture for LAM/MPI. In Proceedings of 10th European PVM/MPI Users' Group Meeting, number 2840 in Lecture Notes in Computer Science, pages 379-387, Venice, Italy, September/ October 2003. Springer-Verlag.
    [65] Aurelien Bouteiller, Franck Cappello, Thomas Herault, Geraud Krawezik, Pierre Lemarinier and Frederic Magniette. MPICH-V2: A Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender based Message Logging. In IEEE/ACM Supercomputing Conference, Phoenix, USA. November 2003.
    [66] G.Burns, R.Daoud and J.Vaigl. LAM: An Open Cluster Environment for MPI.In Proceedings of Supercomputing Symposium, 379-386, Toronto, Canada. June,1994.
    [67] High Performance Fortran Language Specification. High Performance Fortran Forum, Version 2.0, CRPC-TR92225. January 1997.
    [68] Psilogeorgopoulos M., Munteanu M., Chuang T., et al. Contemporary Techniques for Lower Power Circuit Design, PREST Deliverable D2.1:Technical Report. D2.1. University of Sheffield, 1998.
    [69] Rabaey J.M, Pedram Masoud. Low Power Design Methodologies. Boston: Kluwer Academic Publishers. 1996
    [70] Tiwari V., Singh D., Rajgopal S. and et al. Reducing Power in High-performance Microprocessors. In Proceeding of the 35 Annual Conference on Design automation, 1998. San Francisco, CA USA: ACM Press,New York, NY, USA, 1998. 732-737.

    [71] Borkar S. Low Power Design Challenges for the Decade (Invited Talk). In Proceeding of the 2001 conference on Asia South Pacific Design Automation, 2001. Yokohama, Japan: ACM Press, New York, NY, USA, 2001. 293-296.

    [72] Chandrakasan A.P. and Brodersen R.W. Minimizing Power Consumption in Digital CMOS Circuits. In Proceeding . IEEE, 1995, 83: 498-523.

    [73] Thompson S., Packan P. and Bohr M. MOS Scaling: Transistor Challenges for the 21~(st) Century. Intel Technology Journal, 1998, Q3.

    [74] Jung S., Kim K. and Kang S. Low-Swing Clock Domino Logic Incorporating Dual Supply and Dual Threshold Voltages. In Proceeding of the 39~(th) conference on Design Automation, 2002. New Orleans, Louisiana, USA: ACM Press, New York, NY, USA, 2002. 467-472.

    [75] Amelifard B. and Fallah F. Pedram M. Low-Power Fanout Optimization Using Multiple Threshold Voltage Inverters. In Proceeding of the 2005 International Symposium on Low Power Electronics and Design, August 8-10, 2005. San Diego, California, USA: ACM, 2005. 95-98.

    [76] Calhoun B.H. and Chandrakasan A. Characterizing and Modeling Minimum Energy Operation for Subthreshold Circuits. In Proceeding of International Symposium on Low Power Electronics and Design 2004, August 9-11, 2004.Newport Beach, California, USA: ACM, 2004. 90-95.

    [77] Donno M., Ivaldi A., Benini L. and et al. Clock-Tree Power Optimization based on RTL Clock-Gating. In Proceeding of the 40~(th) conference on Design automation, June 2-6,2003. Anaheim, California, USA: ACM Press, New York,NY, USA, 2003. 622-627.

    [78] Heydari P. and Pedram M. Interconnect Energy Dissipation in High-Speed ULSI Circuits. In Proceeding of ASP-DAC/VLSI Design 2002, Jan, 2002.Bangalore, India: IEEE, 2002. 132-140.

    [79] Kapur P., Chandra G. and Saraswat K.C. Power Estimation in Global Interconnects and its Reduction Using a Novel Repeater Optimization Methodology. In Proceeding of the 39~(th) conference on Design automation, June,2002. New Orleans, Louisiana, USA: ACM Press, New York, NY, USA, 2002.461-466.

    [80] Wason V. and Banerjee K. A Probabilistic Framework for Power-Optimal Repeater Insertion in Global Interconnects under Parameter Variations. In Proceeding of the 2005 international symposium on Low power electronics and design, August 8-10, 2005. San Diego, California, USA: ACM Press, New York,NY,USA, 2005. 131-136.____________________________________________
    [81] Kim N.S., Austin T., Blaauw D and et al. Leakage Current: Moore's Law Meets Static Power. IEEE Computer, 2003, 36(12): 65-77.

    [82] Kim N.S., Blaauw D. and Mudge T. Leakage Power Optimization Techniques for Ultra Deep Sub-Micron Multi-Level Caches. In Proceeding of 2003 International Conference on Computer-Aided Design (ICCAD'03), November 11-13, 2003. San Jose, California, USA: IEEE Computer Society / ACM, 2003.627-632.

    [83] Ananthan H., Kim C.H. and Roy K. Larger-than-Vdd Forward Body Bias in Sub-0.5V Nanoscale CMOS. In Proceeding of the 2004 international symposium on Low power electronics and design, August 9-11, 2004. Newport Beach, California, USA: ACM Press, New York, NY, USA, 2004. 8-13.

    [84] Rao R.M., Burns J.L. and Devgan A. Efficient Techniques for Gate Leakage Estimation. In Proceeding of the 2003 international symposium on Low power electronics and design, August 25-27, 2003. Seoul, Korea: ACM Press, New York, NY, USA, 2003. 100-103.

    [85] Piguet C, Renaudin M. and Omnes T.J. Special Session on Low-Power Systems on Chips (SOCs). In Proceeding of Design, Automation, and Test in Europe (DATE '01), February 2004. Paris, France: IEEE Computer Society, 2001.

    [86] Powell M.D., Schuchman E. and Vijaykumar T.N. Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines. In Proceeding of the 38th annual IEEE/ACM International Symposium on Microarchitecture,November 12-16,2005. Barcelona, Spain: IEEE CS, 2005. 294-304.

    [87] Ku J.C., Ozdemir S., Memik G. and et al. Thermal Management of On-Chip Caches through Power Density Minimizatio. In Proceeding of the 38~(th) annual IEEE/ACM International Symposium on Microarchitecture, November 12-16,2005. IEEE CS, 2005. 283-293.

    [88] Itrs. International Technology Roadmap for Semiconductors, 2005 Edition.ITRS, May 2006, download from http://public.itrs.net.

    [89] Chase J.S. and Doyle R.P. Balance of Power: Energy Management for Server Clusters. In Proceeding of the 8th Workshop on Hot Topics in Operating Systems (HotOS), May 20-23,2001. Schloss Elmau, Germany: IEEE CS, 2001.

    [90] Pinheiro E., Bianchini R., Carrera E.V. and et al. Dynamic cluster reconfiguration for power and performance. In: Compilers and Operating Systems for Low Power.: Kluwer Academic Publishers, 2003. 75-93.

    [91] Chase J.S., Anderson D.C., Thakar P.N. and et al. Managing Energy and Server Resources in Hosting Centers. In Proceeding of the 18~(th) ACM symposium on Operating systems principles, 2001. Banff, Alberta, Canada: ACM Press, 2001.103-116.

    [92] Bianchini R. Research Directions in Power and Energy Conservation for Clusters.DCS-TR-466.Department of Computer Science,Rutgers University,November 2001.
    [93]Pinheiro E.and Bianchini R.Energy Conservation Techniques for Disk Array-Based Servers.In Proceeding of the 18~(th) Annual International Conference on Supercomputing(ICS 2004).June 26 - July 01,2004.Saint Malo,France:ACM Press,2004.68-78.
    [94]Heath T.,Diniz B.and Carrera E.V.Energy Conservation in Heterogeneous Server Clusters.In Proceeding of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,PPOPP 2005,June 15-17,2005.Chicago,IL,USA:ACM Press,2005.186-195.
    [95]Pinheiro E.,Bianchini R.and Dubnicki C.Exploiting Redundancy to Conserve Energy in Storage Systems.In Proceeding of the Joint International Conference on Measurement and Modeling of Computer Systems(SIGMETRICS),June 26-30,2006.Saint Malo,France:ACM Press,2006.
    [96]Elnozahy E.N.,Kistler M.and Rajamony R.Energy-Efficient Server Clusters.In Proceeding of 2~(nd) Workshop on Power-Aware Computing Systems,February 2002.Cambridge,MA,USA:Springer Verlag,2002.
    [97]Charles Lefurgy,Karthick Rajamani,Freeman Rawson,Wes Felter,Michael Kistler and Tom W.Keller.Energy Management for Commercial Servers.IEEE Computer,2003,36(12):39-48.
    [98]Ricardo Bianchini and Ram Rajamony.Power and Energy Management for Server Systems.IEEE Computer,2004,37(11):68-74.
    [99]Xu R.,Zhu D.,Rusu C.and et al.Energy-Efficient Policies for Embedded Clusters.In Proceeding of the 2005 ACM SIGPLAN/SIGBED Conference on Languages,Compilers,and Tools for Embedded Systems(LCTES'05),June 15-17,2005.Chicago,Illinois,USA:ACM Press,2005.1-10.
    [100]Chunghsing Hsu and Wuchun Feng.A Power-Aware Run-Time System for High-Performance Computing.SC'05 November 12-18,2005,Seattle,Washington,USA.
    [101]Pang Zhengbin,Dou Qiang,Liu Guangming,Li Yongjin,Zhou Xingming,A Hybrid Directory-Based Cache Coherence Protocol Design,高性能通讯(2004年全国计算机体系结构学术会议)14(增刊):142-146.济南,2004.
    [102]Pang zhengbin,Zhang Jun,Li Yongjin,Xia Jun and Xu Weixia.A Cost-Effective DirsNB+CCV Directory Scheme and Its Efficient Implementation on SCCMP System.第十四届全国信息存储技术学术会议.2006.
    [103]SGI Inc.SGI Products:Servers and Supercomputers:SGI Altix Family.http://www.sgi.com/products/servers/altix/.
    [104] C.Amza, A.L.Cox, S.Dwarkadas, P.Keleher, H.Lu, R.Rajamony, W.Yu and W.Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2): 18-28, February 1996.
    [105] W.Hu, Weisong Shi and Z.Tang. J1AJIA: An SVM System Based on A New Cache Coherence Protocol, In Proceeding of the High Performance Computing and Networking (HPCN'99), LNCS 1593, Springer, Apr. 1999, pp. 463-472.
    [106] Nicholas P.Carter, William J.Dally, Whay S.Lee, Stephen W.Keckler and Andrew Chang. Processor Mechanisms for Software Shared Memory. In International Symposium on High-Performance Computing. Tokyo, Japan, Oct.2000.

    [107] Lustre. http://www.lustre.org/.

    [108] Cluster File Systems. Lustre Scalable Stroage. http://www.clusterfs.com/.
    [109] V.Aslot, M.Domeika, R.Eigenmann, G.Gaertner, W.B.Jones and B.Parady.MIPSpro Compiling and Performance Tuning Guide. 1999.
    [110] IBM XL C/C++ Advanced Edition V8.0.1 for Linux.http://www-306.ibm.com/software/awdtools/ccompilers.

    [111] Sun Microsystems Inc. The F95 Sun Compiler. http://www.sun.org. 2001.
    [112] Intel Corporation. Intel C++ Compiler 9.1 for Linux. 2006.http://www.intel.com/cd/software/products/asmo-na/eng/compilers/clin/277618.htm.
    [113] Intel Corporation. Intel Fortran Compiler 9.1 for Linux. 2006.http://www.intel.com/cd/software/products/asmo-na/eng/compilers/282048.htm
    [114] The Portland Group. PGI Fortran Reference. 2005. http://www.pgroup.com.
    [115] Diego Novillo. OpenMP and Automatic Parallelization in GCC. In Proceeding of GCC Developers' Summit, Ottawa, Canada. June 2006.
    [116] Jason Merrill. GENERIC and GIMPLE: A New Tree Representation for Entire Functions. 171-180. Ottawa, Ontario,Canada. The GCC & GNU Toolchain Developers' Summit. 2003.
    [117] Christial Brunschen and Mats Brorsson. OdinMP/CCP: A Free Portable OpenMP Implementation for C. In Proceedings of 1st European Workshop on OpenMP (EWOMP'99), 123-129, Lund, Sweden. 1999.
    [118] K.Kusano, M.Sato, S.Satoh and Y.Tanaka. Design of OpenMP Compiler for an SMP Cluster. In Proceedings of 1st European Workshop on OpenMP (EWOMP'99), 32-39, Lund, Sweden, 1999.
    [119] J.Balart, A.Duran, M.Gonz'alez, X.Martorell, E.Ayguade and J.Labarta. Nanos Mercurium: a Research Compiler for OpenMP. In Proceeding of the 6~(th) European Workshop on OpenMP (EWOMP'04), Stockholm, Sweden. October,2004.

    [120] M.Gonzalez, E.Ayguade, J.Labarta, X.Martorell, N.Navarro and J.Oliver. NanosCompiler:A Research Platform for OpenMP Extensions.In Proceeding of the 1~(st) European Workshop on OpenMP(EWOMP'99).Lund,Sweden,October,1999.
    [121]E.Ayguade,M.Gonzalez,J.Labarta,X.Martorell,N.Navarro and J.Oliver.NanosCompiler:a Research Platform for OpenMP Extensions.In Proceedings of the 1~(st) European Workshop on OpenMP(ewomp99),Lund,Sweden.1999.
    [122]INTONE:Innovative Tools for Non-Experts,IST/FET project (IST-1999-20252).http://www.cepba.upc.es/intone/.
    [123]Huang Chun and YANG Xue-Jun.CCRG OpenMP:Experiments and Improvements.Lecture Notes in Computer Science,2005,2690:514-521.
    [124]陈永健,OpenMP编译与优化技术研究,清华大学博士论文,2004.
    [125]Open64 Compiler and Tools.http://sourceforge.net/projects/open64.
    [126]C.Amza,A.L.Cox,S.Dwarkadas,P.Keleher,H.Lu,R.Rajamony,W.Yu and W.Zwaenepoel.YreadMarks:Shared Memory Computing on Networks of Workstations.IEEE Computer,29(2):18-28,February 1996.
    [127]Hu Y.C,Honghui Lu,Alan L Cox and Willy Zwaenepoel,OpenMP for Networks of SMPS.Journal of Parallel and Distributed Computing 60(12),1512-1530.1999.
    [128]袁国兴、张宝琳,一类流体力学问题的并行计算,计算物理,第11卷第4期,1994,p483-488
    [129]D.Hensgen,R.Finkel and U.Manber,Two Algorithms for Barrier Synchronization.Journal of Parallel Programming 17,1988.
    [130]Sage++:a Class Library for Building Fortran and C++ Restructuring Tools (Version 1.9).May 1995.http://www.extreme.indiana.edu/sage.
    [131]Francois B.,Peter B.,Dennis G.and et al.Sage++:an Object-Oriented Toolkit and Class Library for Building Fortran and C++ Restructuring Rools.ftp://www.ftp.extreme.indiana.edu/pub/sage.
    [132]Dmitry Pekurovsky.OpenMP Microbenchmarking Study on IBM SP High Nodes.WOMPAT2000,2000.
    [133]J.Mark Bull and Darragh O'Neill.A Microbenchmark Suite for OpenMP 2.0.In Proceeding of the 3~(th) European Workshop on OpenMP(EWOMP'01).2001.
    [134]Fiona J.L.Reid and J.Mark Bull.OpenMP Microbenchmarks Version 2.0.Technical Report.7-14-2004.July,2004.
    [135]NAS Parallel Benchmarks,http://www.nas.nasa.gov/Resources/Software.2006.
    [136]Rob F.Van Der Wijngaart.NAS Parallel Benchmarks Version 2.4.NAS Technical Report NAS-02-007,NASA Ames Research Center,Moffett Field,CA,October 2002.
    [137]Huang Chun and YANG Xue-Jun,Improve OpenMP Performance by Extending BARRIER and REDUCTION Constructs.International Workshop on OpenMP:Experiences and Implementation(WOMPEI2003).Lecture Notes in Computer Science,2003.10.
    [138]J.Labarta,E.Ayguadé and José Oliver.New OpenMP Directives for Irregular Data Access Loops.In Proceeding of the 2~(nd) European Workshop on OpenMP (EWOMP'00).Edinburgh,U.K.September,2000
    [139]B.Chapman,P.Mehrotra,H.Zima.Enhancing OpenMP with features for locality control.Proceedings of Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology.Towards Teracomputing.World Scientific Publishing.1999,pp.301-13.Singapore
    [140]D.S.Nikolopulos and T.S.Papatheodorou.System Software Support for Reducing Memory Latency on Distributed Shared Memory Multiprocessors.Proceeding of the 7~(th) Hellenic Conference on Informatics(HCI'99),pp.61-68,Ioannina,Greece,August 1999.
    [141]莫则尧、许林宝、张宝琳、沈隆钧,二维等离子体模拟粒子云网格方法的并行计算与性能分析,计算物理,第16卷第5期,1998,pp.496-504.
    [142]J.Laudon and D.Lenoski.The SGI Origin2000:A ccNUMA Highly Scalable Server.Proceeding of the 24~(th) Annual International Symposium on Computer Architecture,pp.241-251,Denver(USA),1997.
    [143]T.Lovett and R.Clapp.STING:A CC-NUMA Computer System for the Commercial Marketplace.Proceeding of the 23~(rd) Annual International Symposium on Computer Architecture,Philadelphia(USA),1996.
    [144]O.Tatebe,M.Sato and S.Skeiguchi.Impact of OpenMP Optimization for the MCCG Method.Japan.2000.In Proceeding of International Workshop on OpenMP:Experiences and Implementations(WOMPEI2000).Kyoto,Japan.May,2000.
    [145]Geraud Krawezik,Guillaume Alleon and Franck Cappello.SPMD OpenMP versus MPI on a IBM SMP for 3 Kernels of the NAS Benchmarks.In Proceeding of International Workshop on OpenMP:Experiences and Implementations(WOMPEI'02).Kyoto,Japan.May,2002.
    [146]C.K.Birdsall and A.B.Langdon.Plasma Physics via Computer Simulation.McGraw Hill Book Company,1985.
    [147]PVM:Parallel Virtual Machine.http://www.csm.ornl.gov/pvm/
    [148]Mowry TC.Tolerating Latency through Software-Controlled Data Prefetching.Ph.D.Thesis.Stanford University,1994.
    [149]Steven P.VanderWiel and David J.Lilja.Data Prefetching Techniques.ACM Computing Surveys,vol.32 no.2,June 2000.
    [150]Steven P.VanderWiel and David J.Lilja.Data Prefetch Mechanisms.ACM Computing Surveys,32(2):174-199,Association for Computing Machinery,2000.
    [151]Haoqiang Jin and Gabriele Jost.Performance Evaluation of Remote Memory Access(RMA) Programming on Shared Memory Parallel Computers.NAS-03-001,http://www.nas.nasa.gov/News/Techreports/2003/2003.html.
    [152]Allan K.Porterfield.Software Methods for Improvement of Cache Performance on Supercomputer Applications.PhD thesis,Rice University,May 1989.
    [153]Roland Lun Lee.The Effectiveness of Caches and Data Prefetch Buffers in Large-Scale Shared Memory Multiprocessors.PhD thesis,University of Illinois at Urbana-Champaign,May 1987.
    [154]David M.Marcovitz.A multiprocessor cache performance metric.Master's thesis,University of Illinois at Urbana-Champaign.October,1988.
    [155]Harlan Husmann.Compiler Memory Management and Compound Function Definition for Multiprocessors.PhD thesis,University of Illinois at Urbana-Champaign,August,1986.
    [156]Daniel Thomas Jachson.Data Movement in Doall Loops.Master's Thesis,University of Illinois at Urbana-Champaign.May,1985.
    [157]Kyungsook Yoon Lee.Interconnection Networks and Compiler Algorithms for Multiprocessors.PhD thesis.University of Illinois at Urbana-Champaign.1983.
    [158]Dennis Gannon,William Jalby and Kyle Gallivan.Strategies for Cache and Local Memory Management by Global Program Transformation.In 1987 International Conference on Supercomputing,1987.
    [159]Edward H.Gornish.Compile Time Analysis for Data Prefetching.Master's thesis,University of Illinois at Urbana-Champaign,December 1989.
    [160]E.Gornish,E.Granston and A.Veidenbaum.Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies.In International Conference on Supercomputing.1990.
    [161]J.D.Choi and J.M.Stone.Balancing Runtime and Replay costs in a trace-and-replay system.In Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging,Dec.1991.ACM SIGPLAN Notices,26(12):26-35.
    [162]J.Duell.The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart.http://www.nersc.gov/research/FTG/checkpoint/reports.html.
    [163]A.Barak,S.Guday and R.Wheeler.The MOSIX Distributed Operating System,Load Balancing for UNIX.Number 672 in Lecture Notes in Computer Science,Springer-Verlag,1993.
    [164]Gautam Doshi,Rakesh Krishnaiyer and Kalyan Muthukumar.Optimizing Software Data Prefetches with Rotating Registers.In proceeding of the International Conference on Parallel Architectures and Compilation Techniques.Barcelona,Catalunya,Spain.September,2001.
    [165]J.S.Plank,M.Beck,G.Kingsley and K.Li.Libckpt:Transparent Checkpointing under UNIX.In USENIX Winter.1995.p.213-224.
    [166]Michael Marchetti,Leonidas I.Kontothanassis,Ricardo Bianchini and Michael L.Scott.Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems.In Proceeding of the 9th International Parallel Processing Symposium,pp.480-485,Santa Barbara,CA,April 1995.
    [167]J.M.Squyres and A.Lumsdaine.A Component Architecture for LAM/MPI.In Proceedings,10th European PVM/MPI Users' Group Meeting,number 2840 in Lecture Notes in Computer Science,pages 379-387,Venice,Italy,September/October 2003.Springer-Verlag.
    [168]G.Burns,R.Daoud and J.Vaigl.LAM:An Open Cluster Environment for MPI.In Proceedings of Supercomputing Symposium,pages 379-386,1994.
    [169]Aurelien Bouteiller,Franck Cappello,Thomas Herault,Geraud Krawezik,Pierre Lemarinier and Frederic Magniette.MPICH-V2:a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender based Message Logging.In IEEE/ACM Supercomputing Conference,November 2003.
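    [170]Huang Chun and Yang Xuejun.A Value-Profile-Based OpenMP Runtime Optimization System.Computer Engineering and Science,Vol.28,No.12,2006.
    [171]Wang Zhaofei and Huang Chun.Static Detection of Deadlocks in OpenMP Fortran Programs.Journal of Computer Research and Development,Vol.43,No.3,2007.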
    [172]Kim S,Veidenbaum A V.Stride-Directed Prefetching for Secondary Caches.In Proceeding of the 1997 International Conference on Parallel Processing,IEEE CS Press,1997.
    [173]W.Dieter and J.Lumpp Jr.A User-Level Checkpointing Library for POSIX Threads Programs.In Symposium on Fault-Tolerant Computing Systems (FTCS),June 1999.
    [174]J.Duell,P.Hargrove and E.Roman.The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart.Technical report,Lawrence Berkeley National Laboratory,November 2003.
    [175]J.Duell.The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart.http://www.nersc.gov/research/FTG/checkpoint/reports.html.
    [176]G.Bronevetsky,M.Schulz,P.Szwed,D.Marques and K.Pingali.Checkpointing shared memory programs at the application-level.In European Workshop on OpenMP, 2004.

    [177] G.Bronevetsky, M.Schulz, P.Szwed, D.Marques and K.Pingali. Application-level checkpointing for shared memory programs. In Conference on Architectural Support for Programming Languages and Operating Systems, 2004.

    [178] G.Bronevetsky, M.Schulz, P.Szwed, D.Marques and K.Pingali. Checkpointing Shared Memory Programs at the Application-Level. In European Workshop on OpenMP, 2004.

    [179] Greg Bronevetsky, Keshav Pingali and Paul Stodghill. Application-level Checkpointing for OpenMP Programs. International Conference on Supercomputing (ICS). 2006.

    [180] Mueller F. Static Cache Simulation and its Applications. Ph.D. Thesis. Department of Computer Science, Florida State University, 1994.

    [181] Colin A, Puaut I. Worst Case Execution Time Analysis for a Processor with Branch Prediction. Real-Time Systems, 2000, 18(2/3): 249-274.

    [182] Edward H. Gornish, Elana D. Granston and Alexander V. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In Proceeding of the 4th International Conference on Supercomputing. Amsterdam, Netherlands. June 1990, p. 354-368.

    [183] Todd Mowry and Anoop Gupta. Tolerating Latency through Software Controlled Prefetching in Shared-memory Multiprocessors. Journal of Parallel and Distributed Computing, Vol.12, No.2, June 1991, p. 87-106.

    [184] John W.C. Fu and Janak H. Patel. Data Prefetching in Multiprocessor Vector Cache Memories. In Proceeding of the 18th Annual International Symposium on Computer Architecture. Toronto, Ontario, Canada. May 1991, p. 54-63.

    [185] Rodric M. Rabbah, Hariharan Sandanagobalane, Mongkol Ekpanyapong and Weng-Fai Wong. Compiler Orchestrated Prefetching via Speculation and Predication. In Proceeding of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'04), ACM Press. December, 2004. p. 189-198.

    [186] Trevor Pering, Tom Burd and Robert Brodersen. Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System. In Proceedings of the Power-Driven Microarchitecture Workshop, associated with ISCA98. Barcelona, Spain, June 1998.

    [187] M. Weiser, B. Welch, A. Demers and S. Shenker. Scheduling for Reduced CPU Energy. Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation. November 1994, pp. 13-23.

    [188] Jacob Rubin Lorch. Operating Systems Techniques for Reducing Processor Energy Consumption. Ph.D. Thesis. University of California, Berkeley, Fall 2001.

    [189] Daniel Mosse et al. Compiler-Assisted Dynamic Power-Aware Scheduling for Real-Time Applications. Workshop on Compilers and Operating Systems for Low-Power (COLP'00), Philadelphia, PA, October 2000.

    [190] Dongkun Shin et al. Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications. In IEEE Design & Test of Computers, March 2001.

    [191] Dakai Zhu et al. Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multi-Processor Real-Time Systems. IEEE Trans. on Parallel & Distributed Systems, vol. 14, no. 7, pp. 686-700, 2003.

    [192] Chung-Hsing Hsu et al. The Design, Implementation, and Evaluation of a Compiler Algorithm for CPU Energy Reduction. Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pp. 38-48, June 2003.

    [193] Kim J W, Rabbah R.M, Palem K.V, Wong W.F. Adaptive Compiler Directed Prefetching for EPIC Processors. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, CSREA Press, 2004.

    [194] Karthick Rajamani and Charles Lefurgy. On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, March 2003.

    [195] Puschner P, Burns A. A Review of Worst-Case Execution-Time Analysis (Editorial). Holland: Kluwer Academic Publishers, 1999.

    [196] Rong Ge, Xizhou Feng and Kirk W.Cameron. Performance-Constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters. SC'05 November 12-18, 2005, Seattle, Washington, USA.

    [197] Vincent W.Freeh, Feng Pan, Nandini Kappiah, David K.Lowenthal and Rob Springer. Exploring the Energy-Time Tradeoff in MPI Programs on a Power-Scalable Cluster. 19th International Parallel and Distributed Processing Symposium (IPDPS 05), April 2005. (Denver, CO).

    [198] Robert Springer, David K. Lowenthal, Barry Rountree and Vincent W. Freeh. Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'06). New York, USA. March, 2006. pp. 230-238.

    [199] William J. Dally, Patrick Hanrahan, Mattan Erez, Timothy J. Knight, et al. Merrimac: Supercomputing with Streams. In Proceeding of the Super Computing Conference (SC'03). Phoenix, USA. November, 2003.

    [200] http://cva.stanford.edu/projects/imagine

    [201] http://www.bsc.cs/projects/deepcoumputing/linuxoncell.

    [202] A.E.Eichenberger, et al. Using Advanced Compiler Technology to Exploit the Performance of the CELL Broadband Engine Architecture. IBM Systems Journal. Vol.45, No.1. 2006.
    [203]http://merrimac.stanford.edu
    [204]http://techpubs.sgi.com/library/tpl/
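    [205]Han Guoxing.Block Iterative Methods and Parallel Algorithms for Non-Equilibrium Stiff Kinetic Equations.Proceedings of the 6th National Annual Conference on Computational Mathematics,October 1999.
    [206]Yuan Xianchun and Liao Zhenmin.Parallel Computation of the Multi-Fluid Grid Method.Computer Engineering and Science,No.4,1984.
    [207]Mo Zeyao,Fu Shangwu and Shen Longjun.Parallelization of a Two-Dimensional Three-Temperature Hydrodynamics Numerical Simulation Program.Chinese Journal of Computational Physics,Vol.17,No.6,2000,pp.625-632.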
    [208]D.Callahan,K.Kennedy,and A.Porterfield.Software Prefetching.In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems,1991,pp.40-52.
    [209]David Bernstein,Doron Cohen and Ari Freund.Compiler Techniques for Data Prefetching on the PowerPC.In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques,June 1995,p.19-26.
    [210]Yeager,K.C..The MIPS R10000 Superscalar Microprocessor.IEEE Micro,Vol.16,No.2.April,1996,p.28 - 41.
    [211]Jerry Huck,Dale Morris,Jonathan Ross,Allan Knies,Hans Mulder,and Rumi Zahir.Introducing the IA-64 Architecture.IEEE Micro,20(5):12-23.2000.
    [212]Jean-Loup Baer and Tien-Fu Chen.Effective Hardware-based Data Prefetching for High-Performance Processors.IEEE Transactions on Computers,44(5):609-623,1995.
