Research on Key Techniques of Programming Models and Compiler Optimization for Many-Core GPUs
Abstract
GPGPU (General Purpose computing on Graphics Processing Units) has been widely applied in high-performance computing. However, the GPU architecture and programming model differ from those of traditional CPUs, so developing efficient GPU applications remains highly challenging. This thesis studies key techniques of programming models and compiler optimization for many-core GPUs and addresses several key theoretical and technical problems in this area. The main research results and technical innovations are as follows:
     1. A many-threaded parallel programming model. With the arrival of the multi-core and many-core era, research on parallel programming models is flourishing; nevertheless, no multi-core or many-core parallel programming model has yet been universally accepted. Building on the stream-parallel programming paradigm and weighing the strengths and weaknesses of typical parallel programming models, this thesis proposes a many-threaded programming model, ab-Stream. The model effectively hides differences among many-core architectures and offers programmers a parallel programming model that is easy to parallelize, easy to program, easy to extend, and easy to tune.
     2. Multi-granularity parallelization methods for mapping GPGPU applications. A GPU contains hundreds or thousands of compute cores; partitioning parallel tasks and choosing the computing granularity that best exploits the GPU's massive parallelism is an arduous and challenging task. Guided by the characteristics of GPGPU application input sets, this thesis proposes a segment-level relaxed parallelization method for input sets with chain-dependence structure, and a pixel-level mapping parallelization method for 2D input sets. Experimental results show that the two methods, at their different granularities, fully exploit the potential parallelism of GPGPU applications while remaining direct and simple to implement.
     3. Memory and transfer optimization based on data classification. The GPGPU architecture is a memory-bound high-performance processor architecture. To make effective use of its diverse memory resources, this thesis first proposes a data layout optimization based on classified storage, which explicitly assigns each class of data to the memory space that best exploits its characteristics so as to maximize memory access efficiency. It then proposes a pre-transformation-based transfer optimization for strided data structures. Experimental results show that these data-classification-based memory and transfer optimizations significantly improve GPGPU application performance.
     4. A load-balanced collaborative computing framework for compute-intensive applications. CPU+GPU heterogeneous systems often remain overloaded or underloaded for long periods. To fully utilize the computing resources of such systems, the proposed framework lets the CPU and GPU execute in parallel in a pipelined fashion, promotes the GPU to a data consumer or a producer of part of the data, and integrates optimizations such as zero-loading and cache loading to improve overall framework performance. Experimental results show that the framework significantly improves the resource utilization of GPU+CPU heterogeneous systems.
     To validate the feasibility and effectiveness of the ab-Stream programming model and its key supporting techniques, a prototype system, ab-Stream4G, was designed and implemented on the ab-Stream framework, incorporating the application mapping methods for many-threaded architectures, the memory optimization techniques for many-threaded architectures, and the load-balancing strategy for heterogeneous many-threaded systems. Experimental results show that ab-Stream4G runs correctly and efficiently.
GPGPU (General Purpose computing on Graphics Processing Units) has been widely applied to high performance computing. However, GPU architecture and programming model are different from those of traditional CPUs. Accordingly, it is rather challenging to develop efficient GPU applications. This thesis focuses on the key techniques of programming model and compiler optimization for many-core GPUs, and addresses a number of key theoretical and technical issues. The primary contributions and innovations are concluded as follows.
     1. We propose a many-threaded programming model. There is no universally accepted parallel programming model for multi-core and many-core processors. Accordingly, after studying stream-based and classical parallel programming models, we propose a many-threaded programming model, ab-Stream, which hides architectural differences and provides a programming model that is easy to parallelize, easy to program, easy to extend, and easy to tune.
     2. We propose parallelization approaches with hierarchical computing granularities to map GPGPU applications. There are hundreds of computing cores in a GPU, yet it is difficult to identify an appropriate computing granularity for mapping GPGPU applications so as to maximize GPU productivity. Guided by application inputs, we first propose a relaxed parallelization approach for GPU applications characterized by chain-dependence inputs, and then a pixel-level parallelization approach to map GPU applications with 2D inputs. Experimental results show that the proposed approaches are easy to implement and efficiently exploit the potential parallelism in GPGPU applications.
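The thesis's segment-level relaxed method itself is not reproduced here; the following is a minimal, hypothetical sketch of the general idea behind parallelizing a chain-dependent computation at segment granularity, using a prefix sum (y[i] = y[i-1] + x[i]) as the stand-in workload. Phase 1 breaks the dependence chain so segments can run concurrently (as GPU thread blocks would); phase 2 is a short fix-up across segment boundaries. Function and variable names are illustrative assumptions, not the thesis's API.

```python
# Hypothetical sketch: segment-level parallelization of a chain-dependent
# computation (prefix sum). Not the thesis's actual algorithm.

def segmented_prefix_sum(x, num_segments):
    n = len(x)
    seg = (n + num_segments - 1) // num_segments
    segments = [x[i:i + seg] for i in range(0, n, seg)]

    # Phase 1: each segment computes a local prefix sum independently;
    # this is the part that would run in parallel across GPU thread blocks.
    local = []
    for s in segments:
        acc, out = 0, []
        for v in s:
            acc += v
            out.append(acc)
        local.append(out)

    # Phase 2: a short sequential fix-up propagates segment totals,
    # resolving the chain dependence across segment boundaries.
    result, offset = [], 0
    for out in local:
        result.extend(v + offset for v in out)
        offset += out[-1]
    return result

data = list(range(1, 9))              # [1, 2, ..., 8]
print(segmented_prefix_sum(data, 3))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

For workloads whose cross-segment dependence cannot be fixed up exactly, a relaxed scheme would instead iterate phase 2 until boundary values converge, trading a few correction passes for segment-level parallelism.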
     3. We propose memory optimization and data transfer transformation based on data classification. The GPGPU architecture is a memory-bound, high-performance architecture. In order to effectively utilize the diverse GPU storage resources, we first propose a data layout optimization based on classified memory, and then propose TaT (Transfer after Transformed) for transferring strided data between the CPU and the GPU. Experimental results demonstrate that the proposed techniques significantly improve the performance of GPGPU applications.
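The core idea behind a transfer-after-transformation scheme can be sketched as follows: instead of issuing one small host-to-device copy per strided element, first gather the strided field into a contiguous staging buffer, then move it with a single bulk transfer. This is a hedged illustration only; the record layout, the field being extracted, and the `transfer` stand-in (which a real implementation would replace with something like `cudaMemcpy`) are assumptions, not the thesis's TaT implementation.

```python
# Hypothetical sketch of pre-transforming strided data before transfer.
import array

STRIDE = 4          # floats per record (e.g., x, y, z, mass) -- assumed layout
FIELD_OFFSET = 3    # suppose only the 4th field is needed on the GPU

def pack_strided(records, stride, offset):
    """Gather every stride-th element into a contiguous staging buffer."""
    return array.array('f', records[offset::stride])

def transfer(buf):
    """Stand-in for a real host-to-device bulk copy (e.g., cudaMemcpy)."""
    return bytes(buf)   # one contiguous transfer instead of many strided ones

records = array.array('f', [float(i) for i in range(16)])  # 4 records
staging = pack_strided(records, STRIDE, FIELD_OFFSET)
payload = transfer(staging)

print(list(staging))   # [3.0, 7.0, 11.0, 15.0]
```

The benefit on real hardware comes from coalescing: one large contiguous DMA transfer amortizes per-copy overhead that many small strided copies would pay repeatedly.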
     4. We propose a collaborative framework with load balancing for compute-intensive applications. Heterogeneous systems composed of CPUs and GPUs are often not load-balanced. In order to take full advantage of GPU+CPU heterogeneous systems, the proposed collaborative framework overlaps data transfer and computation in a pipelined mode. Additionally, optimization techniques including zero-loading and cache loading are integrated into the framework to maximize the performance of heterogeneous systems. Experimental results demonstrate that the proposed collaborative framework maximizes the utilization of heterogeneous systems.
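The pipelined overlap described above can be sketched as a double-buffered schedule: the input is split into chunks so that the transfer of chunk i+1 is issued while chunk i is being computed. The sketch below is a serial simulation that only records the interleaved event order; the stage names are illustrative assumptions, not the framework's actual API, and a real implementation would issue the transfers asynchronously (e.g., via CUDA streams).

```python
# Hypothetical sketch of a double-buffered transfer/compute pipeline.

def pipeline_schedule(num_chunks):
    """Return the event order for overlapping transfers with computation."""
    events = []
    events.append(("transfer", 0))          # prime the pipeline
    for i in range(num_chunks):
        if i + 1 < num_chunks:
            # issued asynchronously in practice: overlaps compute of chunk i
            events.append(("transfer", i + 1))
        events.append(("compute", i))
    return events

for stage, chunk in pipeline_schedule(3):
    print(stage, chunk)
```

With 3 chunks the schedule interleaves as transfer 0, transfer 1, compute 0, transfer 2, compute 1, compute 2: every transfer except the first is hidden behind a computation, which is the load-balancing payoff of the pipelined mode.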
     In order to validate the correctness and productivity of the ab-Stream programming model, we designed a prototype, ab-Stream4G, for CUDA-enabled GPUs based on the proposed techniques. Experimental results show that ab-Stream4G works correctly and efficiently.
