Research on Key Techniques of Parallel Optimization for GPU Computing Platforms
Abstract
More and more application developers are adopting GPUs as performance accelerators because of their growing computing power and programmability. However, without careful optimization it is hard to achieve the desired performance, because the burden of optimization has shifted from hardware designers to application developers. Unfortunately, performance optimization of GPU programs is very difficult: its essence is to map algorithm characteristics efficiently onto the underlying hardware features. On the one hand, this process requires deep technical knowledge of the underlying hardware, and the growing diversity of modern GPU architectures further exacerbates an already difficult task. On the other hand, the characteristics of applications ported to GPUs are also becoming increasingly diverse; overall, these applications fall into two categories: regular applications and irregular applications. Optimization methods and strategies differ greatly across programs and hardware platforms. To simplify the optimization of GPU programs and enable application developers to write high-performance GPU programs more easily, we divide our work into two parts according to these different application characteristics:
     For regular applications, we propose the concept of a performance optimization chain and, based on the differences between GPU computation and memory access, divide it into two categories: threshold optimization chains and tradeoff optimization chains. By introducing the Roofline model, we visualize the optimization chain and establish an insightful, platform-specific performance model for guiding optimization on GPUs: GPURoofline. The model provides performance information to identify the performance bottlenecks of a GPU program on a given hardware platform and to decide which optimization methods and strategies should be adopted. It helps programmers, especially non-experts with limited knowledge of GPU architectures, to implement high-performance GPU kernels more easily. We demonstrate the usability and correctness of GPURoofline by optimizing three representative GPU kernels with different compute intensities and program characteristics.
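The core comparison behind this classification, algorithm compute intensity versus hardware compute intensity, together with the Roofline performance ceiling, can be sketched as follows. The peak numbers and function names here are illustrative assumptions, not values from the thesis:

```python
# Roofline-style classification sketch. Peak GFLOP/s and bandwidth
# below are assumed illustrative figures, not measured GPU values.

def attainable_gflops(peak_gflops, peak_bandwidth_gbs, intensity):
    """Roofline model: attainable performance is capped either by peak
    compute or by off-chip bandwidth times intensity, whichever is lower."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity)

def classify_kernel(algorithm_intensity, peak_gflops, peak_bandwidth_gbs):
    """Compare the algorithm's compute intensity (flops per byte moved)
    against the hardware compute intensity (peak flops / peak bandwidth)."""
    hardware_intensity = peak_gflops / peak_bandwidth_gbs
    if algorithm_intensity >= hardware_intensity:
        return "compute-bound"
    return "memory-bound"

# Illustrative GPU: 1000 GFLOP/s peak compute, 100 GB/s off-chip bandwidth.
# A kernel at 2 flops/byte sits left of the ridge point (10 flops/byte),
# so it is memory-bound with a 200 GFLOP/s ceiling.
print(classify_kernel(2.0, 1000.0, 100.0))    # memory-bound
print(attainable_gflops(1000.0, 100.0, 2.0))  # 200.0
```

Kernels below the ridge point benefit first from bandwidth-oriented optimizations; kernels above it from compute-oriented ones, which is the distinction the optimization chains encode.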
     For irregular applications, we take the Viola-Jones face detection algorithm as an example to introduce five key techniques for implementing and optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queues, and global queues. By defining and extracting performance parameters, we also build a preliminary tunable GPU kernel and use it to achieve performance portability of the Viola-Jones face detection algorithm across different GPU platforms. We demonstrate the performance of our implementation by comparing it with a well-optimized CPU version from the OpenCV library. Experimental results show speedups of 5.19~27.724, 6.468~35.080, and 5.850~28.768 on the AMD HD5850, AMD HD7970, and NVIDIA C2050 GPUs, respectively.
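How the persistent-worker and local/global-queue techniques cooperate can be sketched on the CPU side. On a real GPU this loop would live inside one long-running OpenCL/CUDA kernel using atomic queue operations; the toy cascade rule and all names below are illustrative assumptions, not the thesis's actual implementation:

```python
# CPU-side sketch of the persistent-thread + queue pattern for irregular
# work: a worker loops until all queues drain, keeping dynamically
# generated items in a small local queue and spilling overflow to the
# global queue. Capacities and the cascade below are made-up examples.
from collections import deque

LOCAL_CAPACITY = 4  # assumed per-workgroup local-queue size

def persistent_worker(global_queue, process):
    """Drain the global queue; `process(item)` returns (done, new_items),
    modeling a cascade stage that either finishes, rejects, or emits work."""
    results, local_queue = [], deque()
    while global_queue or local_queue:
        # prefer local work for locality, fall back to the global queue
        item = local_queue.popleft() if local_queue else global_queue.popleft()
        done, new_items = process(item)
        if done:
            results.append(item)
        for n in new_items:
            target = local_queue if len(local_queue) < LOCAL_CAPACITY else global_queue
            target.append(n)
    return results

def cascade_stage(item):
    """Toy stand-in for a Viola-Jones cascade stage: a window that
    reaches stage 3 counts as a detection; others are rejected or
    re-enqueued for the next stage."""
    window, stage = item
    if stage == 3:
        return True, []                   # survived every stage
    if window % (stage + 2) == 0:
        return False, []                  # rejected at this stage
    return False, [(window, stage + 1)]   # advance to the next stage

detections = persistent_worker(deque((w, 0) for w in range(1, 8)), cascade_stage)
```

Keeping follow-on work in the local queue is what lets a persistent kernel avoid relaunch overhead and load-imbalance from early-rejected windows, which is the point of these five techniques for irregular workloads.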
     In summary, our key contributions are as follows:
     1. Analysis and comparison of the similarities and differences among current mainstream GPU architectures. We propose three effective ways to improve GPU program performance: improving off-chip memory bandwidth utilization, improving computing-resource utilization, and improving data locality.
     2. Definitions of algorithm compute intensity and hardware compute intensity. By comparing these two quantities, we classify GPU kernels as either memory-bound or compute-bound. We further build platform-specific performance optimization chains and, based on the differences between memory-access and computation optimizations, divide them into two categories: threshold optimization chains and tradeoff optimization chains.
     3. GPURoofline: an empirical and insightful performance model for guiding performance optimization. By introducing the Roofline model, we visualize the performance optimization chain so that it can guide GPU program optimization in a more intuitive way.
     4. We introduce five key techniques for implementing and optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queues, and global queues, and we demonstrate their use by implementing and optimizing the Viola-Jones face detection algorithm. Finally, by defining and extracting performance parameters, we build a preliminary tunable GPU kernel, verifying the feasibility of performance portability across different GPU platforms.
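The tunable-kernel idea, exposing performance parameters and then searching for the best configuration per platform, can be sketched as a small auto-tuning loop. The parameter names and the stand-in timing function are assumptions for illustration, not the thesis's actual parameters:

```python
# Exhaustive auto-tuning sketch: enumerate every combination of the
# exposed performance parameters, time each one, keep the fastest.
# In practice the timer would launch and time the real GPU kernel;
# here a toy cost model stands in.
import itertools

def autotune(time_config, search_space):
    """Return (best_config, best_time) over the Cartesian product of
    the candidate values in `search_space`."""
    best_cfg, best_t = None, float("inf")
    names = list(search_space)
    for vals in itertools.product(*(search_space[n] for n in names)):
        cfg = dict(zip(names, vals))
        t = time_config(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

def fake_timer(cfg):
    # toy cost model with its optimum at (workgroup_size=128, items_per_thread=4)
    return abs(cfg["workgroup_size"] - 128) + abs(cfg["items_per_thread"] - 4)

space = {"workgroup_size": [64, 128, 256], "items_per_thread": [1, 2, 4, 8]}
best, t = autotune(fake_timer, space)
```

Re-running the same search with each platform's own timer is what yields a per-platform best configuration, i.e. performance portability without rewriting the kernel.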
