Research on GPU-Based Program Analysis and Parallelization
Abstract
High-performance computers are a comprehensive expression of a nation's economic and technological strength, an important tool for promoting economic and technological development, social progress, and national defense, and a strategic high ground that countries compete for. While cost-effective parallel computer systems are being pursued, dedicated computing components in many specialized fields also deliver powerful parallel computing capability. The graphics processing unit (GPU) is one such dedicated accelerator used for general-purpose computation. With the development of microelectronics, the GPU has far surpassed the general-purpose processor in both integration density and data-processing capability; its programmability, parallel-processing capability, and range of applications continue to improve and expand, making it a high-performance processing component of current computer systems.
     At present, research on GPU-based parallelization at home and abroad generally starts from an existing serial program, which is rewritten by specialists familiar with the GPU hardware architecture. However, because of the various overheads introduced by parallelization, the parallel version may run no faster than the original serial program, which wastes both manpower and resources. It is therefore particularly important to analyze a serial program properly and to evaluate, in advance, how efficiently it would execute on the GPU after parallelization.
     This thesis studies how to evaluate the execution efficiency of serial programs after they are parallelized on the GPU. The main research contents are as follows:
     1. The multithreaded hardware architecture and programming model of CUDA-capable GPUs are studied. Building on an analysis of the current state of high-performance computing and general-purpose computation on GPUs, the thesis explains the advantages of the GPU for general-purpose computation and examines the hardware architecture and programming model of the graphics processor in depth, laying the theoretical foundation for the cost model.
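     To make the CUDA programming model concrete, the sketch below (illustrative only, not code from the thesis; the kernel and variable names are assumptions) shows the typical structure of a CUDA host program: allocate device memory, copy the input to the device, launch a kernel across many threads, and copy the result back. These steps correspond directly to the overheads that the cost model in item 3 accounts for.

    /* Minimal illustrative CUDA example (not from the thesis). */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *a, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
        if (i < n) a[i] *= k;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        float *d = NULL;
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaMalloc((void **)&d, bytes);                   /* device startup / allocation overhead */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* host-to-device transfer overhead */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      /* GPU execution overhead */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* device-to-host transfer overhead */
        cudaFree(d);

        printf("a[0] = %f\n", h[0]);
        free(h);
        return 0;
    }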
     2. To compute the workload of a loop body precisely, the thesis studies traditional data-dependence analysis methods in depth and proposes an improved method for computing the iteration count of loops whose upper and lower bounds are not fixed, a case that SUIF cannot handle accurately.
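     As an illustration (the notation is mine, not the thesis's), the iteration count of a counted loop with lower bound L, upper bound U, and step S > 0 is

    N(L, U, S) = \max\bigl(0,\ \lfloor (U - L)/S \rfloor + 1\bigr)

     When L or U is not a compile-time constant, N remains a symbolic expression in those variables, so the loop-body workload must be expressed or bounded in terms of them rather than read off as a fixed trip count; this is the situation the improved method addresses.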
     3. To predict the execution efficiency of a serial program after it is parallelized on the GPU, the thesis proposes a GPU parallel cost model based on the CUDA architecture. The model takes the main parallelization overheads into account: device startup, data transfer, and GPU execution. With this model, the time cost of accelerating a serial program on the GPU can be predicted and compared with the cost of serial execution, so as to decide whether the program is worth offloading to the GPU and thereby guide the parallelization of serial programs.
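     In symbols (the notation is illustrative rather than the thesis's own), the decision rule can be summarized as

    T_{\mathrm{GPU}} = T_{\mathrm{startup}} + T_{\mathrm{transfer}} + T_{\mathrm{exec}}, \qquad \text{offload to the GPU only if } T_{\mathrm{GPU}} < T_{\mathrm{serial}}

     where T_transfer covers the host-to-device and device-to-host copies, T_exec is the kernel execution time on the GPU, and T_serial is the running time of the original serial code.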
