GPU流式计算模型应用研究

英文题名：Research on GPU Stream Computing and Its Applications
作者：林江
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：图形处理器 ; 流式计算 ; 序列比对 ; 凸包生成
英文关键词：GPU ; Stream Computing ; Sequence Alignment ; Hull Generation
学位年度：2011
导师：唐敏 ; 童若锋
学科代码：081203
学位授予单位：浙江大学
论文提交日期：2011-01-20

摘要

当前市场主流处理器的发展趋势是多核化／众核化,即通过提高处理器核心数目保持计算性能的持续增长。最新的图形处理器已经能够提高兆级的FLOPS理论峰值,远远超出了市场主流多核CPU。本文以国家自然科学基金项目(60803054)、浙江省自然科学基金项目(Y1100069)和AMD-浙江大学合作项目为研究背景,针对流式计算模型及其应用展开研究,主要工作包括：
     1、在NVIDIA CUDA平台上实现了基因序列比对的分值计算部分。本文设计实现的Diamond Tiled Wavefront算法的效率能够达到传统的Tiled Wavefront算法的1.7倍,更充分的利用GPU的并行性,更快的返回两个序列串的局部最大匹配值。
     2、在NVIDIA CUDA平台上实现了基因序列比对的精确比对部分。本文设计实现的流式序列比对算法首次在GPU上实现精确返回各元素的位置匹配结果。
     3、在ATI Stream平台上实现了三维模型凸包生成算法。在GPU上解决了CPU代码中大量应用vector、queue、map数据结构的问题。同时本文也介绍了一些用于辅助或优化上述算法实现的通用流式算法。
     4、在NVIDIA GeForce GTX285和ATI Radeon 5870图形处理器上使用CUDA和OpenCL实现了以上算法,并使用一系列模型进行了测试。
     本文算法对于基于GPU的算法加速研究具有一定的通用意义,并能延伸到其他生物计算、几何处理等领域的相关问题。
The current trend of commodity processors is towards developing mulit-core/many-core processors. By increasing the number of processor cores, the peak performances are keeping high-speed improvement. The latest graphics processor units (GPUs) are capable of achieving tera FLOPS in theory, which is superior to the commodity multi-core CPUs. This paper focuses on the research of stream computing model on GPU and its applications. The research is supported in part by National Natural Science Foundation (60803054), Zhejiang Provincial Natural Science Foundation (Y1100069) and AMD-Zhejiang University cooperation project. The main contributions are:
     1. We achieved the score calculation of biological sequence alignment on NVIDIA CUDA. Designed and implemented a new parallel algorithm named Diamond Tiled Wavefront algorithm, which can achieve the efficiency 1.7 times the traditional Tiled Wavefront algorithm's, better utilize the GPU parallelism, and faster return the local maximum match value of two sequences.
     2. We achieved the accurate full alignment of biological sequence alignment on CUDA. As we know, the stream sequence alignment algorithm we designed and implemented was the first GPU parallel algorithm that returns the accurate full alignment results.
     3. We achieved a new parallel algorithm of 3D convex hull generation on ATI Stream. The algorithm implemented map, queue and vector data structures in GPU. We also introduced other stream algorithms that used to aid and to optimize the algorithms mentioned above.
     4. We achieved the above algorithms in NVIDIA GeForce GTX285 and ATI Radeon 5870 GPU, using CUDA and OpenCL separately. And a series of models were tested.
     Our algorithms are general accelerating techniques on GPU, and can be extended to other problems in bioinformatics and geometric processing fields.

引文

[1]J.Owens, D.Luebke, N.Govindaraju, et al. A Survey of General-Purpose Computation on Graphics Hardware[J]. Computer Graphics Forum,2007, 26(1):80-113
    [2]丁艺明,刘波.利用GPU进行高性能数据并行计算[J].程序员,2008(4)：97-99
    [3]NVIDIA CUDA Programming Guide[M]. Version 2.3, Santa Clara:NVIDIA Corporation,2009
    [4]ATI Stream Computing Programming Guide[M]. Version 2.2,2010
    [5]S.Needleman, C.Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins[J]. Journal of Molecular Biology,1970,48(3):443-53
    [6]T.Smith, M.Waterman. Identification of common molecular subsequences[J]. Journal of Molecular Biology,1981,147(1):195-197
    [7]W.Pearson, D.Lipman. Improved tools for biological sequence comparison[J]. Journal of National Academy of Sciences.1988,85(8):2444-2448
    [8]S.Altschul, Madden T L, Schaffer A A, et al. Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J]. Journal of Nucleic Acids Research,1997,25(17):3389-3402
    [9]A.Aji, W.Feng. Accelerating data-serial applications on data-parallel GPGPUs:A Systems Approach[R]. Virginia Tech,2008
    [10]林江,唐敏,童若锋.GPU加速的生物序列比对[J].计算机辅助设计与图形学学报,2010,22(3)：420-427
    [11]O.Gotoh. An improved algorithm for matching biological sequences[J]. Journal of Molecular Biology,1982,162(3):705-708
    [12]A.Aji, W.Feng, F.Blagojevic, et al. Cell-SWat:modeling and scheduling wavefront computations on the cell broadband engine [C]. Proceedings of the 5th Conference on Computing Frontiers. Ischia,2008:13-22
    [13]W.Martins, J.Cuvillo, F.Useche, et al. A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison[C]. Proceedings of Pacific Symposium on Biocomputing. Hawaii,2001:311-322
    [14]B.Alpern, L.Carter, K.Gatlin. Microparallelism and high performance protein matching[C]. Proceedings of the 1995 ACM/IEEE on Supercomputing Conference. San Diego,1995:3-8
    [15]A.Wozniak. Using video-oriented instructions to speed up sequence comparison[J]. Journal of Computer Applications in the Biosciences, 1997,13(2): 145-150
    [16]T.Rognes, E.Seeberg. Six-fold speed-up of Smith-Waterman Sequence database searches using parallel processing on common microprocessors [J]. Journal of Bioinformatics,2000,16(8):699-706
    [17]W.Pearson. Searching protein sequence libraries:comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms [J]. Journal of Genomics,1991,11(3):635-650
    [18]M.Farrar. Striped Smith-Waterman speeds database searches six times over other SIMD implementation[J]. Journal of Bioinformatics,2007,23(2):156-161
    [19]戴正华,张庆丹,徐琳,等.基于SSE2的Smith-Waterman算法[J].计算机工程与应用,2006,42(11)：89-91
    [20]W.Liu, B.Schmidt, G.Voss, et al. Bio-sequence database scanning on a GPU[C]. Proceedings of the 20th International Parallel and Distributed Processing Symposium. Rhodes Island,2006:1-8
    [21]Y.Liu, W.Huang, J.Johnson, et al. GPU accelerated Smith-Waterman[M]. Lecture Notes In Computer Science. Heidelberg:Springer,2006,3994:188-195
    [22]S.Manavski, G.Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment[J]. Journal of BMC Bioinformatics,2008,9(2):S10
    [23]Y.Liu, D.Maskell, B.Schmidt. CUDASW++:optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units [J]. BMC Research Notes 2009,2:73
    [24]Y.Liu, B.Schmidt, D.Maskell. CUDASW++2.0:enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions[J]. BMC Research Notes 2010,3:93
    [25]E.O.Sandes, A.Melo. CUDAlign:Using GPU to Accelerate the Comparison of Megabase Genomic Sequences[C]. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2010, Bangalore, India, January 9-14,2010
    [26]S.Xiao, W.Feng. Inter-block GPU communication via fast barrier synchronization[R]. Technical Report TR-09-19, Computer Science, Virginia Tech
    [27]M.Harris, S.Sengupta, J.Owens. Parallel prefix sum(scan) with CUDA[M]. In GPU Gem3, Nguyen H, (Ed.). Addison Wesley, Aug.2007
    [28]G.Blelloch. Prefix Sums and Their Applications[R]. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990
    [29]D.Hirschberg. A linear space algorithm for computing maximal common subsequences[J]. Communications of the ACM,1975,18(6):341-343
    [30]D.Powell, L.Allison, T.Dix. A versatile divide and conquer technique for optimal string alignment[J]. Inf Process Lett 1999,70:127-139
    [31]C.Barber, D.Dobkin, H.Huhdanpaa. The Quickhull Algorithm for Convex Hulls[J]. ACM Transactions on Mathematical Software,1996,22(4):469-483
    [32]M.Berg, M.Kreveld, M.Overmars, et al. Computational Geometry Algorithms and Applications(Second Edition)[M].2005,5:262-278
    [33]J.O'Rourke. Computational Geometry in C (2nd ed.)[M]. Cambridge University Press,1998
    [34]D.Chand, S.Kapur. An algorithm for convex polytopes[J]. ACM,1970, 17:78-86
    [35]K.Clarkson, P.Shor. Applications of random sampling in computational geometry Ⅱ[J]. Discrete Comput, Geom,1989,4:387-421
    [36]M.Berg, M.Kreveld, M.Overmars, et al. Computational Geometry:Algorithms and Applications [M]. Springer-Verlag, Heidelberg,1997
    [37]K.Mulmuley. Computational Geometry:An Introduction Through Randomized Algorithms[M]. Prentice-Hall, Englewood Cliffs, NJ,1994
    [38]F.Preparata, S.Hong. Convex Hulls of Finite Sets of Points in Two and Three Dimensions[J]. Communications of the ACM,1977,20(2):87-93
    [39]F.Preparata, M.Shames. Computational Geometry[M]. Springer, New York, 1985
    [40]Y.Zhang, Z.Zhang, Q.Chen. A New Randomized Parallel Dynamic Convex Hull Algorithm based on M2M model[C]. Proceedings of the International Conference on Convergence Technology and Information Convergence,2007
    [41]M.Golin, R.Sedgewick. Analysis of a Simple Yet Efficient Convex Hull Algorithm[C]. Proceedings of the fourth annual symposium on Computational Geometry,1988:153-163
    [42]M.Goodrich. Randomized Fully-Scalable BSP Techniques for Multi-Searching and Convex Hull Construction[C]. Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms,1997:767-776
    [43]N.Amato, F.Preparata. An NCI Parallel 3D Convex Hull Algorithm[C].9th Annual Computational Geometry, USA,1993
    [44]A.Day. Parallel implementation of 3D convex hull algorithm[J]. Comput-Aided,1991,23(3):177-188
    [45]N.Gupta, S.Sen. Faster output-sensitive parallel algorithms for 3D convex hulls and vector maxima[J]. Parallel Distrib, Comput,2003,63:488-500
    [46]S.Srungarapu, D.Reddy. Parallelizing two dimensional convex hull on NVIDIA GPU and Cell BE[OL]. http://www.hipc.org/hipc2009/documents/HIPCSS09Papers/1569255641.pdf

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700