Research on Techniques for Many-Core GPU Architectures
Abstract
The demand of large-scale data-parallel applications for scalability, computational power, and memory bandwidth is driving high-performance microprocessors toward many-core architectures. As a novel many-core architecture, the modern GPU devotes the majority of its transistors to ALU resources, employs relatively simple control logic, and provides a highly efficient memory-bandwidth hierarchy. These prominent characteristics, dense on-chip compute units, efficient memory bandwidth, and a high performance/cost ratio, make modern GPUs well suited to general-purpose data-parallel computing and have given rise to a new research area: general-purpose computing on GPUs (GPGPU).
     Constrained by their architecture and programmability, earlier generations of GPUs were not widely deployed in general-purpose computing. High-level programming models (e.g., AMD/ATI Stream, NVIDIA CUDA, and OpenCL) have since reduced the complexity of GPU programming to some extent. To save design cost and preserve architectural scalability, modern GPUs typically employ a decentralized hardware design. Compared with CPU memory systems, GPU memory systems are optimized for throughput rather than latency. A GPU can keep thousands of threads in flight, using zero-cost hardware context switching between threads to tolerate memory latency; however, if a program contains many irregular memory accesses, large numbers of threads will stall on memory simultaneously, leaving the ALUs idle and wasting computational resources. The particular architecture of the GPU makes it hard for applications written in high-level programming models to fully exploit its computational power and memory bandwidth, so achieving high performance requires careful consideration of how to map an application efficiently onto the GPU hardware. Furthermore, the parallel programming model of GPUs differs significantly from the traditional sequential model, and the corresponding development and optimization methods differ accordingly. Because of the complexity of the underlying hardware, compilers do not optimize GPU programs adequately. To guide the efficient mapping of applications onto GPU architectures, this thesis investigates performance evaluation and optimization methods for many-core GPU architectures. The contributions are summarized as follows:
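
     As a concrete illustration of the coalescing issue described above (a minimal sketch for illustration, not code from the thesis), the CUDA fragment below contrasts a kernel in which consecutive threads read consecutive addresses, so the hardware merges each warp's loads into a few wide transactions, with one whose strided pattern scatters the addresses, multiplying the number of transactions and stalling the ALUs on memory:

```cuda
// Illustrative only: same amount of data moved, very different bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_coalesced(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i -> element i
    if (i < n) out[i] = in[i];                       // warp reads a contiguous span
}

__global__ void copy_strided(const float* __restrict__ in, float* out,
                             int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];  // scattered addresses: many
}                                                    // more memory transactions

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(in, out, n);      // fast: coalesced
    copy_strided<<<grid, block>>>(in, out, n, 33);    // slow: 33-element stride
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in); cudaFree(out);
    return 0;
}
```

     Timing the two launches (for example with cudaEvent timers or a profiler) makes the bandwidth gap described above directly visible.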
     (1) When an application is mapped onto a GPU architecture, many factors can degrade its performance; a quantitative performance model helps evaluate the actual execution performance of a specific application ported to the GPU. Because of the complexity of modern GPU architectures, traditional parallel computing models cannot be used to evaluate the performance of GPGPU programs. To predict the execution performance of a parallelized application and identify potential performance bottlenecks, this thesis proposes a quantitative performance model for GPU architectures. Built on an abstract GPU architecture and execution model, it accounts for the main factors that affect GPGPU kernel performance, such as coalesced global memory accesses, local memory bank conflicts, overlap of computation with memory access, branch divergence, and synchronization. By statically analyzing the application and combining the GPU's hardware parameters with a specific execution configuration, the model estimates the parallelized execution time without requiring the actual GPGPU kernel to be written. Experimental results show that the model estimates the execution time of applications on GPU architectures reasonably accurately.
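
     Since the abstract does not reproduce the model's equations, the following toy estimator is offered only to show the general shape of such analytical models: statically gathered kernel statistics are combined with hardware parameters, and memory latency is assumed to overlap with the computation of the other resident warps. All names and the overlap formula here are simplifications, not the thesis's actual model.

```cuda
// Toy analytical GPU performance estimator (plain C++; compiles with nvcc or g++).
#include <algorithm>
#include <cstdio>

struct KernelStats {         // obtained by static analysis of the kernel source
    double computeCycles;    // ALU cycles per thread
    double memTransactions;  // global-memory transactions per thread
    double divergenceFactor; // >1 when warps serialize divergent branches
    double syncCycles;       // cycles lost at synchronization barriers
};

struct GpuParams {           // taken from the hardware specification
    double memLatency;       // cycles per global-memory transaction
    int    activeWarps;      // resident warps available to hide latency
};

// The kernel is bound by computation when enough warps are resident to hide
// the memory latency; otherwise the exposed part of the latency is added.
double estimateCyclesPerThread(const KernelStats& k, const GpuParams& g) {
    double compute = k.computeCycles * k.divergenceFactor + k.syncCycles;
    double memory  = k.memTransactions * g.memLatency;
    double hidden  = std::min(memory, compute * (g.activeWarps - 1));
    return compute + (memory - hidden);
}

int main() {
    KernelStats k{400.0, 8.0, 1.25, 20.0};
    GpuParams   g{600.0, 24};
    printf("estimated cycles per thread: %.0f\n", estimateCyclesPerThread(k, g));
    return 0;
}
```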
     (2) In the GPU memory hierarchy, global memory is large but has long access latency, while fast memories (e.g., local memory) are quick to access but limited in size. Improving the data layout in global memory to reduce irregular accesses and making effective use of the on-chip fast memories are therefore crucial for reducing the overall memory access overhead of a GPGPU kernel. To exploit the memory bandwidth advantages of GPU architectures, this thesis proposes memory optimization methods based on the polyhedral model. A polyhedral representation of the source program is built, and the GPU's global and fast memories are optimized and allocated separately: by examining memory access patterns, vectorizable access instances are discovered and data-space transformations convert irregular access patterns into regular ones, improving off-chip memory bandwidth utilization; by detecting data reuse, data are allocated to the appropriate fast memory regions according to their access properties and the characteristics of the GPU memory hardware; finally, coordinate conversion and offset insertion are applied to the IMAGE memory object and the local memory, respectively, making the best use of the fast on-chip memories. Experimental results show that the optimized programs achieve speedups of 1.2x-8.4x over the unoptimized versions.
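
     Two of these optimizations, staging data in on-chip memory so that off-chip accesses become coalesced and inserting a padding offset to avoid local memory bank conflicts, can be seen in the standard tiled matrix transpose sketched below (an illustration, not the thesis's polyhedral implementation; CUDA shared memory plays the role of OpenCL local memory):

```cuda
// Tiled transpose: coalesced global reads/writes, conflict-free shared memory.
#include <cuda_runtime.h>

#define TILE 32

// Launch with dim3 block(TILE, TILE) and a grid covering the width x height
// matrix; the guards below handle dimensions that are not multiples of TILE.
__global__ void transposeTiled(const float* __restrict__ in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding offset: no bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)             // coalesced read of a tile row
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)             // coalesced write of the transposed tile
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

     Without the staging tile, either the reads or the writes would be column-wise and uncoalesced; without the +1 offset, a warp reading tile[threadIdx.x][...] would hit the same memory bank 32 times in a row.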
     (3) Loop and array constructs typically exhibit computational intensity and data parallelism, making them natural candidates for GPU kernels. In some applications, however, data dependences and control flow prevent them from executing efficiently on GPU architectures. Since GPU architectures emphasize both computational intensity and data parallelism, combining computation restructuring with data restructuring better exploits their performance potential. To this end, this thesis proposes program restructuring methods for GPU architectures. First, computation restructuring through loop fusion and splitting increases the exploitable parallelism, eliminates dependences between operations where possible, and raises the computational intensity of the generated GPU kernels, which helps hide memory access latency. Second, restructuring intra-thread and inter-thread data accesses removes irregular access patterns and reduces the number of memory accesses a kernel issues, improving its effective memory bandwidth. Finally, branch restructuring techniques such as conditional execution, branch simplification, and indirect indexing reduce the negative impact of branch divergence on performance. Experimental results show that the restructured programs achieve speedups of 1.18x-2.56x over the unrestructured versions.
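
     The branch restructuring can be made concrete with a small sketch (illustrative only; compilers may already predicate branches this small, but the transformation matters for larger branch bodies): threads of a warp that take different sides of a data-dependent if/else execute both sides serially, whereas an arithmetic selection keeps every lane active.

```cuda
// Divergent control flow versus arithmetic selection (conditional execution).
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)                     // data-dependent branch: the warp may split
        out[i] = in[i] * 2.0f;
    else
        out[i] = -in[i];
}

__global__ void predicated(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    out[i] = (v > 0.0f) ? v * 2.0f : -v;  // both values computed, no warp split
}
```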
     (4) Non-compute-intensive algorithms in data-parallel applications suffer from the memory wall, a problem that becomes even more pronounced when they are parallelized on GPUs. To mitigate the memory wall for memory-bound applications, this thesis designs a GPU-based parallel Smith-Waterman algorithm for biological sequence alignment: restructuring the computation flow and data dependences of the original Smith-Waterman algorithm exposes additional parallelism, and applying the GPU-oriented optimization methods above further improves the performance and efficiency of sequence alignment. Experimental results show that the optimized Smith-Waterman algorithm achieves a speedup of nearly 115x over the serial CPU implementation, effectively relieving the memory wall problem.
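
     For reference, the wavefront scheme on which GPU Smith-Waterman implementations are commonly built is sketched below, assuming a linear gap penalty (a minimal sketch of the basic parallelization, not the thesis's restructured and optimized algorithm): every cell of one anti-diagonal depends only on the two preceding diagonals, so all cells of a diagonal can be computed by independent threads.

```cuda
// One anti-diagonal of the Smith-Waterman matrix H, cells indexed by row i.
// The host loops d = 2 .. m+n, launching this kernel once per diagonal and
// rotating the three (m+1)-element buffers, all initialized to zero so that
// the H(i,0) = H(0,j) = 0 boundary conditions hold.
__global__ void swDiagonal(const char* a, const char* b, int m, int n,
                           int d,            // current anti-diagonal: i + j == d
                           const int* prev2, const int* prev1, int* curr,
                           int match, int mismatch, int gap) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + max(1, d - n);
    int j = d - i;
    if (i > m || j < 1) return;              // cell (i, j) lies outside the matrix

    int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
    int h = prev2[i - 1] + s;                // H(i-1, j-1) + substitution score
    h = max(h, prev1[i - 1] - gap);          // H(i-1, j) - gap
    h = max(h, prev1[i]     - gap);          // H(i, j-1) - gap
    curr[i] = max(h, 0);                     // local alignment: never negative
}
```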
     The performance evaluation model and optimization methods presented in this thesis offer useful guidance for developing general-purpose GPU applications and for building optimizing compilers targeting GPUs and other many-core architectures.
