Research on Key Techniques of Heterogeneous Parallel Computing for Virtual Reality Simulation Platform

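English title: Research on Key Techniques of Heterogeneous Parallel Computing for Virtual Reality Simulation Platform
Author: Liu Shousheng (刘寿生)
Degree level: Doctoral
Discipline: Computer Application Technology
Chinese keywords: 并行计算 ; OpenCL ; CUDA ; 骨骼动画 ; 粒子系统 ; 虚拟现实
English keywords: Parallel Computing ; OpenCL ; CUDA ; Skeletal Animation ; Particle System ; Virtual Reality
Degree year: 2014
Advisor: Chen Ge (陈戈)
Discipline code: 081203
Degree-granting institution: Ocean University of China (中国海洋大学)
Submission date: 2014-05-31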

Abstract

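Traditional single-threaded algorithms written for single-core processors struggle to meet the real-time requirements of massive data processing and cannot fully exploit the computing potential of multi-core processors, so parallel computing has become a frontier technique for performance optimization. The need to process massive data quickly is especially pressing in fields with strict real-time requirements, such as multimedia and 3D graphics. This dissertation studies VRGIS, an integrated virtual reality and geographic information system simulation platform. The software can carry city-scale 3D model data and 3D terrain data, supports the simulation of a variety of complex natural phenomena and 3D visual effects, and offers high visual fidelity and real-time interaction. The dissertation investigates a range of parallel computing techniques to resolve the performance bottlenecks that arise in VRGIS during 3D simulation, with skeletal animation and particle systems as the representative cases.
     The main research content is as follows:
     1. Establishing a parallel computing performance evaluation model based on cross-evaluation of multiple technical solutions
     Building on Amdahl's law, this dissertation introduces a mechanism for cross-evaluating multiple parallel computing solutions against one another and uses it to refine the performance evaluation model. Of the five parallel computing solutions studied, the newest is OpenCL (Open Computing Language), which can target both the central processing unit (CPU) and the graphics processing unit (GPU). CPUs and GPUs also each have dedicated parallel computing techniques: OpenMP (Open Multi-Processing) and SSE (Streaming SIMD Extensions) target the CPU only, while GLSL (OpenGL Shading Language) and CUDA (Compute Unified Device Architecture) target the GPU only.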
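As a rough illustration of the two ingredients named here, the sketch below computes the Amdahl's law bound and a pairwise speedup table across schemes. The abstract does not define the cross-evaluation mechanism, so `cross_compare` is only one plausible reading of it; the function names and the timings are hypothetical and do not come from the dissertation.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def cross_compare(timings):
    """Speedup of every scheme over every other scheme.

    timings maps a scheme name to its measured run time in seconds.
    result[(a, b)] is how many times faster scheme b is than scheme a.
    """
    return {
        (a, b): timings[a] / timings[b]
        for a in timings
        for b in timings
        if a != b
    }


# Hypothetical timings in seconds, invented for illustration only.
example_timings = {"serial": 8.0, "sse_openmp": 4.0, "opencl_cpu": 2.0}
pairwise = cross_compare(example_timings)
```

Comparing each scheme against several peers, and not only against the serial baseline, makes it visible when a new technique beats the serial code but barely improves on an existing parallel version.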
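     2. Designing multiple parallel computing schemes for the matrix palette algorithm of skeletal animation
     Existing parallel schemes for skeletal animation, including those based on SSE and GLSL, are implemented and improved; new schemes are proposed using emerging techniques including CUDA and OpenCL; and the techniques are compared and analyzed. On top of these schemes, an adaptive selection strategy is designed that automatically picks the best scheme on parallel hardware of differing performance.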
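The abstract does not spell out the matrix palette algorithm itself. The sketch below shows the textbook serial form of matrix palette skinning (each vertex is a weighted blend of its position transformed by a few bone matrices looked up in a palette), which is presumably the baseline that the parallel schemes accelerate. The data layout and function names are assumptions made for illustration.

```python
def transform_point(matrix, point):
    """Apply a 3x4 affine matrix (three rows of [a, b, c, translation]) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix)


def skin_vertex(palette, position, bone_indices, weights):
    """Blend the vertex position transformed by each influencing bone matrix."""
    blended = [0.0, 0.0, 0.0]
    for index, weight in zip(bone_indices, weights):
        moved = transform_point(palette[index], position)
        for axis in range(3):
            blended[axis] += weight * moved[axis]
    return tuple(blended)


def skin_mesh(palette, vertices):
    """Skin every vertex; each vertex is (position, bone_indices, weights).

    Vertices are independent of one another, which is what makes the loop
    a natural fit for SIMD lanes, CPU threads, or one GPU work item per vertex.
    """
    return [skin_vertex(palette, *vertex) for vertex in vertices]


# Two-bone palette: identity, and a translation by +2 along x.
IDENTITY = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
SHIFT_X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
skinned = skin_mesh([IDENTITY, SHIFT_X], [((1.0, 1.0, 1.0), (0, 1), (0.5, 0.5))])
```

With equal weights on the two bones, the sample vertex moves halfway along the translation.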
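     3. Designing an OpenCL-based parallel computing scheme for a fountain particle system with Perlin-noise wind-field disturbance
     To make the fountain particle system more realistic, a Perlin-noise random factor is introduced to simulate wind-field disturbance. The complex computation required during dynamic simulation severely degrades real-time performance, so this dissertation adopts OpenCL-based parallel computing and proposes a performance improvement scheme for the particle system that targets both the CPU and the GPU.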
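To make the idea concrete, the following is a deliberately simplified serial sketch: a one-dimensional Perlin-style gradient noise drives a horizontal wind force on fountain particles. The dissertation's actual noise dimensionality, wind model, integrator, and parameters are not given in the abstract, so everything here (the 1D noise, the sampling coordinate, the constants, and the names) is an assumption.

```python
import math
import random


def _fade(t):
    """Perlin's quintic smoothing curve 6t^5 - 15t^4 + 10t^3."""
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)


def make_noise1d(seed=0, size=256):
    """Build a 1D gradient-noise function with random slopes at integer lattice points."""
    rng = random.Random(seed)
    gradients = [rng.uniform(-1.0, 1.0) for _ in range(size)]

    def noise(x):
        cell = math.floor(x)
        t = x - cell
        left = gradients[cell % size] * t                # contribution of left lattice point
        right = gradients[(cell + 1) % size] * (t - 1.0)  # contribution of right lattice point
        return left + (right - left) * _fade(t)

    return noise


def step_particles(particles, noise, time, dt, gravity=-9.8, wind_strength=2.0):
    """Advance particles by one semi-implicit Euler step under gravity and noisy wind.

    Each particle is a dict with 'pos' and 'vel' lists of three floats.
    No particle reads another particle's state, so each update could be
    assigned to its own work item in a data-parallel implementation.
    """
    for particle in particles:
        height = particle["pos"][1]
        wind = wind_strength * noise(0.5 * height + time)  # assumed sampling coordinate
        particle["vel"][0] += wind * dt
        particle["vel"][1] += gravity * dt
        for axis in range(3):
            particle["pos"][axis] += particle["vel"][axis] * dt
```

The per-particle noise evaluation is the expensive part that the abstract identifies as the threat to real-time performance.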
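     4. Constructing mapping principles between multiple parallel computing tasks and multiple parallel computing devices
     When several modules perform parallel computation at the same time, the heterogeneous parallel computing capability of multiple devices such as the CPU and GPU should be fully exploited. Having first isolated and independently resolved the two major bottlenecks inside VRGIS, skeletal animation and the particle system, the dissertation recombines the two modules into a multi-task system and studies the mapping principles and execution schemes between multiple parallel tasks and multiple parallel devices on a virtual reality simulation platform.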
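     The innovations of this dissertation lie mainly in the following three aspects:
     1. An OpenCL-based matrix palette algorithm for skeletal animation targeting heterogeneous CPU and GPU systems. Functional innovation: the portability of the GPU matrix palette algorithm is improved; the earlier CUDA-based algorithm depends on specific GPUs, whereas the OpenCL-based GPU algorithm applies broadly to a variety of GPUs. Performance innovation: the CPU-targeted OpenCL algorithm achieves speedups of 3.9 and 1.5 relative to the serial CPU algorithm and the traditional parallel algorithm combining SSE with OpenMP, respectively.
     2. An automatic tuning algorithm across multiple parallel schemes for skeletal animation. Functional innovation: five parallel optimization schemes are designed for the matrix palette algorithm, together with an algorithm that automatically selects the best one. The newest scheme is based on OpenCL, which can target both the CPU and the GPU; dedicated schemes are also designed, with OpenMP and SSE targeting the CPU and GLSL and CUDA targeting the GPU. Performance innovation: across all the different CPU and GPU configurations, the feasible scheme with the best performance is found automatically.
     3. A dynamic mapping and load-balancing strategy between multi-granularity tasks and heterogeneous parallel devices. Functional innovation: a second OpenCL-based parallel task targeting heterogeneous CPU and GPU systems is designed first, namely a fountain particle system with Perlin-noise wind-field disturbance. An existing CUDA-based parallel Perlin noise algorithm is ported to OpenCL, which removes the hardware restriction and improves the portability and generality of the parallel Perlin noise algorithm. Combining the fountain particle system with the skeletal animation above, mapping principles between multiple tasks and heterogeneous parallel devices are designed. Performance innovation: because the CPU and GPU differ in how much OpenCL accelerates each task, parallel tasks are allocated according to speedup coefficients, and performance improves by reducing device waiting time.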
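One simple way to read "allocating by speedup coefficient" is a proportional split: each device receives a share of the work items proportional to its speedup, so that all devices finish at about the same time and none sits idle. The sketch below is an assumption about that idea, not the dissertation's algorithm. The sample speedups reuse the particle-system figures reported in this record (3.4 on the CPU, 65 on the GPU) purely as inputs, and the function names are hypothetical.

```python
def split_by_speedup(total_items, speedups):
    """Divide total_items among devices in proportion to each device's speedup."""
    total_speedup = sum(speedups.values())
    shares = {
        name: int(total_items * speedup / total_speedup)
        for name, speedup in speedups.items()
    }
    # Rounding down can leave a few items unassigned; give them to the fastest device.
    leftover = total_items - sum(shares.values())
    fastest = max(speedups, key=speedups.get)
    shares[fastest] += leftover
    return shares


def finish_times(shares, speedups, serial_time_per_item=1.0):
    """Estimated completion time per device, in units of serial time per item."""
    return {
        name: shares[name] * serial_time_per_item / speedups[name]
        for name in shares
    }


sample_speedups = {"cpu": 3.4, "gpu": 65.0}
sample_shares = split_by_speedup(10000, sample_speedups)
sample_times = finish_times(sample_shares, sample_speedups)
```

With these inputs the two devices finish within a fraction of one item-time of each other, which is the waiting time a proportional split is meant to minimize.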
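     By studying multiple parallel computing techniques in the context of the skeletal animation and particle system modules of a virtual reality simulation platform, this dissertation offers effective decision support when developers must decide whether to parallelize an existing serial algorithm, whether to follow a new parallel computing technique and port an existing parallel algorithm to it, and whether to upgrade parallel computing hardware.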
Traditional single-threaded algorithms for single-core processors struggle to meet the real-time requirements of massive data processing and cannot fully exploit the computing potential of multi-core processors. Parallel computing has therefore moved to the forefront of performance optimization techniques. The demand for rapid processing of massive data is particularly urgent in areas with strict real-time requirements, such as multimedia and three-dimensional graphics. This dissertation targets VRGIS, a simulation platform that integrates virtual reality with a geographic information system. The software can carry three-dimensional city-scale model data and three-dimensional terrain data. VRGIS also supports the simulation of a variety of complex natural phenomena and visualization effects, and offers high picture fidelity and real-time interaction. In this dissertation, various parallel computing techniques are studied to resolve the performance bottlenecks of VRGIS in three-dimensional simulation, represented by skeletal animation and particle systems.
     The major research work includes the following four aspects:
     1. To establish a performance evaluation model of parallel computing by comparison among multiple peer technique solutions
     On the theoretical basis of Amdahl's law, this dissertation introduces an assessment mechanism that compares multiple parallel computing technique solutions, in order to improve the performance evaluation model for parallel computing. Among the five parallel computing technique solutions studied, the latest is Open Computing Language (OpenCL), which can be used for both central processing units (CPUs) and graphics processing units (GPUs). CPUs and GPUs also have their own dedicated parallel computing techniques: Open Multi-Processing (OpenMP) and Streaming SIMD Extensions (SSE) are oriented to the CPU, while OpenGL Shading Language (GLSL) and Compute Unified Device Architecture (CUDA) are oriented to the GPU.
     2. To design multiple parallel computing schemes for the matrix palette algorithm of skeletal animation
     Existing parallel computing schemes based on SSE and GLSL are implemented and improved for skeletal animation. New parallel computing schemes for skeletal animation are produced with newer techniques such as CUDA and OpenCL, and the schemes built on different techniques are compared and analyzed. Based on these multiple schemes, we design a strategy to auto-select the best parallel scheme for skeletal animation.
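The abstract does not say how the auto-selection works. One common approach, sketched here as an assumption, is to run a short benchmark of every scheme on the target machine, skip any scheme that fails because the hardware or driver is unavailable, and keep the fastest. The scheme names and the `pick_fastest` helper are hypothetical.

```python
import time


def pick_fastest(schemes, workload, repeats=3, timer=time.perf_counter):
    """Return (name, seconds) for the fastest scheme that runs on this machine.

    schemes maps a name to a callable taking the workload. A callable that
    raises is treated as unavailable here (for example, a CUDA path on a
    machine without an NVIDIA GPU) and is skipped.
    """
    best_name, best_seconds = None, float("inf")
    for name, run in schemes.items():
        try:
            seconds = float("inf")
            for _ in range(repeats):
                start = timer()
                run(workload)
                seconds = min(seconds, timer() - start)  # keep the best of several runs
        except Exception:
            continue
        if seconds < best_seconds:
            best_name, best_seconds = name, seconds
    if best_name is None:
        raise RuntimeError("no scheme could run on this machine")
    return best_name, best_seconds
```

The `timer` parameter exists only so the selection logic can be exercised with a deterministic clock; in normal use the default wall-clock timer applies.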
     3. To design a parallel computing scheme based on OpenCL for fountain simulation with Perlin noise wind disturbance by particle system
     In order to improve the fidelity of the simulation results, a random disturbance factor based on Perlin noise is introduced. A parallel computing scheme based on OpenCL is used to improve the performance of the particle system running on both CPU and GPU.
     4. To construct a principle to map multiple parallel tasks to multiple parallel devices
     When multiple tasks run at the same time, the aim is to fully exploit the heterogeneous parallel computing capabilities of multiple devices such as the CPU and GPU. The two separate modules, skeletal animation and particle system, are therefore treated as a single multi-task system, and a principle for mapping multiple parallel tasks to multiple parallel devices is studied to further optimize the performance of the virtual reality simulation platform.
     The innovation of this dissertation is mainly reflected in three aspects:
     1. To propose the matrix palette algorithm for skeletal animation based on OpenCL for heterogeneous platforms mixing CPU and GPU. Function innovation: the portability of the matrix palette algorithm for skeletal animation is improved by moving from CUDA to OpenCL, because the original CUDA-based algorithm depends on particular NVIDIA GPUs, while the OpenCL-based algorithm is generally applicable to a variety of GPUs. Performance innovation: measured against two reference baselines on the CPU, the serial algorithm and the traditional parallel algorithm using SSE and OpenMP, the OpenCL version of the matrix palette algorithm on the CPU achieves speedups of 3.9 and 1.5, respectively.
     2. To design a strategy to auto-select the best parallel scheme for skeletal animation. Function innovation: we design five parallel computing schemes for skeletal animation. The latest scheme is based on OpenCL, which can be used for both CPUs and GPUs. CPUs and GPUs also have their own dedicated parallel computing techniques: OpenMP and SSE are oriented to the CPU, while GLSL and CUDA are oriented to the GPU. Performance innovation: on different GPUs and CPUs, the available and best-performing parallel scheme for skeletal animation is selected automatically.
     3. To design a parallel computing scheme based on OpenCL for fountain simulation with Perlin noise wind disturbance by particle system for heterogeneous platforms mixing CPU and GPU. Performance innovation: with the serial algorithm on the CPU as the reference baseline, OpenCL achieves a speedup of 3.4 on the CPU and 65 on the GPU.
     Multiple parallel computing techniques are studied to accelerate the skeletal animation and particle system involved in virtual reality. Researchers and developers can draw useful evidence and guidance from this dissertation when making the following decisions: whether to parallelize a serial algorithm, whether to port an outdated parallel algorithm to the latest technique, and whether to upgrade the parallel hardware device.
