Research on Key Technologies of Intra-Node Multi-CPU Multi-GPU Collaborative Parallel Rendering
Abstract
Parallel rendering decouples the rendering pass from the main loop of a unified program execution model, extends it into multiple independent graphics pipelines, and performs collaborative parallel rendering by dispatching rendering tasks in parallel. It is an effective technique for improving rendering performance on large-scale, complex scenes.
A parallel rendering system generally consists of multiple distributed rendering nodes, each of which typically uses CPUs as general-purpose computing units and GPUs as graphics coprocessors. In early parallel rendering systems, the data produced by a node's CPUs could hardly saturate even a single GPU, so each node was usually equipped with only one GPU. With advances in commodity multi-core processors and graphics hardware, today's rendering nodes can host multiple CPUs and multiple GPUs. Many studies and applications show that investigating intra-node collaborative parallel rendering, so as to fully exploit the combined computing power of the CPUs and GPUs within a node, is both an effective way to improve single-machine rendering efficiency and an important foundation for building efficient distributed parallel rendering systems for large-scale complex scenes.
Existing intra-node multi-CPU multi-GPU parallel rendering techniques do not fully account for the hardware architecture of the rendering node, so systems struggle to exploit the node's collaborative parallel rendering capacity. Aiming to exploit this capacity fully, and targeting the asymmetric compute and memory-access architecture of the CPUs and GPUs within a rendering node, this thesis studies an intra-node multi-CPU multi-GPU collaborative parallel rendering model and performance optimization methods for it under the sort-last parallel rendering mode. The main work and contributions are as follows:
(1) Existing intra-node parallel rendering models serially couple the hardware rendering stage with the composition-and-display stage, causing GPU stalls. To exploit a node's multi-core CPUs and improve its multi-GPU parallel rendering capability, a parallel hybrid rendering model for intra-node multi-core-CPU multi-GPU architectures is proposed. The model separates application event logic from rendering logic, ensuring easy configuration and scalability; it further combines CPU software rendering with GPU hardware rendering to decouple hardware rendering from image composition, and uses asynchronous DMA transfers to build a three-stage intra-node pipeline of rendering, readback, and composition, ensuring high efficiency. Theoretical analysis and experiments show that the model is easy to configure, scalable, and greatly improves intra-node parallel rendering performance.
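The three-stage rendering/readback/composition pipeline described above can be sketched as a software pipeline in which consecutive frames occupy different stages at the same time. The following is a minimal illustration in Python; the queues, thread-per-stage structure, and string "frames" are simplifications for exposition, not the thesis implementation:

```python
import threading, queue

def stage(name, fn, inq, outq):
    # Each pipeline stage runs in its own thread, mimicking how rendering,
    # DMA readback, and composition overlap across successive frames.
    def run():
        while True:
            item = inq.get()
            if item is None:       # poison pill: propagate shutdown downstream
                outq.put(None)
                break
            outq.put(fn(item))
    t = threading.Thread(target=run, name=name)
    t.start()
    return t

render_q, readback_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    stage("render",   lambda f: f + "-rendered", render_q,   readback_q),
    stage("readback", lambda f: f + "-readback", readback_q, done_q),
]

for frame in ["f0", "f1", "f2"]:
    render_q.put(frame)
render_q.put(None)

# The main thread plays the role of the composition stage.
results = []
while True:
    item = done_q.get()
    if item is None:
        break
    results.append(item + "-composited")
for t in threads:
    t.join()
print(results)
```

With real workloads, frame n can be composited while frame n+1 is being read back and frame n+2 is rendering, which is what removes the GPU stalls of the serially coupled design.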
(2) Existing CPU-side image composition within a node is inefficient and performs many redundant operations. A GPGPU-accelerated intra-node multi-GPU image composition method is proposed: a GPGPU pass generates an index list of the active pixels, completely avoiding redundant CPU-side composition work during intra-node multi-GPU composition. Theoretical analysis shows that, under ideal load balance, the method's speedup is the ratio of the image's active-pixel percentage to the number of GPUs in the node. Experiments show that with 4 GPUs per node, for high-resolution images with active-pixel ratios of 12%–76%, the method improves composition performance by a factor of 3–5 over the original method.
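Building an active-pixel index list is a stream-compaction problem, classically solved on a GPU with a flag/prefix-sum/scatter sequence. The sketch below spells out those three steps sequentially in plain Python (a real implementation would run each step as a data-parallel GPGPU kernel; the function name and background convention are illustrative):

```python
from itertools import accumulate

def active_pixel_indices(pixels, background=0):
    """Stream compaction: return the indices of non-background pixels."""
    flags = [1 if p != background else 0 for p in pixels]        # 1. flag active pixels
    offsets = [s - f for s, f in zip(accumulate(flags), flags)]  # 2. exclusive prefix sum
    out = [0] * sum(flags)
    for i, f in enumerate(flags):                                # 3. scatter surviving indices
        if f:
            out[offsets[i]] = i
    return out

frame = [0, 7, 0, 0, 3, 9, 0, 1]
print(active_pixel_indices(frame))   # indices of the four active pixels
```

Because the CPU composition loop then iterates only over this list, the work it does scales with the active-pixel count rather than the full image resolution.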
(3) Composition methods based on the intra-node CPU-GPU communication model incur heavy data-communication and computation overhead. A composition strategy based on intra-node P2P direct communication is proposed: it avoids massive data exchange between GPU and CPU, while efficiently exploiting the GPUs' high on-board communication bandwidth and powerful compute capability. Building on this strategy, an image composition method combining push-composition and pull-composition operations is proposed, which optimizes local and remote GPU-memory access efficiency during multi-GPU composition and lays a solid theoretical foundation for efficient parallel image composition algorithms. In addition, a bitmap-mask-based GPU-side composition optimization is proposed: a mask bitmap is generated from the image's active pixels, and set operations on the mask bitmaps of different GPUs quickly yield the mask of the overlap region, so composition operates only within active-pixel regions, effectively reducing both the amount of data transferred and the per-pixel composition tests. Experiments show that the bitmap-mask method improves composition efficiency by about 40%.
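The bitmap-mask idea can be sketched with Python integers serving as bit vectors, one bit per pixel. The mask intersection (a bitwise AND) yields exactly the pixels both GPUs rendered, i.e. the only positions that need a composition test; in a real system the masks live in GPU memory and the set operations run as parallel kernels. The names and the one-dimensional "image" below are simplifications:

```python
def mask_of(pixels, background=0):
    """Pack an active-pixel mask into an integer bitmap (bit i = pixel i)."""
    m = 0
    for i, p in enumerate(pixels):
        if p != background:
            m |= 1 << i
    return m

def overlap(mask_a, mask_b):
    """Set intersection of two masks: pixels rendered by both GPUs.
    Only these positions require a depth/alpha composition test."""
    return mask_a & mask_b

gpu0 = [0, 5, 5, 0, 0, 2]   # partial image from GPU 0
gpu1 = [3, 0, 4, 0, 8, 1]   # partial image from GPU 1
both = overlap(mask_of(gpu0), mask_of(gpu1))
composite_at = [i for i in range(6) if both >> i & 1]
print(composite_at)          # pixels where both images contribute
```

Pixels set in only one mask can be copied through directly, and pixels set in neither mask need not be transferred at all, which is where the transfer-volume savings come from.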
(4) The parallel rendering pipelines of existing frameworks struggle to exploit the performance of multi-CPU multi-GPU rendering nodes. A hierarchical inter-node sort-last parallel rendering framework oriented toward such nodes is designed and implemented. The framework organizes its rendering pipeline around hierarchical composition, dividing the system's GPUs into an intra-node level and an inter-node level, and selects an efficient composition communication model for each level according to the topology of its GPU interconnect; combined with the intra-node inactive-pixel culling algorithm, redundant image-data composition and transfer are eliminated. Experiments show that the framework effectively avoids inter-node transfer of inactive pixels and delivers high rendering and composition performance.
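The two-level composition can be sketched as nested reductions: each node first reduces its own GPUs' partial images over its fast intra-node links, then one image per node crosses the slower inter-node network. The "first non-zero wins" operator and four-pixel images below are simplified stand-ins for depth- or alpha-based compositing over real framebuffers:

```python
def composite(partials):
    """Reduce a list of partial images; 'first non-zero pixel wins'
    stands in for a depth/alpha composition operator."""
    out = list(partials[0])
    for img in partials[1:]:
        out = [a if a != 0 else b for a, b in zip(out, img)]
    return out

nodes = [
    [[1, 0, 0, 0], [0, 2, 0, 0]],   # node 0: partial images from two GPUs
    [[0, 0, 3, 0], [0, 0, 0, 4]],   # node 1: partial images from two GPUs
]
intra = [composite(gpus) for gpus in nodes]   # level 1: intra-node (P2P links)
final = composite(intra)                      # level 2: inter-node (network)
print(final)
```

Matching the communication pattern of each level to its interconnect (GPU P2P within the node, message passing between nodes) is the design choice that lets the hierarchy outperform a flat, single-level composition.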
