Research on Virtualization Methods for GPU General-Purpose Computing
Abstract
As a key infrastructure of cloud computing, system-level virtual machine technology is one of the current research hotspots in computer architecture. System-level virtualization has successfully abstracted many physical devices, such as network cards, hard disks, and memory, into data structures residing in memory or on disk, but the GPU (Graphics Processing Unit) has never been fully virtualized. A major reason is that GPUs lack a unified hardware interface and an open architectural specification. In practice, academia and industry have chosen to virtualize at a higher layer of the protocol stack, the user-level API (Application Programming Interface) layer, and some virtualization efforts targeting traditional graphics APIs have produced preliminary results. CUDA (Compute Unified Device Architecture) is a new GPU programming API that focuses on general-purpose computation rather than graphics and is the de facto industry standard for GPGPU. It gives programmers direct control over the GPU for parallel computation, without relying on traditional graphics APIs such as OpenGL. The emergence of the CUDA framework poses a new problem for system-level virtualization: existing virtualization tools for graphics APIs are of no help in virtualizing CUDA, so applications running inside virtual machines cannot call the CUDA API and therefore cannot exploit the GPU's parallel acceleration capability.
To address this problem, this thesis proposes and implements the first virtualization solution targeting GPU general-purpose computing: vCUDA (virtual CUDA). vCUDA allows applications inside a virtual machine to access general-purpose computing resources located outside the VM, providing them with GPGPU capability, which is of great significance for accelerating high-performance computing in virtualized environments. The design and implementation of vCUDA consist of four main parts: API remoting, lazy RPC, a VM-dedicated RPC system called VMRPC (Virtual Machine Remote Procedure Call), and support for advanced virtual machine features.
API remoting intercepts API calls at the user level, forwards their parameters to a remote server, performs the actual computation on the server side, and finally returns the results to the caller. Because the interception and redirection happen inside a dynamic library, vCUDA virtualizes CUDA programs at run time: no source-code modification, recompilation, or operating-system change is required, which achieves full binary compatibility and complete transparency to the programmer. Tests on NVIDIA's official sample programs and third-party CUDA applications show that vCUDA implements the full CUDA functionality inside the VM and faithfully reproduces CUDA's internal semantics on the remote server; all experiments produced the same results as in the non-virtualized environment.
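The interception idea can be pictured with a short C sketch, assuming an LD_PRELOAD-style shim library placed in front of the CUDA runtime; the forward_rpc transport below is a hypothetical placeholder rather than vCUDA's actual client stub:

    #include <stddef.h>
    #include <stdio.h>

    typedef int cudaError_t;   /* stands in for the runtime's enum in this sketch */

    struct malloc_args { void **devPtr; size_t size; };

    /* Hypothetical client-side transport: a real implementation would marshal
     * the call ID and arguments and ship them to the stub in the VM/host that
     * owns the physical GPU. */
    static cudaError_t forward_rpc(const char *api, const void *args, size_t len)
    {
        (void)args; (void)len;
        fprintf(stderr, "forwarding %s to the server side\n", api);
        return 0;                               /* cudaSuccess */
    }

    /* Exported with the same name and signature as the real CUDA runtime entry
     * point, so an unmodified binary linked against libcudart resolves to this
     * shim when it is preloaded -- no recompilation, matching the transparency
     * claim above. */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        struct malloc_args a = { devPtr, size };
        /* the actual allocation happens on the side that owns the GPU */
        return forward_rpc("cudaMalloc", &a, sizeof a);
    }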
Because a CUDA application may issue thousands of API calls, sending one RPC per call would incur enormous performance overhead. Borrowing the lazy-update optimization used in graphics virtualization projects, vCUDA adopts a lazy RPC strategy: it defers RPCs as long as possible, accumulates, filters, and merges upper-layer calls, chooses an appropriate moment to transmit, and ships a batch of consecutive API calls in a single RPC, which markedly improves system performance. Experiments show that lazy RPC reduces the number of RPCs to 30% of the original and improves vCUDA's virtualization performance by up to 148%.
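A minimal sketch of this batching behavior, under the assumption that intercepted calls are queued until a synchronizing call (or a full buffer) forces a flush; the queue layout, illustrative API IDs, and the commented send hook are hypothetical simplifications:

    #include <stddef.h>
    #include <string.h>

    enum { API_LAUNCH = 0, API_MEMCPY_D2H = 1, API_SYNC = 2 };  /* illustrative IDs */

    #define BATCH_MAX 64
    #define ARGS_MAX  256

    struct queued_call { int api_id; size_t len; char args[ARGS_MAX]; };

    static struct queued_call batch[BATCH_MAX];
    static int batch_count;

    static void flush_batch(void)
    {
        if (batch_count == 0) return;
        /* send_rpc_batch(batch, batch_count);  -- one RPC carries the whole batch */
        batch_count = 0;
    }

    static int is_synchronous(int api_id)
    {
        /* device-to-host copies and explicit synchronization must observe every
         * earlier call, so they are natural flush points */
        return api_id == API_MEMCPY_D2H || api_id == API_SYNC;
    }

    void lazy_call(int api_id, const void *args, size_t len)
    {
        if (batch_count == BATCH_MAX) flush_batch();
        batch[batch_count].api_id = api_id;
        batch[batch_count].len = len < ARGS_MAX ? len : ARGS_MAX;
        memcpy(batch[batch_count].args, args, batch[batch_count].len);
        batch_count++;
        if (is_synchronous(api_id)) flush_batch();  /* caller needs the result now */
    }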
VMRPC is the first remote procedure call system designed directly for virtual machine architectures. Unlike traditional RPC systems, it uses inter-VM shared memory to share the heap and stack, avoiding unnecessary data copying and serialization and reaching a level of performance that traditional RPC cannot attain on a virtual machine platform. Experimental results show that VMRPC's throughput can be more than 100 times that of traditional RPC. By combining API remoting with VMRPC, vCUDA achieves both transparency and efficiency: tests on the official samples and third-party programs show that the virtualization overhead introduced by vCUDA is below 21%, which gives the system considerable practical value.
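A minimal sketch of the zero-copy idea behind such shared-memory RPC, assuming a memory region already mapped by both VMs (for example, granted pages on Xen or an ivshmem device on KVM); the channel layout, the spin-wait, and the map_shared_region helper are hypothetical simplifications, not VMRPC's real design:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct shared_channel {
        atomic_int state;        /* 0 = idle, 1 = request ready, 2 = reply ready */
        int        api_id;
        int        result;
        uint8_t    args[4096];   /* arguments live in place: no serialization    */
    };

    /* Hypothetical: returns a pointer to the region both VMs have mapped. */
    extern struct shared_channel *map_shared_region(void);

    /* Client side: write the arguments directly into shared memory and wait
     * for the server side to process them and post the result. */
    int shared_call(int api_id, const void *args, size_t len)
    {
        static struct shared_channel *ch;
        if (!ch) ch = map_shared_region();

        if (len > sizeof ch->args) len = sizeof ch->args;
        memcpy(ch->args, args, len);         /* the only copy made on this path */
        ch->api_id = api_id;
        atomic_store(&ch->state, 1);         /* publish the request */

        while (atomic_load(&ch->state) != 2) /* a real system would block on an */
            ;                                /* inter-VM event channel instead  */
        atomic_store(&ch->state, 0);
        return ch->result;
    }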
On top of this CUDA virtualization infrastructure, vCUDA supports classic virtual machine features such as device multiplexing and suspend/resume, so VM applications that depend on these features can be deployed on the vCUDA framework without change. vCUDA multiplexes the GPU with a one-to-many model in which a single worker thread serves requests from different clients. For suspend/resume, vCUDA saves and restores the current CUDA state in the gaps between kernel executions. Experimental results show that multiplexing and suspend/resume introduce only limited overhead and meet the needs of practical applications. Furthermore, building on the CUDA state tracking and management techniques used in vCUDA, this thesis implements an in-kernel checkpoint scheme, IKC, on top of traditional out-of-kernel GPU checkpointing, strengthening the GPU's fault tolerance.
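A minimal sketch of the kind of state tracking that makes suspend/resume possible, assuming the interception layer records every device allocation it forwards; the record list and the commented copy helpers are hypothetical placeholders rather than vCUDA's actual bookkeeping:

    #include <stdlib.h>

    struct dev_alloc {                /* one tracked device allocation */
        void  *dev_ptr;               /* device address handed back to the guest  */
        size_t size;
        void  *shadow;                /* host-side copy filled in at suspend time */
        struct dev_alloc *next;
    };

    static struct dev_alloc *allocs;  /* populated by the cudaMalloc shim */

    void track_alloc(void *dev_ptr, size_t size)
    {
        struct dev_alloc *a = malloc(sizeof *a);
        a->dev_ptr = dev_ptr; a->size = size; a->shadow = NULL;
        a->next = allocs; allocs = a;
    }

    /* Invoked in the gap between kernel launches, when no kernel is running. */
    void suspend_state(void)
    {
        for (struct dev_alloc *a = allocs; a; a = a->next) {
            a->shadow = malloc(a->size);
            /* copy_from_device(a->shadow, a->dev_ptr, a->size);   -- D2H copy */
        }
    }

    void resume_state(void)
    {
        for (struct dev_alloc *a = allocs; a; a = a->next) {
            /* a->dev_ptr = alloc_on_device(a->size);              -- re-create */
            /* copy_to_device(a->dev_ptr, a->shadow, a->size);     -- H2D copy  */
            free(a->shadow); a->shadow = NULL;
        }
    }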
System virtual machines are an important research topic in virtualization, which is the fundamental infrastructure of cloud computing. System virtual machine technology has successfully virtualized many I/O devices, but the GPU (Graphics Processing Unit) is an exception: in particular, the general-purpose processing capability of the GPU (GPGPU) has never been fully virtualized on system virtual machine platforms. In practice, academia and the VMM industry chose to realize GPU virtualization at a higher layer, the Application Programming Interface (API), and some preliminary results focusing on traditional graphics APIs have been published. CUDA (Compute Unified Device Architecture) is a brand-new API designed directly for GPGPU; it provides the ability to manipulate the GPU hardware without the help of a graphics API. The rise of CUDA shows that virtualizing the graphics API is not enough for GPGPU workloads that use a dedicated API framework: existing graphics API virtualization has no effect on CUDA applications. An independent GPU API framework therefore calls for an independent virtualization method.
In order to improve the usability of GPGPU in virtualized environments, this thesis describes vCUDA, a general-purpose GPU (GPGPU) computing solution for virtual machines (VMs). vCUDA allows applications executing within VMs to leverage hardware acceleration, which can benefit a class of high-performance computing (HPC) applications. The key elements of our design are API call interception and redirection, lazy RPC, a dedicated RPC system for VMs, and support for advanced VMM features.
With API interception and redirection, Compute Unified Device Architecture (CUDA) applications in VMs can access the graphics hardware and achieve high computing performance in a transparent way, without modifying the application or the operating system. Evaluation with the official examples and third-party applications shows that vCUDA mimics the original CUDA protocol in the virtualized environment; all tests produce the same results as in the native environment.
Thousands of CUDA API calls can be issued by a single CUDA application. If vCUDA sent every API call to the remote side at the moment it is intercepted, the same number of RPCs would be invoked and the overhead of excessive world switches would inevitably be introduced into the vCUDA system. vCUDA borrows the idea from graphics API virtualization and adopts an optimization mechanism called lazy RPC, which improves system performance by intelligently batching specific API calls. The related experiments show that lazy RPC reduces the number of remote calls to 30% and speeds vCUDA up by up to 148%.
In the current study, vCUDA achieves near-native performance with a dedicated RPC system, VMRPC. VMRPC is a lightweight RPC framework specifically designed for VMs that leverages heap and stack sharing to circumvent unnecessary data copying and serialization/deserialization, thereby achieving high performance. Our evaluation shows that RPC throughput improves by two orders of magnitude. We carried out a detailed analysis of the performance of our framework (vCUDA + VMRPC). Using a number of unmodified official examples from the CUDA SDK and third-party applications, we observed that CUDA applications running with vCUDA exhibit a very low performance penalty (less than 21%) compared with the native environment, thereby demonstrating the viability of the vCUDA architecture.
vCUDA exposes device multiplexing and suspend/resume functions on top of CUDA virtualization, so any CUDA application built on these features can run as usual in virtual machines without modification. vCUDA develops a one-to-many model to multiplex the GPU device in the VM: under the coordination of the vCUDA stub, two different service threads can cooperatively manipulate one hardware resource by connecting to a single worker thread. Suspend/resume is realized by saving and restoring the CUDA state while no kernel is running. The device multiplexing and suspend/resume tests show that the performance degradation introduced by vCUDA is trivial. Based on vCUDA's CUDA state tracking technology, we also realize an inter-kernel checkpoint scheme on the GPU.
