Research on Key Technologies of an MPI-based High-Performance Cloud Computing Platform
Abstract
Cloud computing is a recently popular computing paradigm and the principal technical means for massive data processing today. Although existing cloud platform technology is fairly mature, it still falls short on problems that are both data-intensive and compute-intensive, handling them rather inefficiently. On the one hand, mainstream platforms generally build their lower layers on virtualization, so all software and applications run on virtual hardware, which incurs a performance penalty; the literature reports losses of up to about 20%. This dissertation builds a simulated non-virtualized lower layer for a cloud computing platform and benchmarks it; the experiments show that the virtualized configuration achieves only 50-70% of the efficiency of the non-virtualized one. On the other hand, the existing MapReduce model of cloud computing stores intermediate data before forwarding it; as the volume of intermediate data grows, this store-and-forward strategy generates a large number of useless remote I/O operations, so its efficiency cannot meet the needs of high-performance computing applications.
     Building on an in-depth analysis of the weaknesses of existing cloud platforms and of the fault-tolerance capabilities of MPI (Message Passing Interface), this dissertation develops a prototype MPI-based high-performance cloud computing platform (MPI-based HPCCP). The platform forgoes virtualization and builds its lower layer directly from heterogeneous computing nodes; it rewrites the MapReduce programming model using MPI extended with multi-level fault tolerance, together with multithreading, to avoid large numbers of useless I/O operations and thereby raise efficiency, meeting the demands that data-intensive, compute-intensive massive-data problems place on cloud computing.
     The main contributions of the proposed MPI-based high-performance cloud computing platform are as follows:
     1. To address the substantial performance loss in the lower layer of existing cloud platforms, a non-virtualized method for constructing the lower layer of a cloud computing platform is proposed.
     In a heterogeneous node environment, instead of using the currently popular virtualization techniques, this work exploits MPI's strength in heterogeneous environments to build the cloud infrastructure service layer directly on heterogeneous hardware, eliminating virtualization's impact on the performance of the underlying hardware and thereby improving platform efficiency. Experiments show that the virtualized configuration delivers only 50%-70% of the performance of the non-virtualized one. This is an important contribution of the dissertation.
     2. To address MPI's weakness in fault tolerance, the dissertation improves MPI's fault-tolerance techniques and proposes, designs, and implements a multi-level fault-tolerance scheme for MPI.
     Although MPI excels at high-performance computing, weak fault tolerance has long been one of its major defects, limiting its use in massive data processing; without solving this problem, MPI cannot be applied to cloud computing. This dissertation studies MPI fault-tolerance techniques in depth and implements fault-tolerance schemes at three distinct levels: job rescheduling, job/task recovery, and dynamic task migration, remedying MPI's deficiency in fault tolerance. This is an important feature and contribution of the dissertation.
     3. To address the limited computational efficiency of MapReduce, the prevailing cloud programming model, the dissertation proposes and implements a MapReduce model based on multi-level fault-tolerant MPI.
     In the existing MapReduce model, data transfer is encapsulated by the distributed file system, and computation requires repeated I/O operations against that file system, which hurts efficiency. This dissertation rewrites the MapReduce programming model using multi-level fault-tolerant MPI so that intermediate results are processed directly, cutting unnecessary I/O and improving the speed and efficiency of cloud computing: execution time is 25% of that of Hadoop, the current mainstream cloud platform. This is an important feature and contribution of the dissertation.
     The dissertation tests and analyzes in detail the effect of split size on performance, the robustness and efficiency of the multi-level fault tolerance, and the overall performance of the new platform, and compares the new platform with the traditional Hadoop platform. Experiments show that the proposed MPI-based high-performance cloud computing platform runs more than four times faster than the traditional Hadoop platform.
     Finally, the dissertation summarizes the work and briefly discusses open problems and future research plans. Future work will study how to handle computational dependencies between nodes, and how to combine inter-node MPI parallelism with intra-node CPU+GPU parallelism on this platform.
Cloud computing is a principal technique for massive data processing; however, it is inefficient for problems that are both data-intensive and compute-intensive. The lower layer of current cloud platforms relies on virtualization, so all system and application software runs on virtual hardware, which, as reported in the literature, can reduce performance by up to 20 percent. On the other hand, the MapReduce paradigm of cloud computing adopts a store-and-forward strategy for intermediate data, which generates a great number of I/O operations on big data and cannot be applied efficiently to high-performance scientific computing.
     Based on the above considerations, and in view of MPI's weakness in fault tolerance, the dissertation focuses on developing an MPI-based high-performance cloud computing platform (HPCCP), which configures the lower layer of the platform directly from heterogeneous computing nodes without virtualization, and reprograms the MapReduce paradigm by integrating multi-level fault-tolerant MPI techniques with multithreading, avoiding a great number of unnecessary I/O operations and increasing efficiency. The proposed and implemented MPI-based HPCCP prototype can efficiently handle data-intensive as well as compute-intensive problems, satisfying the requirements of high-performance cloud computing.
     The main contributions of the proposed MPI-based HPCCP platform are as follows:
     1. A methodology that configures the lower layer of the cloud computing platform directly from heterogeneous computing nodes, without virtualization.
     Instead of adopting the fashionable virtualization techniques, the proposed and implemented MPI-based HPCCP platform fully exploits MPI's adaptability to heterogeneous computing nodes to construct the IaaS layer of the cloud computing platform directly. This is an important contribution that increases the productivity of the cloud platform by removing virtualization's detrimental influence on hardware performance in the IaaS layer.
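     Building the IaaS layer directly from heterogeneous nodes implies that the platform must discover and record each node's differing capacity. The following minimal sketch illustrates the idea of a heterogeneity-aware resource table; the node names, attributes, and the gather step are illustrative assumptions (on the real platform each MPI rank would report its own hardware, e.g. via an MPI gather to rank 0), not the dissertation's actual implementation:

```python
# Simulated per-node hardware reports; on the real platform each MPI
# rank would report its own values (e.g. gathered at rank 0).
node_reports = [
    {"node": "n0", "cores": 8,  "mem_gb": 16},
    {"node": "n1", "cores": 4,  "mem_gb": 8},
    {"node": "n2", "cores": 16, "mem_gb": 64},
]

def build_resource_table(reports):
    # The master assembles a table and weights each node by its core
    # count, so task assignment can respect heterogeneity instead of
    # assuming identical (virtualized) instances.
    total_cores = sum(r["cores"] for r in reports)
    return {
        r["node"]: {"cores": r["cores"],
                    "mem_gb": r["mem_gb"],
                    "share": r["cores"] / total_cores}
        for r in reports
    }

table = build_resource_table(node_reports)
```

Weighting each node by its share of the total cores lets the scheduler assign proportionally more work to stronger nodes, which is what distinguishes direct heterogeneous deployment from a layer of uniform virtual machines.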
     2. Improvement and implementation of multi-level fault-tolerance techniques for MPI.
     Compared with its excellent high-performance computing capability, weak fault tolerance is a crucial defect of MPI, one that limits its application to big data processing; MPI cannot be adopted in cloud computing unless this defect is resolved. The dissertation comprehensively studies MPI fault-tolerance techniques and proposes and implements three fault-tolerance mechanisms allocated to three different levels: job rescheduling, job/task recovery, and dynamic task migration. This contribution remedies MPI's deficiency in fault tolerance and is another distinguishing feature of the dissertation.
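     The division of labor between the three fault-tolerance levels named above can be sketched as follows; this is a minimal illustration with hypothetical class and method names, not the dissertation's actual code:

```python
class FaultTolerantScheduler:
    """Illustrative three-level fault tolerance:
    level 1 -- job rescheduling: rerun the whole job on live nodes;
    level 2 -- job/task recovery: resume a task from its checkpoint;
    level 3 -- dynamic task migration: move a dead node's tasks away."""

    def __init__(self, nodes):
        self.nodes = set(nodes)       # currently live nodes
        self.checkpoints = {}         # task -> last completed step

    def save_checkpoint(self, task, step):
        # Level 2 support: record task progress periodically.
        self.checkpoints[task] = step

    def recover_task(self, task):
        # Level 2: resume from the last checkpoint instead of step 0.
        return self.checkpoints.get(task, 0)

    def migrate_tasks(self, dead_node, assignment):
        # Level 3: reassign the failed node's tasks to survivors.
        self.nodes.discard(dead_node)
        survivors = sorted(self.nodes)
        return {task: survivors[i % len(survivors)]
                for i, task in enumerate(sorted(assignment[dead_node]))}

    def reschedule_job(self, job_tasks):
        # Level 1: last resort -- redistribute every task of the job.
        live = sorted(self.nodes)
        return {t: live[i % len(live)] for i, t in enumerate(job_tasks)}
```

In this sketch, checkpoint-based recovery (level 2) and migration (level 3) handle isolated task and node failures cheaply, while whole-job rescheduling (level 1) remains the fallback when the cheaper levels cannot restore progress.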
     3. An efficient MapReduce prototype for the MPI-based HPCCP platform, designed and implemented.
     In current MapReduce implementations, data transfer is encapsulated by the distributed file system (DFS), so repeated I/O operations against the DFS take place during data processing, seriously reducing system efficiency. The dissertation reprograms the MapReduce paradigm on a redesigned multi-level fault-tolerant MPI platform that processes intermediate results directly, reducing unnecessary I/O, speeding up the computation, and yielding higher efficiency. Compared with Hadoop, the currently dominant MapReduce implementation, the MPI-based HPCCP reduces the processing time of a big-data fingerprint-recognition workload to 25 percent.
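     The contrast between staging intermediate data in a DFS and delivering it directly can be illustrated with a minimal in-memory word count; the dictionaries below stand in for MPI point-to-point messages, so this is an illustrative sketch rather than the platform's real communication layer:

```python
from collections import defaultdict

def map_phase(splits):
    # Each "mapper" emits (word, 1) pairs for its input split.
    for split in splits:
        for word in split.split():
            yield word, 1

def shuffle_direct(pairs, n_reducers):
    # Instead of writing intermediate pairs to a DFS and reading them
    # back, each pair is routed straight to the reducer that owns its
    # key -- the role MPI point-to-point messages play in the model.
    inboxes = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        inboxes[hash(key) % n_reducers][key].append(value)
    return inboxes

def reduce_phase(inboxes):
    result = {}
    for inbox in inboxes:
        for key, values in inbox.items():
            result[key] = sum(values)
    return result

splits = ["cloud mpi cloud", "mpi mpi hadoop"]
counts = reduce_phase(shuffle_direct(map_phase(splits), 2))
# counts == {"cloud": 2, "mpi": 3, "hadoop": 1}
```

The saving comes from `shuffle_direct`: intermediate pairs never touch remote storage, which is exactly the class of useless remote I/O the store-and-forward strategy incurs on large intermediate data.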
     The dissertation conducts intensive tests and case studies of the MPI-based HPCCP platform, among them: the influence of data block size on processing performance; the robustness and efficiency of the multi-level fault tolerance; and the general performance of the platform. Finally, the Hadoop platform and the MPI-based HPCCP platform are compared, and the experiments show that the proposed and implemented cloud computing platform achieves more than four times better runtime than the traditional Hadoop platform.
     The last section presents conclusions, lists problems that remain to be solved, and briefly describes the near-future research proposals.