Research on the MapReduce Parallel Programming Model in Cloud Computing

Original title (Chinese): 云计算中的MapReduce并行编程模式研究
Author: Wu Guixin (吴贵鑫)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Cloud Computing; MapReduce; Parallel Programming; Data Distribution; Hadoop; Heterogeneous Cluster
Degree year: 2010
Advisor: Xu Heli (许合利)
Discipline code: 081203
Degree-granting institution: Henan Polytechnic University (河南理工大学)
Thesis submission date: 2010-10-10
Defense committee chair: Jia Zongpu (贾宗璞)

Abstract

Cloud computing grew out of parallel computing, distributed computing, and grid computing, and it has brought parallel technology into everyday life. As cloud computing, personal high-performance computers (PHPC), and related technologies mature, many practitioners are moving from a single-machine working model to a parallel computing model. With cloud computing becoming widespread, parallel program design is a key problem that many programmers must face and solve.
The MapReduce parallel programming model proposed by Google greatly lowers the difficulty of developing parallel programs. Compared with traditional distributed programming, MapReduce encapsulates details such as parallel processing, fault tolerance, data-local computation, and load balancing, and it exposes a simple yet powerful programming interface.
This thesis first introduces the concept, basic theory, and research status of cloud computing, then describes several traditional parallel programming models and analyzes their principles and development. It briefly introduces the Google and Hadoop cloud computing architectures and compares MapReduce with MPI, examining their differences and respective strengths.
The thesis then explains the MapReduce programming idea in detail and analyzes how MapReduce solves a problem: its working principle, concrete steps, and methods. It describes the MapReduce fault-tolerance mechanism and analyzes the scheduling algorithms for MapReduce jobs in detail. It studies how MapReduce performance varies in a heterogeneous Hadoop cluster and analyzes the effect of heterogeneity on that performance. It proposes a new data distribution mechanism, HDDM, which allocates the input file according to the computing ratio of the heterogeneous nodes in the cluster and thereby improves MapReduce performance in heterogeneous Hadoop clusters.
Finally, experiments show that the proposed data distribution mechanism HDDM greatly improves the execution efficiency of MapReduce programs.
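
The abstract states only that HDDM allocates input data according to the computing ratio of each heterogeneous node; it does not give the algorithm. The following is a minimal sketch of that proportional idea, not the thesis's implementation. The function name `allocate_blocks`, the node names, the ratio values, and the largest-remainder rounding are all assumptions made for illustration.

```python
from fractions import Fraction


def allocate_blocks(total_blocks, compute_ratios):
    """Split total_blocks across nodes in proportion to compute_ratios.

    compute_ratios maps a node name to a positive relative speed.
    These values are hypothetical: the abstract does not say how the
    computing ratio is measured. Rounding uses the largest-remainder
    method (an assumption) so the shares always sum to total_blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not compute_ratios or any(r <= 0 for r in compute_ratios.values()):
        raise ValueError("compute_ratios must be non-empty and positive")

    total_ratio = sum(Fraction(r) for r in compute_ratios.values())

    # Exact (possibly fractional) share of blocks for each node.
    exact = {
        node: Fraction(ratio) * total_blocks / total_ratio
        for node, ratio in compute_ratios.items()
    }

    # Start from the whole-block part of each share.
    shares = {node: int(quota) for node, quota in exact.items()}

    # Hand the remaining blocks to the nodes with the largest fractional part.
    leftover = total_blocks - sum(shares.values())
    by_remainder = sorted(
        exact, key=lambda node: exact[node] - shares[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical three-node cluster with relative speeds 3 : 2 : 1.
    print(allocate_blocks(12, {"fast": 3, "medium": 2, "slow": 1}))
```

Under these assumptions, a node that is three times as fast as another receives three times as many input blocks, so all nodes would be expected to finish their local map work at roughly the same time instead of waiting on the slowest one.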

