基于分配适应度的Spark渐进填充分区映射算法

英文篇名：Progressive filling partitioning and mapping algorithm for Spark based on allocation fitness degree
作者：卞琛 ; 于炯 ; 修位蓉 ; 廖彬 ; 英昌甜 ; 钱育蓉
英文作者：BIAN Chen;YU Jiong;XIU Wei-rong;LIAO Bin;YING Chang-tian;QIAN Yu-rong;College of Software, Xinjiang University;College of Statistics and Information, Xinjiang University of Finance and Economics;
关键词：并行计算 ; Spark ; 渐进填充 ; 分区映射 ; 分配适应度
英文关键词：parallel computing;;Spark;;progressive filling;;partitioning and mapping;;allocation fitness degree
中文刊名：TXXB
英文刊名：Journal on Communications
机构：新疆大学软件学院;新疆财经大学统计与信息学院;
出版日期：2017-09-25
出版单位：通信学报
年：2017
期：v.38;No.361
基金：国家自然科学基金资助项目(No.61262088,No.61462079,No.61562078,No.61363083,No.61562086);; 新疆维吾尔自治区自然科学基金资助项目(No.2017D01A20);; 新疆维吾尔自治区高校科研计划基金资助项目(No.XJED2016S106);; 新疆财经大学科研博士启动基金资助项目(No.2015BS007)~~
语种：中文;
页：TXXB201709015
页数：15
CN：09
ISSN：11-2102/TN
分类号：137-151

摘要

分析Spark的作业执行机制,建立了执行效率模型和Shuffle过程模型,给出了分配适应度(AFD,allocation fitness degree)的定义,提出了算法的优化目标。根据模型的相关定义求解,设计了渐进填充分区映射算法(PFPM,progressive filling partitioning and mapping algorithm),通过扩展式分区和渐进填充映射,建立适应Reducer计算能力的数据分配方案,有效缩减Shuffle过程的同步延时,提高集群计算效率。实验表明该算法提高了Shuffle过程数据分配的合理性,优化了并行计算框架Spark的作业执行效率。
The job execution mechanism of Spark was analyzed, task efficiency model and Shuffle model were established, then allocation fitness degree(AFD) was defined and the optimization goal was put forward. On the basis of the model definition, the progressive filling partitioning and mapping algorithm(PFPM) was proposed. PFPM established the data distribution scheme adapting Reducers' computing ability to decrease synchronous latency during Shuffle process and increase cluster the computing efficiency. The experiments demonstrate that PFPM could improve the rationality of workload distribution in Shuffle and optimize the execution efficiency of Spark.

引文

[1]孟小峰,慈祥.大数据管理:概念、技术与挑战[J].计算机研究与发展,2013,50(1):146-169.MENG X F,CI X.Big data management:concepts,techniques and challenges[J].Journal of Computer Research and Development,2013,50(1):146-169.
    [2]付钰,李洪成,吴晓平,等.基于大数据分析的APT攻击检测研究综述[J].通信学报,2015,36(11):1-14.FU Y,LI H C,WU X P,et al.Detecting APT attacks:a survey from the perspective of big data analysis[J].Journal on Communications,2015,36(11):1-14.
    [3]STRANDE S M,CICOTTI P,SINKOVITS R S,et al.Gordon:design,performance,and experiences deploying and supporting a data intensive supercomputer[C]//The 1st Conference on the Extreme Science and Engineering Discovery Environment.2012:1-8.
    [4]杜小勇,陈峻,陈跃国.大数据探索式搜索研究[J].通信学报,2015,36(12):77-88.DU X Y,CHEN J,CHEN Y G.Exploratory search on big data[J].Journal on Communications,2015,36(12):77-88.
    [5]ZAHARIA M,CHOWDHURY M,DAS T,et al.Fast and interactive analytics over hadoop data with spark[J].Login,2012,37(4):45-51.
    [6]ZAHARIA M,XIN R,WENDELL P,et al.Apache Spark:a unified engine for big data processing[J].Communications of the ACM,2016,59(11):56-65.
    [7]CARBONE P,EWEN S,HARIDI S,et al.Apache flink:stream and batch processing in a single engine[J].Bulletin of the IEEE Computer Society Technical Committee on Data Engineering,2015,36(4):28-38.
    [8]TUMMALAPALLI S,MACHAVARAPU V R.Managing mysql cluster data using cloudera impala[J].Procedia Computer Science,2016,85(5):463-474.
    [9]SIKKA V,LEHNER W,SANG K C,et al.Efficient transaction processing in SAP HANA database:the end of a column store myth[C]//The 2012 ACM SIGMOD International Conference on Management of Data.2012:731-742.
    [10]DEAN J,GHEMAWAT S.Map Reduce:simplifed data processing on large clusters[C]//The Conference on Operating System Design and Implementation(OSDI).2004:137-150.
    [11]ZAHARIA M,CHOWDHURY M,DAS T,et al.Resilient distributed datasets:a fault-tolerant abstraction for in-memory cluster computing[C]//The 9th USENIX Conference on Networked Systems Design and Implementation.2012:2.
    [12]LIN X,WANG P,WU B.Log analysis in cloud computing environment with hadoop and spark[C]//The 5th IEEE International Conference on Broadband Network&Multimedia Technology(IC-BNMT).2013:273-276.
    [13]DONG X,XIE Y,MURALIMANOHAR N,et al.Hybrid checkpointing using emerging nonvolatile memories for future exascale system[J].ACM Transactions on Architecture and Code Optimization(TACO),2011,8(2):1-29.
    [14]田俊峰,张亚姣.基于马尔可夫的检查点可信评估方法[J].通信学报,2015,36(1):234-240.TIAN J F,ZHANG Y J.Checkpoint trust evaluation method based on Markov[J].Journal on Communications,2015,36(1):234-240.
    [15]ARMBRUST M,XIN R S,LIAN C,et al.Spark SQL:relational data processing in spark[C]//The 2015 ACM SIGMOD International Conference on Management of Data.2015:1383-1394.
    [16]IQBAL M H,SOOMRO T R.Big data analysis:apache storm perspective[J].International Journal of Computer Trends&Technology,2015,19(1):9-14.
    [17]ZAHARIA M,DAS T,LI H Y,et al.Discretized streams:fault-tolerant streaming computation at scale[C]//ACM Symposium on Operating Systems Principles.2013:423-438.
    [18]MENG X,BRADLEY J,YAVUZ B,et al.MLlib:machine learning in apache Spark[J].Journal of Machine Learning Research,2015,17(1):1235-1241.
    [19]GONZALEZ J E,XIN R S,DAVE A,et al.Graph X:graph processing in a distributed dataflow framework[C]//The 11th USENIX conference on Operating Systems Design and Implementation.2014:599-613.
    [20]廖彬,于炯,孙华,等.基于存储结构重配置的分布式存储系统节能算法[J].计算机研究与发展,2013,50(1):3-18.LIAO B,YU J,SUN H,et al.Energy-efficient algorithms for distributed storage system based on data storage structure reconfiguration[J].Journal of Computer Research and Development,2013,50(1):3-18.
    [21]DEAN J,GHEMAWAT S.Map Reduce:simplified data processing on large clusters[J].Operating Systems Design&Implementation,2004,5(1):147-152.
    [22]KWON Y,BALAZINSKA M,HOWE B,et al.A study of skew in Map Reduce application[J].Open Cirrus Summit,2011,1:1-5.
    [23]KWON Y,BALAZINSKA M,HOWE B,et al.Skew-resistant parallel processing of feature-extracting scientific user-defined functions[C]//The 1st ACM Symposium on Cloud Computing.2010:75-86.
    [24]王卓,陈群,李战怀,等.基于增量式分区策略的Map Reduce数据均衡方法[J].计算机学报,2016,39(1):19-35.WANG Z,CHEN Q,LI Z H,et al.An incremental partitioning strategy for data balance on Map Reduce[J].Chinese Journal of Computers,2016,39(1):19-35.
    [25]KWON Y,BALAZINSKA M,HOWE B,et al.Skew Tune:mitigating skew in Map Reduce applications[C]//The 2012 ACM SIGMOD International Conference on Management of Data.2012:25-36.
    [26]YAN W,XUE Y,MALIN B.Scalable and robust key group size estimation for reducer load balancing in Map Reduce[C]//IEEE Int Conference on Big Data.2013:156-162.
    [27]RAMAKRISHNAN S R,SWART G,URMANOV A,et al.Balancing reducer skew in Map Reduce workloads using progressive sampling[C]//The 3rd ACM Symposium on Cloud Computing(SOCC’12).2012:1-14.
    [28]GUFLER B,AUGSTEN N,REISER A,et al.Handing data skew in Map Reduce[C]//The 1st International Conference on Cloud Computing and Services Science.2011:574-583.
    [29]GUFLER B,AUGSTEN N,REISER A,et al.Load balancing in Map Reduce based on scalable cardinality estimates[C]//The 28th IEEE International Conference on Data Engineering(ICDE).2012:522-533.
    [30]TANG Z,ZHANG X S,LI K,et al.An intermediate data placement algorithm for load balancing in Spark computing[J].Future Generation Computer Systems,2016.
    [31]KOLB L,THOR A,RAHM E.Load balancing for Map Reduce-based entity resolution[C]//The 28th IEEE International Conference on Big Data Engineering(ICDE).2012:618-629.
    [32]KOLB L,THOR A,RAHM E,et al.Block-based load balancing for entity resolution with Map Reduce[C]//The 20th ACM International Conference on Information and Knowledge Management(CIKM).2011:2397-2400.
    [33]CHEN Q,YAO J Y,XIAO Z.Libra:lightweight data skew mitigation in Map Reduce[J].IEEE Transactions on Parallel&Distributed Systems,2015,26(9):2520-2533.
    [34]RACHA S C.Load balancing Map Reduce communications for efficient executions of applications in a cloud[M].India,Bangalore:Indian Institute of Science,2012:12-16.
    [35]IBRAHIM S,JIN H,LU L,et al.Handling partitioning skew in Map Reduce using LEEN[J].Peer-to-Peer Networking and Applications,2013,6(4):409-424.
    [36]DAI W,IBRAHIM I,BASSIOUNI M.Improving load balance for data-intensive computing on cloud platforms[C]//2016 IEEE International Conference on Smart Cloud.2016:140-145.
    [37]TANG Z,ZHANG X S,LI K L,et al.A data skew oriented reduce placement algorithm based on sampling[J].IEEE Transactions on Cloud Computing,2016.
    [38]FAN Y Q,WU W G,XU Y L,et al.Improving Map Reduce performance by balancing skewed loads[J].Communications,2014,11(8):85-108.
    [39]TRIGUERO I,GALAR M,VLUYMANS S.Evolutionary undersampling for extremely imbalanced big data classification under apache spark[C]//2016 IEEE Congress on Evolutionary Computation.2016:715-722.
    [40]MESTRE D G,PIRES C E,NASCIMENTO D C,et al.An efficient spark-based adaptive windowing for entity matching[J].Journal of Systems and Software,2017,128(6):1-10.
    [41]GHODSI A,ZAHARIA M,SHENKER S,et al.Choosy:max-min fair sharing for datacenter jobs with constraints[C]//The 8th ACM European Conference on Computer Systems.2013:365-378.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700