Self-Adaptive Cache Management Strategy for the Parallel Computing Framework Spark

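English title: Self-Adaptive Strategy for Cache Management in Spark
Authors: 卞琛; 于炯; 英昌甜; 修位蓉
English authors: BIAN Chen; YU Jiong; YING Chang-tian; XIU Wei-rong; School of Information Science and Engineering, Xinjiang University; School of Information and Engineering, Urumqi Vocational University
Keywords: parallel computing; cache management strategy; Spark; resilient distributed datasets
English keywords: parallel computing; cache management strategy; Spark; resilient distributed datasets
Chinese journal name: DZXU
English journal name: Acta Electronica Sinica
Institutions: School of Information Science and Engineering, Xinjiang University; School of Information Engineering, Urumqi Vocational University
Publication date: 2017-02-15
Publisher: Acta Electronica Sinica (电子学报)
Year: 2017
Issue: v.45; No.408
Funding: National Natural Science Foundation of China (No.61262088, No.61462079)
Language: Chinese
Pages: DZXU201702003
Page count: 7
CN: 02
ISSN: 11-2087/TN
Classification number: 24-30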

Abstract

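The parallel computing framework Spark lacks an effective cache selection mechanism and cannot automatically identify and cache highly reused data. Its cache replacement algorithm is LRU, whose measure is too coarse and hurts task execution efficiency. This paper proposes a self-adaptive cache management strategy for the Spark framework (Self-Adaptive Cache Management, SACM), consisting of an automatic cache selection algorithm (Selection), a parallel cache cleanup algorithm (Parallel Cache Cleanup, PCC), and a weight-based cache replacement algorithm (Lowest Weight Replacement, LWR). The selection algorithm analyzes the DAG (Directed Acyclic Graph) structure of a task to identify reused RDDs and cache them automatically. The parallel cache cleanup algorithm asynchronously removes RDDs that no longer have value, improving cluster memory utilization. The weight-based replacement algorithm chooses replacement targets by weight, avoiding the task delay caused by recomputing complex RDDs and preserving computing efficiency under resource bottlenecks. Experiments show that the strategy improves Spark's task execution efficiency and makes effective use of memory resources.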
As a parallel computation framework, Spark lacks a good strategy for selecting valuable RDDs to cache in limited memory. When memory is fully loaded, Spark discards the least recently used RDD while ignoring other factors such as computation cost. This paper proposes a self-adaptive cache management strategy (SACM), which comprises an automatic selection algorithm (Selection), a parallel cache cleanup algorithm (PCC), and a lowest weight replacement algorithm (LWR). The Selection algorithm seeks out valuable RDDs and caches their partitions to speed up data-intensive computations. PCC cleans up valueless RDDs asynchronously to improve memory utilization. LWR takes comprehensive account of an RDD's usage frequency, computation cost, and size. Experimental results show that Spark with the selection algorithm computes faster than traditional Spark, that the parallel cleanup algorithm improves memory utilization, and that LWR performs better in limited memory.
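
The two ideas in the abstract that lend themselves to a short illustration are reuse detection on the DAG and weight-based eviction. The following is a minimal sketch, not the paper's algorithms: the DAG is a hypothetical dictionary mapping each RDD name to its parents, "reused" is assumed to mean "consumed by more than one downstream RDD", and the weight formula is an assumption (the abstract only says that usage frequency, computation cost, and size are considered). All names are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


def select_reused_rdds(dag):
    """Return the RDD names consumed by more than one downstream RDD.

    `dag` maps each RDD name to the list of parent RDD names it is
    computed from (a hypothetical, simplified view of a job's DAG).
    """
    child_count = defaultdict(int)
    for parents in dag.values():
        for parent in set(parents):  # count each child once per parent
            child_count[parent] += 1
    return sorted(name for name, n in child_count.items() if n > 1)


@dataclass
class CachedRdd:
    name: str
    use_count: int       # expected number of remaining uses
    compute_cost: float  # estimated seconds to recompute
    size_mb: float       # memory footprint in MB


def weight(rdd):
    # ASSUMED formula, not taken from the paper: the weight grows with
    # reuse and recomputation cost and shrinks with memory footprint.
    return rdd.use_count * rdd.compute_cost / rdd.size_mb


def evict_lowest_weight(cache, needed_mb):
    """Evict lowest-weight RDDs until at least `needed_mb` is freed."""
    evicted, freed = [], 0.0
    for rdd in sorted(cache, key=weight):
        if freed >= needed_mb:
            break
        evicted.append(rdd.name)
        freed += rdd.size_mb
    return evicted


if __name__ == "__main__":
    dag = {
        "raw": [],
        "parsed": ["raw"],
        "stats": ["parsed"],
        "graph": ["parsed"],
        "report": ["stats", "graph"],
    }
    print(select_reused_rdds(dag))

    cache = [
        CachedRdd("a", use_count=1, compute_cost=2.0, size_mb=100.0),
        CachedRdd("b", use_count=5, compute_cost=10.0, size_mb=50.0),
        CachedRdd("c", use_count=2, compute_cost=1.0, size_mb=200.0),
    ]
    print(evict_lowest_weight(cache, needed_mb=250.0))
```

In this toy run only `parsed` feeds two downstream RDDs, so it is the sole caching candidate, and the small but expensive, frequently used `b` survives eviction while the large, cheap entries go first. LRU would instead decide purely on recency of access.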

References

[1] Zaharia M, Das T, Li H, et al. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters[A]. Proceedings of the 4th USENIX Workshop on Hot Topics in Cloud Computing[C]. Boston, MA: USENIX, 2012. 1-6.
[2] Apache Spark. Spark Overview[EB/OL]. http://spark.apache.org, 2015-01-21/2015-03-18.
[3] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[A]. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation[C]. Berkeley, CA: USENIX, 2012. 1-14.
[4] Young N. On-line file caching[A]. Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms[C]. Baltimore, MD: ACM, 1999. 82-86.
[5] Swain D, Paikaray B, Swain D. AWRP: Adaptive weight ranking policy for improving cache performance[EB/OL]. http://arxiv.org/abs/1107.4851, 2011-07-25/2015-03-18.
[6] FANG Juan, WANG Jing, LI Chengyan, et al. Partition-based cache replacement to manage shared L2 caches[J]. Chinese Journal of Electronics, 2014, 23(3): 464-467.
[7] SI Chengxiang, MENG Xiaoxuan, XU Lu. A novel replacement algorithm designed for websearch applications[J]. Acta Electronica Sinica, 2011, 39(5): 1205-1209. (in Chinese)
[8] Byan S, Lentini J, Madan A, et al. Mercury: Host-side flash caching for the datacenter[A]. Proceedings of the 28th Symposium on Mass Storage Systems and Technologies[C]. New York, NY: ACM, 2012. 1-12.
[9] Li H, Ghodsi A, Zaharia M, et al. Tachyon: Reliable, memory speed storage for cluster computing frameworks[A]. Proceedings of the 27th IEEE Conference on SYSTEM-ON-CHIP[C]. Las Vegas, NV: IEEE, 2014. 1-15.
[10] Ongaro D, Rumble S M, Stutsman R, et al. Fast crash recovery in RAMCloud[A]. Proceedings of the 23rd ACM Symposium on Operating Systems Principles[C]. New York, NY: ACM, 2011. 29-41.
[11] Ghodsi A, Zaharia M, Shenker S, et al. Choosy: Max-min fair sharing for datacenter jobs with constraints[A]. Proceedings of the 8th ACM European Conference on Computer Systems[C]. New York, NY: ACM, 2013. 365-378.
[12] Jure L. Stanford Network Analysis Project[EB/OL]. http://snap.Stanford.edu/, 2013-05-16/2015-03-18.
