云计算数据中心HDFS差异性存储节能优化算法

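English title: HDFS Differential Storage Energy-Saving Optimal Algorithm in Cloud Data Center
Authors: 杨挺; 王萌; 张亚健; 赵英杰; 盆海波
Authors (English): YANG Ting; WANG Meng; ZHANG Ya-Jian; ZHAO Ying-Jie; PEN Hai-Bo; School of Electrical and Information Engineering, Tianjin University
Keywords: cloud computing data center; distributed file storage system; energy saving; hypergraph; ■-transversal
Keywords (English): cloud data center; distributed file storage system; energy-saving; hypergraph; ■
Journal title (Chinese): JSJX
Journal title (English): Chinese Journal of Computers
Affiliation: School of Electrical and Information Engineering, Tianjin University
Publication date: 2018-03-05 14:55
Publisher: 计算机学报 (Chinese Journal of Computers)
Year: 2019
Issue: v.42; No.436
Funding: National Natural Science Foundation of China (61571324); Key Project of the Natural Science Foundation of Tianjin (16JCZDJC30900); National International Science and Technology Cooperation Program (2013DFA11040)
Language: Chinese
Pages: JSJX201904003
Page count: 15
CN: 04
ISSN: 11-1826/TP
Classification code: 47-61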

Abstract

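In data centers, the infrastructure of cloud computing, the Hadoop Distributed File System (HDFS) is widely used for its high fault tolerance, reliability, and scalability. However, the rack-aware storage policy followed by HDFS does not consider the differences among data or their frequency of use: all data are replicated with the same number of replicas and scattered across different DataNodes, which inevitably keeps too many DataNodes powered on and drives data center energy consumption too high. To address this problem, this paper breaks the constraint in existing HDFS of storing a constant number of replicas per data block and proposes a variable-replica storage strategy that guarantees data block availability. A hypergraph model of distributed file storage is established, giving a mathematical description of the many-to-many relationships among data blocks, files, and DataNodes. Based on the model, a ■-transversal hyperedge computation method is proposed to select the variable ■-fold minimal covering set for HDFS in the data center, thereby determining the smallest set of DataNodes that must be powered on to guarantee data availability and saving energy in the data center's storage units. The feasible region of the original problem may contain multiple optimal solutions; that is, under the condition that data blocks are ■-covered, there exist several schemes with the same minimum number of active DataNodes. The problem is therefore a multimodal function optimization problem, which this paper solves with a greedy firefly algorithm. Performance tests ran three typical workloads in a Hadoop environment, WordCount, TeraSort, and Grep, and comprised a data availability experiment, an HDFS cluster storage load-balancing experiment, a cluster energy consumption analysis, and a data center network performance experiment. The results show that the variable ■ data-replica minimum covering set algorithm, while keeping data blocks and files available, powers on fewer DataNodes, effectively reduces HDFS cluster energy consumption, and, through a reasonable configuration of the active DataNodes, relieves network transmission congestion.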
In data centers, the infrastructure of the cloud, the Hadoop Distributed File System (HDFS) has been widely used for handling large amounts of data because of its excellent fault tolerance, reliability, and scalability. Large files stored in an HDFS-based data center are split into a number of small data blocks, and the default size of each data block is 64 MB. To improve the reliability of data blocks, HDFS creates multiple replicas of each data block in the data center. The replicas and the original data blocks are stored on different data nodes according to the rack-aware storage strategy. With this strategy, if a data node fails, the data hosted on that physical machine remain available because their replicas can still be retrieved from other data nodes. However, these storage systems usually adopt the same replication and storage strategy for all data to guarantee availability, i.e., they create the same number of replicas for every data set and store them randomly across data nodes. Such strategies do not fully consider the different availability requirements of different data sets. More servers than necessary are therefore used to store replicas of rarely used data, which increases energy consumption. With the increasing number of data centers built around the world to maintain cloud computing capabilities, operators face enormous electricity bills. To address this issue, this paper studies an HDFS differential storage energy-saving optimization algorithm for cloud data centers. Breaking through the limitation of a constant number of replicas in existing storage methods, we propose a storage strategy with a variable number of active replicas for each data block according to user requirements for data availability. First, this paper develops a novel hypergraph-based storage model for cloud data centers, which can precisely represent the many-to-many relationships among files, data blocks, data racks, and data nodes. Based on this model, a ■ hyperedge algorithm is proposed to calculate the minimum set of data nodes with variable ■. Because only the minimum number of required data nodes is kept running, the data center saves energy while maintaining full functionality. Analysis of this optimization problem shows that the feasible region contains more than one optimal solution; that is, multiple solutions with the same minimum number of active data nodes satisfy the data blocks' ■ constraints. The problem is thus a multimodal function optimization problem, and this paper proposes a greedy firefly algorithm to solve it. We have also implemented the proposed algorithm in an HDFS-based prototype data center and evaluated its performance with WordCount, TeraSort, and Grep cloud computing cases, analyzing four aspects of the data center: data availability, load balance, energy consumption, and network performance. Experimental results show that the variable hypergraph coverage based strategy not only reduces energy consumption by keeping fewer data nodes active but also relieves delivery congestion in the data center network.
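The covering idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's ■ hyperedge method or its greedy firefly algorithm, whose details the abstract does not give; it is a plain greedy multi-cover heuristic, shown only to make the problem statement concrete. The names `replicas`, `required`, and `greedy_multicover`, and the four-node toy cluster, are hypothetical. Each data block is treated as a hyperedge over the DataNodes that hold one of its replicas, and each block is assumed to need a given number of replicas on powered-on nodes.

```python
from typing import Dict, Set


def greedy_multicover(replicas: Dict[str, Set[str]],
                      required: Dict[str, int]) -> Set[str]:
    """Choose DataNodes to keep on so that every block b has at least
    required[b] of its replicas on active nodes (greedy heuristic,
    not guaranteed to be optimal)."""
    # Remaining demand per block; a block cannot need more replicas than exist.
    need: Dict[str, int] = {}
    for block, nodes in replicas.items():
        r = required.get(block, 1)
        if r > len(nodes):
            raise ValueError(f"block {block} needs {r} replicas, has {len(nodes)}")
        need[block] = r

    # Invert the hypergraph: DataNode -> blocks it hosts.
    hosted: Dict[str, Set[str]] = {}
    for block, nodes in replicas.items():
        for node in nodes:
            hosted.setdefault(node, set()).add(block)

    active: Set[str] = set()
    while any(n > 0 for n in need.values()):
        # Gain of a node = number of still-unsatisfied blocks it hosts.
        # Sorting the candidates makes tie-breaking deterministic.
        candidates = sorted(set(hosted) - active)
        best = max(candidates,
                   key=lambda n: sum(1 for b in hosted[n] if need[b] > 0))
        active.add(best)
        for b in hosted[best]:
            if need[b] > 0:
                need[b] -= 1
    return active


# Hypothetical toy cluster: block -> DataNodes holding a replica.
replicas = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn1", "dn3", "dn4"},
    "b3": {"dn2", "dn3", "dn4"},
}
# Assumed per-block availability requirement (active replicas needed).
required = {"b1": 2, "b2": 1, "b3": 1}

active = greedy_multicover(replicas, required)
print(sorted(active))
```

In this toy case two of the four DataNodes stay on, which is the fewest possible because `b1` alone needs two active replicas. A greedy pass returns only one feasible covering set; when several sets of the same minimum size exist, as the abstract notes, a search over the alternatives would be needed to choose among them.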

