Improvement of Backup Data Placement Policy of Hadoop

Original title (Chinese): Hadoop备份数据存放策略的改进
Authors: ZHOU Chang-jun (周长俊); ZONG Ping (宗平)
Keywords: Hadoop; backup data placement policy; internal bandwidth; load balance; hot data
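Journal code (Chinese): WJFZ
Journal (English): Computer Technology and Development
Affiliations: School of Computer, Nanjing University of Posts and Telecommunications; School of Overseas Education, Nanjing University of Posts and Telecommunications
Publication date: 2018-11-15 10:11
Publisher: Computer Technology and Development (计算机技术与发展)
Year: 2019
Issue: v.29; No.261
Funding: National "863" High Technology Development Program of China (2006AA01Z208); Jiangsu Provincial University Natural Science Basic Research Project (06KJB520079)
Language: Chinese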
Page: WJFZ201901003
Page count: 6
CN: 01
ISSN: 61-1450/TP
Classification number: 17-22

Abstract

Under Hadoop's default backup data placement policy, once a local data replica fails, it must be recovered from the backup stored on a remote rack. Because the default policy chooses the backup storage node at random, backup data can end up unevenly distributed across nodes, and recovery can consume a large amount of internal bandwidth when the chosen node is far away. To address these problems, an improved backup data placement policy is proposed. The policy takes the distance between nodes, the load of each node, and the number of backup data recoveries into account when selecting a node, computes a matching degree for each node, and selects the node with the highest matching degree as the optimal remote-rack node for storing the backup. The policy thus balances the backup data load across nodes, limits the internal bandwidth consumed during recovery, and, by taking the number of replica failures into account, enables fast recovery of replicas that fail frequently. The improved policy was implemented on the Hadoop platform, and the results met the expected requirements.
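The abstract does not give the matching-degree formula, so the following Python sketch is only an illustration of the general idea under stated assumptions: each candidate remote-rack node is scored from its distance and load, the weight on distance grows with the number of recoveries of the data, and the highest-scoring node is selected. The names `Candidate`, `distance_weight`, `matching_degree`, and `select_node`, the linear scoring form, and the weight parameters are all hypothetical and are not taken from the paper or from the Hadoop API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate remote-rack node (hypothetical input model)."""
    name: str
    distance: int   # network distance from the source node (assumed > 0)
    load: float     # fraction of storage already used, in [0, 1]


def distance_weight(recoveries: int, base: float = 0.5,
                    step: float = 0.1, cap: float = 0.9) -> float:
    """Assumed rule: data recovered more often weights distance more heavily."""
    return min(cap, base + step * recoveries)


def matching_degree(node: Candidate, max_distance: int, recoveries: int) -> float:
    """Assumed linear score in [0, 1]; higher means a better placement."""
    closeness = 1.0 - node.distance / max_distance  # 0 for the farthest node
    free = 1.0 - node.load                          # 1 for an empty node
    w = distance_weight(recoveries)
    return w * closeness + (1.0 - w) * free


def select_node(candidates: list[Candidate], recoveries: int) -> Candidate:
    """Return the candidate with the highest matching degree."""
    if not candidates:
        raise ValueError("no candidate nodes")
    max_distance = max(c.distance for c in candidates)
    return max(candidates,
               key=lambda c: matching_degree(c, max_distance, recoveries))


# Usage: a near but heavily loaded node versus a far but lightly loaded one.
nodes = [Candidate("rack2/node1", distance=4, load=0.8),
         Candidate("rack3/node5", distance=6, load=0.3)]
print(select_node(nodes, recoveries=0).name)  # rarely recovered data
print(select_node(nodes, recoveries=4).name)  # frequently recovered data
```

With these assumed weights, rarely recovered data goes to the lightly loaded node, which favors load balance, while frequently recovered data goes to the nearer node, which favors cheap recovery. The actual policy in the paper may combine the three factors differently.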
