Research on Erasure-Code-Based Fault-Tolerant Storage for Data-Intensive Supercomputing
Abstract
As an emerging computing paradigm, data-intensive supercomputing plays an extremely important role in data-intensive application domains such as high-energy physics, bioinformatics, astronomical computing, earthquake prediction, and commercial computing. Data-intensive supercomputing is data-centric: the system is responsible for storing, maintaining, and processing massive data, and the demand for storing and processing that data drives continuous growth in system scale. As storage scale expands, the probability that a storage-node failure, caused by hardware faults, operator error, virus attacks, power outages, fires, and other causes, brings down the whole system rises significantly. The system must therefore provide strong fault tolerance to guarantee high data reliability and availability.
     With its strong fault tolerance and high space utilization, erasure coding offers an effective fault-tolerance mechanism for building large-scale, highly reliable, highly fault-tolerant storage systems for data-intensive supercomputing. In this setting, however, the high repair cost of erasure codes tends to consume large amounts of system bandwidth during recovery; frequent node failures tend to lose the massive intermediate data produced during large-scale data processing, causing jobs to fail; and uneven data placement tends to leave node utilization low and system energy consumption high. To address these problems, this thesis studies erasure coding with high fault tolerance and low repair cost, erasure-code-based fault-tolerant storage management for intermediate data, and power-aware data placement. The main research progress is as follows:
     The high repair cost of erasure codes tends to occupy large amounts of system bandwidth during recovery and degrade system performance. To address the shortcomings of existing erasure codes, this thesis proposes EXPyramid, an array-structured erasure code with high fault tolerance and low repair cost. On top of EXPyramid, it proposes two repair algorithms that minimize repair cost: RMFA for multiple-node failures and RSFA for single-node failures. EXPyramid uses an array structure to improve fault tolerance and encodes large datasets in groups to weaken the coupling between original and redundant data, thereby lowering repair cost. For multiple-node failures, RMFA iterates so that each repair step incurs the minimum cost, minimizing the total repair cost; for single-node failures, RSFA uses breadth-first search to traverse efficiently and find the shortest repair path, obtaining the minimum repair cost. Theoretical analysis shows that EXPyramid achieves stronger fault tolerance and lower repair cost than existing erasure codes.
     The intermediate data produced during distributed parallel processing of large-scale data is a class of critical data: its loss or corruption tends to cause subsequent tasks to fail. Among existing fault-tolerance methods for intermediate data, task re-execution tends to cause cascading effects that consume large amounts of computing resources, while replication incurs excessive storage overhead. To address these shortcomings, this thesis proposes EBIDS, a fault-tolerant storage management method for intermediate data based on the EXPyramid code. EBIDS manages intermediate data with the XOR-based EXPyramid code: XOR operations are fast and well suited to the short lifetime of intermediate data; redundant encoding effectively reduces storage overhead; and computing and transferring redundant information in a streaming fashion effectively reduces the computation and bandwidth load on any single node. Experimental results show that under normal operation EBIDS interferes little with the system, and under single-node failures it effectively prevents cascading effects and reduces the impact of failures on jobs and the system. Compared with replication-based fault-tolerant storage management for intermediate data, EBIDS provides the same fault-tolerance guarantee while effectively reducing the volume of redundant intermediate data and the storage overhead.
     In an erasure-code-based storage system for data-intensive applications, placing data sensibly helps balance the storage load and node utilization, and suspending some idle nodes can save energy. Existing dynamic data placement techniques tend to trigger massive data migration, occupying system bandwidth and degrading performance, while static placement techniques ignore data access characteristics and tend to leave node utilization unbalanced. To address these problems, this thesis proposes TRBDPM, a power-aware data placement method based on time relativity. TRBDPM exploits the statistical properties of data access patterns in data-intensive applications: it introduces the notion of time relativity and cross-places data blocks and redundant blocks that have no time relativity, avoiding correlation between tasks so that node utilization stays balanced over long periods; meanwhile, idle nodes can be suspended to save energy. Experimental results show that TRBDPM spreads data evenly, keeps the storage load balanced, balances node utilization over long periods, and gives some nodes enough time to enter low-power mode and be suspended, saving energy.
As a new computing paradigm, data-intensive supercomputing (DISC) plays an important role in high-energy physics, bioinformatics, astronomy, meteorology, earthquake prediction, and commercial computing. DISC is data-centric: the system is responsible for storing, maintaining, and processing massive data, and this demand drives continuous growth in system scale. As the system grows, the probability that a storage-node failure, caused by hardware faults, operator mistakes, virus attacks, power outages, fires, and so on, brings down the whole system increases significantly. The system must therefore provide strong fault tolerance to ensure high data reliability and availability.
     The strong fault tolerance and high space utilization of erasure coding provide an effective fault-tolerance scheme for building large-scale storage systems with high reliability and fault tolerance for DISC. Nevertheless, several new problems must be solved before erasure coding can be used in DISC. First, the high recovery overhead of erasure coding may consume too much bandwidth. Second, the loss of intermediate data created during multi-stage computing may cause job failures. Third, unbalanced data placement may leave the system with high energy consumption. To address these issues, this thesis proposes an erasure code with high fault tolerance and low recovery overhead, a fault-tolerant storage management strategy for intermediate data, and a power-aware data placement strategy. The research results are as follows:
     The high recovery overhead of erasure codes may degrade system performance. This thesis first proposes EXPyramid, an array-based erasure code with high fault tolerance and low recovery overhead, and then two recovery algorithms that minimize recovery overhead: RMFA for multiple-node failures and RSFA for single-node failures. The array structure of EXPyramid enhances fault tolerance, and dividing data into groups before coding lowers recovery overhead. RMFA minimizes the total recovery overhead by iteratively choosing the minimum-overhead repair at each step; RSFA uses a breadth-first search strategy to enumerate the available recovery paths and find the shortest one, minimizing the overhead of a single recovery. Analysis shows that EXPyramid achieves both higher fault tolerance and lower recovery overhead than typical existing erasure codes.
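The single-failure repair idea above, preferring a small local parity group over the global one because it reads fewer surviving blocks, can be sketched as follows. This is a hypothetical illustration in the spirit of RSFA, not the thesis's actual EXPyramid construction: the group layout, block names, and cost model (blocks read) are all assumptions.

```python
from collections import deque

# Hypothetical hierarchical parity layout: two small local groups plus one
# global group, as in Pyramid-style codes. Each entry maps a parity block
# to the set of data blocks it protects.
GROUPS = {
    "p_local_1": {"d1", "d2", "d3"},
    "p_local_2": {"d4", "d5", "d6"},
    "p_global":  {"d1", "d2", "d3", "d4", "d5", "d6"},
}

def repair_cost(lost, alive):
    """Search the parity groups, smallest first, for the cheapest way to
    rebuild a single lost data block; returns the number of blocks read,
    or None if no fully surviving group covers the lost block."""
    # Visit groups in order of size so the cheapest repair is found first.
    queue = deque(sorted(GROUPS.items(), key=lambda kv: len(kv[1])))
    while queue:
        parity, members = queue.popleft()
        if lost in members:
            helpers = (members - {lost}) | {parity}
            if helpers <= alive:      # every helper block survived
                return len(helpers)   # repair cost = blocks that must be read
    return None

alive = {"d2", "d3", "d4", "d5", "d6", "p_local_1", "p_local_2", "p_global"}
print(repair_cost("d1", alive))  # local-group repair reads only 3 blocks
```

The point of the grouped layout is visible in the cost: repairing `d1` through its local group reads 3 blocks, whereas a flat code with only the global parity would read 6.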
     Intermediate data is critical to large-scale distributed processing of large datasets. The re-execution mechanism may trigger cascaded re-execution, which consumes large amounts of system resources, while replicating intermediate data may incur large storage overhead. For these reasons, this thesis proposes EBIDS, a fault-tolerant storage management strategy for intermediate data based on the EXPyramid code. By using the XOR-based EXPyramid code, which suits the short lifetime of intermediate data, EBIDS reduces storage overhead; by computing and transferring redundant data through pipelined communication, it alleviates the computation and communication burden on any single node. Experiments show that EBIDS creates less redundant intermediate data than the replication strategy, effectively prevents cascaded re-execution, and makes the system resilient to failures with little interference to normal operation.
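The core of the XOR-based redundancy idea can be shown in a few lines: one parity chunk protects a group of k intermediate-data chunks, so a single lost chunk is rebuilt from the survivors instead of re-executing the task that produced it, at a storage overhead of 1/k rather than the 1x or more of replication. This is a minimal sketch of the general technique; the chunk contents and group size are illustrative assumptions, not EBIDS internals.

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

# k = 3 intermediate-data chunks (hypothetical map outputs of equal length).
chunks = [b"map-out-0", b"map-out-1", b"map-out-2"]
parity = xor_chunks(chunks)  # one redundant chunk: 1/k storage overhead

# Single-chunk loss: XOR the survivors with the parity to rebuild chunk 1,
# avoiding a cascaded re-execution of the producing task.
recovered = xor_chunks([chunks[0], chunks[2], parity])
assert recovered == chunks[1]
```

XOR encoding and decoding cost only one pass over the data, which is why it suits short-lived intermediate data better than heavier finite-field codes.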
     Appropriately placing data in an erasure-coded storage system for data-intensive applications helps balance the storage load and node utilization, and suspending idle nodes saves energy. Dynamic data placement may consume too much system bandwidth, while static data placement ignores data access features and may leave node utilization unbalanced. To solve these problems, this thesis proposes TRBDPM, a time-relativity-based power-aware data placement strategy. Considering data access features, TRBDPM distributes the data blocks and redundant blocks of different datasets in a cross manner to avoid the relativity between tasks. Experiments show that TRBDPM distributes data uniformly, keeps the storage load and node utilization balanced, and saves energy by suspending some idle nodes for long periods.
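One reading of the cross-placement idea can be sketched as follows: blocks (including redundant blocks) of datasets with no time relativity, i.e. non-overlapping access windows, are interleaved across the same node set, so every node carries a similar load in each window and nodes idle outside their windows can be suspended. The datasets, access windows, node count, and the simple round-robin interleaving are all assumptions for illustration, not the actual TRBDPM algorithm.

```python
NODES = ["n0", "n1", "n2"]

# Hypothetical dataset -> (access window, data blocks plus one redundant block).
# The two datasets have no time relativity: their windows do not overlap.
DATASETS = {
    "day_logs":   ("daytime", ["A0", "A1", "A2", "A_parity"]),
    "night_jobs": ("night",   ["B0", "B1", "B2", "B_parity"]),
}

placement = {n: [] for n in NODES}
idx = 0
for _name, (window, blocks) in DATASETS.items():
    for blk in blocks:
        # A single global round-robin counter interleaves the datasets,
        # so each node receives blocks from both access windows and the
        # per-node block counts differ by at most one.
        placement[NODES[idx % len(NODES)]].append((window, blk))
        idx += 1

for node, blks in placement.items():
    print(node, blks)
```

With this interleaving every node serves both windows at a similar load, which balances utilization over time; a placement that instead confined an entire dataset to a few nodes would leave those nodes hot in one window and the rest idle.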
