Disk Failure Detection Mechanism in Distributed Storage Systems

Authors: LIU Liu; LI Xiao-yong
Affiliation: School of Cyber Security, Shanghai Jiao Tong University
Keywords: distributed storage system; disk failure detection; random access point
Journal code: HDZJ
Journal: Information Technology (信息技术)
Publication date: 2018-05-25
Publisher: Information Technology (信息技术)
Year: 2018
Language: Chinese
Record ID: HDZJ201805020
Page count: 7
Issue: 05
CN: 23-1557/TN
Pages: 91-97

Abstract

Disk failures are common in large-scale distributed storage systems. Failed disks must be found quickly to reduce the risk of data loss, and they must be identified with high accuracy to limit the time and money spent on replacing disks. To meet both requirements, this paper proposes a detection method based on random point sampling of the disk space: the disk space is divided into N equal parts, a randomly chosen sector is read within these parts, and the disk is judged faulty or healthy from the I/O status and the I/O latency of the reads. Experiments show that the method finds failed disks in a distributed storage system in a short time and with high accuracy, which improves the reliability of the distributed storage system.
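The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `pick_offsets` and `probe_disk`, the 512-byte sector size, the default of 8 parts, the 0.5-second latency limit, and the choice to read one sector from every part are all assumptions made for illustration; the abstract does not state the paper's actual thresholds or decision rule.

```python
import os
import random
import time

SECTOR_SIZE = 512  # assumed sector size in bytes


def pick_offsets(disk_size, n_parts, rng=random):
    """Return one random sector-aligned byte offset from each of n_parts equal regions."""
    total_sectors = disk_size // SECTOR_SIZE
    if n_parts <= 0 or total_sectors < n_parts:
        raise ValueError("need at least one sector per part")
    offsets = []
    for i in range(n_parts):
        first = i * total_sectors // n_parts           # first sector of region i
        last = (i + 1) * total_sectors // n_parts - 1  # last sector of region i
        offsets.append(rng.randint(first, last) * SECTOR_SIZE)
    return offsets


def probe_disk(path, n_parts=8, latency_limit=0.5, rng=random):
    """Classify a disk as "healthy" or "faulty" from sampled reads.

    A read error, a short read, or a read slower than latency_limit
    (seconds, an assumed threshold) marks the disk as faulty.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        disk_size = os.lseek(fd, 0, os.SEEK_END)
        for offset in pick_offsets(disk_size, n_parts, rng):
            start = time.monotonic()
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, SECTOR_SIZE)
            except OSError:
                return "faulty"  # the I/O status reports an error
            elapsed = time.monotonic() - start
            if len(data) < SECTOR_SIZE or elapsed > latency_limit:
                return "faulty"  # incomplete or too-slow read
        return "healthy"
    finally:
        os.close(fd)
```

A call such as `probe_disk("/dev/sdb")` (a hypothetical device path) would need read permission on the device. Reads served from the operating system's page cache would hide the real device latency, so a production probe would likely have to bypass the cache; that is omitted here to keep the sketch portable.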

