Abstract
Failure detectors are a fundamental building block of highly available distributed systems: they allow such systems to provide continuous, reliable service. Their goal is fast, accurate failure detection at the lowest possible detection overhead. Current research on failure detectors centers on adaptive failure detection and on mechanisms for sharing detection results, with the aim of steadily improving quality of service in terms of detection time, detection accuracy, and detection overhead.