Analysis of Relationship Between Modes of Intercomputer Communications and Fault Types in Redundant Computer Systems
详细信息    查看全文
  • 关键词:Reliability ; Fault ; tolerance ; Redundant computer system ; Non ; Byzantine and Byzantine fault types ; Byzantine agreement algorithm ; Protocols of intercomputer communications ; Modes of intercomputer communications
  • 刊名:Lecture Notes in Computer Science
  • 出版年:2017
  • 出版时间:2017
  • 年:2017
  • 卷:10220
  • 期:1
  • 页码:1-32
  • 参考文献:1.Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 1(1), 11–33 (2004)CrossRef
    2.Bentley, J.: Introduction to Reliability and Quality Engineering. Addison-Wesley, Reading (1999)
    3.Pradhan, D.K. (ed.): Fault-tolerant Computer System Design. Prentice-Hall Inc., Upper Saddle River (1996)
    4.Kwak, S.W., Choi, B.J., Kim, B.K.: An optimal checkpointing-strategy for real-time control systems under transient faults. IEEE Trans. Reliab. 50(3), 293–301 (2001)CrossRef
    5.Zhang, Y., Jiang, J.: Integrated active fault-tolerant control using IMM approach. IEEE Trans. Aerosp. Electron. Syst. 37(4), 1221–1235 (2001)CrossRef
    6.Alvisi, L., Malkhi, D., Pierce, E., Reiter, M.K.: Fault detection for Byzantine quorum systems. IEEE Trans. Parallel Distrib. Syst. 12(9), 996–1007 (2001)CrossRef
    7.Lamport, L., Shostak, R., Pease, M.: The Byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982)CrossRef MATH
    8.Lima, G.M., Burns, A.: A consensus protocol for CAN-based systems. In: 24th IEEE Real-Time Systems Symposium, RTSS 2003, pp. 420–429. IEEE (2003)
    9.Cristian, F., Aghili, H., Strong, R., Dolev, D.: Atomic broadcast: from simple message diffusion to Byzantine agreement. Inf. Comput. 118(1), 158–179 (1995)MathSciNet CrossRef MATH
    10.Pelc, A., Peleg, D.: Broadcasting with locally bounded Byzantine faults. Inf. Process. Lett. 93(3), 109–115 (2005)MathSciNet CrossRef MATH
    11.Fitzi, M., Gottesman, D., Hirt, M., Holenstein, T., Smith, A.: Detectable Byzantine agreement secure against faulty majorities. In: Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing, pp. 118–126. ACM (2002)
    12.Fitzi, M., Hirt, M.: Optimally efficient multi-valued Byzantine agreement. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing, pp. 163–168. ACM (2006)
    13.Bao, F., Igarishi, Y.: Reliable broadcasting in product networks with Byzantine faults. In: Proceedings of Annual Symposium on Fault Tolerant Computing, pp. 262–271. IEEE (1996)
    14.Keichafer, R.M., Walter, C.J., Finn, A.M., Thambidurai, P.M.: The MAFT architecture for distributed fault tolerance. IEEE Trans. Comput. 37(4), 398–404 (1988)CrossRef
    15.Powell, D., Arlat, J., Beus-Dukic, L., Bondavalli, A., Coppola, P., Fantechi, A., Jenn, E., Rabejac, C., Wellings, A.: GUARDS: a generic upgradable architecture for real-time dependable systems. IEEE Trans. Parallel Distrib. Syst. 10(6), 580–599 (1999)CrossRef
    16.Totel, E., Beus-Dukic, L., Blanquart, J.P., Deswarte, Y., Powell, D., Wellings, A.: Integrity management in GUARDS. In: Davies, N., Jochen, S., Raymond, K. (eds.) Middleware 1998, pp. 105–122. Springer, London (1998)
    17.Palumbo, D.L., Butler, R.W.: A performance evaluation of the software-implemented fault-tolerance computer. J. Guidance Control Dyn. 9(2), 175–180 (1986)CrossRef
    18.Hopkins, A.L., Smith, T.B., Lala, J.H.: FTMP: a highly reliable fault-tolerant multiprocess for aircraft. Proc. IEEE 66(10), 1221–1239 (1978)CrossRef
    19.Han, S., Shin, K.G.: Experimental evaluation of failure-detection schemes in real-time communication networks. In: Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing, FTCS-27, Digest of Papers, pp. 122–131. IEEE (1997)
    20.Rufino, J., Verissimo, P., Arroz, G., Almeida, C., Rodrigues, L.: Fault-tolerant broadcasts in CAN. In: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Digest of Papers, pp. 150–159. IEEE (1998)
    21.AlMohammad, B., Bose, B.: Fault-tolerant communication algorithms in toroidal networks. IEEE Trans. Parallel Distrib. Syst. 10(10), 976–983 (1999)CrossRef
    22.Hsieh, H.C., Chiang, M.L.: A new solution for the Byzantine agreement problem. J. Parallel Distrib. Comput. 71(10), 1261–1277 (2011)CrossRef MATH
    23.Saini, P., Singh, A.K.: An efficient Byzantine fault tolerant agreement. In: AIP Conference Proceedings, vol. 1324, no. 1 (2010)
    24.Wang, S.S., Yan, K.Q., Wang, S.C.: An optimal solution for Byzantine agreement under a hierarchical cluster-oriented mobile ad hoc network. Comput. Electr. Eng. 36(1), 100–113 (2010)CrossRef MATH
    25.Moniz, H., Neves, N.F., Correia, M.: Byzantine fault-tolerant consensus in wireless ad hoc networks. IEEE Trans. Mobile Comput. 12(12), 2441–2454 (2013)CrossRef
    26.Veronese, G.S., Correia, M., Bessani, A.N., Lung, L.C., Verissimo, P.: Efficient Byzantine fault-tolerance. IEEE Trans. Comput. 62(1), 16–30 (2013)MathSciNet CrossRef
    27.Kotla, R., Clement, A., Wong, E., Alvisi, L., Dahlin, M.: Zyzzyva: speculative Byzantine fault tolerance. Commun. ACM 51(11), 86–95 (2008)CrossRef
    28.Keidar, I., Rajsbaum, S.: On the cost of fault-tolerant consensus when there are no faults: preliminary version. SIGACT News 32(2), 45–63 (2001)CrossRef
    29.Banu, N., Izumi, T., Wada, K.: Adaptive and doubly-expedited one-step consensus in Byzantine asynchronous systems. Parallel Process. Lett. 21(04), 461–477 (2011)MathSciNet CrossRef MATH
    30.Patra, A., Choudhury, A., Rangan, C.P.: Asynchronous Byzantine agreement with optimal resilience. Distrib. Comput. 27(2), 111–146 (2014)MathSciNet CrossRef MATH
    31.Xu, X., Lin, Y.: Checkpoint selection in fault recovery based on Byzantine fault model. In: Fourth International Conference on Computational Intelligence and Communication Networks (CICN), pp. 582–587, November 2012
    32.Widder, J., Biely, M., Gridling, G., Weiss, B., Blanquart, J.P.: Consensus in the presence of mortal Byzantine faulty processes. Distrib. Comput. 24(6), 299–321 (2012)CrossRef MATH
    33.Wang, S.C., Yan, K.Q., Ho, C.L., Wang, S.S.: The optimal generalized Byzantine agreement in cluster-based wireless sensor networks. Comput. Stan. Interfaces 36(5), 821–830 (2014)CrossRef
    34.Abdelhakim, M., Lightfoot, L.E., Ren, J., Li, T.: Distributed detection in mobile access wireless sensor networks under Byzantine attacks. IEEE Trans. Parallel Distrib. Syst. 25(4), 950–959 (2014)CrossRef
    35.Duran, A., Ferrer, R., Costa, J.J., Gonzàlez, M., Martorell, X., Ayguadé, E., Labarta, J.: A proposal for error handling in OpenMP. Int. J. Parallel Prog. 35(4), 393–416 (2007)CrossRef MATH
    36.Bronevetsky, G., Marques, D., Pingali, K., Szwed, P., Schulz, M.: Application-level checkpointing for shared memory programs. In: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS XI, pp. 235–247. ACM, New York (2004)
    37.Bronevetsky, G., Pingali, K., Stodghill, P.: Experimental evaluation of application-level checkpointing for OpenMP programs. In: Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006, pp. 2–13. ACM, New York (2006)
    38.Fu, H., Ding, Y.: Using redundant threads for fault tolerance of OpenMP programs. In: 2010 International Conference on Information Science and Applications, pp. 1–8, April 2010
    39.Li, M., Hsiao, M.S.: 3-D parallel fault simulation with GPGPU. IEEE Trans. Comput. Aided Design Integr. Circuits Syst. 30(10), 1545–1555 (2011)CrossRef
    40.Guo, X., Jiang, H., Li, K.C.: A checkpoint/restart scheme for CUDA applications with complex memory hierarchy. In: 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp. 247–252, July 2013
    41.Carlo, S.D., Gambardella, G., Martella, I., Prinetto, P., Rolfo, D., Trotta, P.: Fault mitigation strategies for CUDA GPUs. In: 2013 IEEE International Test Conference (ITC), pp. 1–8, September 2013
    42.Xu, X.H., Yang, X.J., Xue, J.L., Lin, Y.F., Lin, Y.S.: PartialRC: a partial recomputing method for efficient fault recovery on GPGPUs. J. Comput. Sci. Technol. 27(2), 240–255 (2012)CrossRef
    43.Laosooksathit, S., Nassar, R., Leangsuksun, C., Paun, M.: Reliability-aware performance model for optimal GPU-enabled cluster environment. J. Supercomputing 68(3), 1630–1651 (2014)CrossRef
    44.Demchik, V., Kolomoyets, N.: QCDGPU: open-source package for Monte Carlo lattice simulations on OpenCL-compatible multi-GPU systems (2013)
    45.Avizienis, A.: Fault-tolerance: a property that ensures constant availability of digital system. IEEE Trans. Comput. 66(10), 5–25 (1978)MathSciNet
    46.Pease, M., Shostak, R., Lamport, L.: Reaching agreement in the presence of faults. J. ACM 27(2), 228–234 (1980)MathSciNet CrossRef MATH
    47.Mamedli, È.M., Samedov, R.Y., Sobolev, N.: A method for localization of Byzantine and nonbyzantine faults. Avtomatika i Telemekhanika 5, 126–138 (1992)MATH
    48.Samet, R.: Recovery device for real-time dual-redundant computer systems. IEEE Trans. Dependable Secure Comput. 8(3), 391–403 (2011)CrossRef
    49.Samet, R.: Choosing between design options for real-time computers tolerating a single fault. J. Circuits Syst. Comput. 19(05), 1041–1068 (2010)CrossRef
    50.Sivencrona, H., Johannessen, P., Persson, M., Torin, J.: Heavy-ion fault injections in the time-triggered communication protocol. In: Lemos, R., Weber, T.S., Camargo, J.B. (eds.) LADC 2003. LNCS, vol. 2847, pp. 69–80. Springer, Heidelberg (2003). doi:10.​1007/​978-3-540-45214-0_​8 CrossRef
    51.Driscoll, K., Hall, B., Sivencrona, H., Zumsteg, P.: Byzantine fault tolerance, from theory to reality. In: Anderson, S., Felici, M., Littlewood, B. (eds.) SAFECOMP 2003. LNCS, vol. 2788, pp. 235–248. Springer, Heidelberg (2003). doi:10.​1007/​978-3-540-39878-3_​19 CrossRef
    52.Tanenbaum, A.S.: Computer Networks, vol. 3. Prentice Hall, New Jersey (1996)MATH
    53.Stallings, W.: Data and computer communications. Pearson/Prentice Hall (2007)
    54.Mullender, S.: Distributed Systems. ACM Press/Addison-Wesley Publishing Co. (1993)
    55.Coulouris, G.F., Dollimore, J., Kindberg, T.: Distributed Systems: Concepts and Design. Pearson education (2005)
    56.Mitra, S., Saxena, N.R., McCluskey, E.J.: A design diversity metric and analysis of redundant systems. IEEE Trans. Comput. 51(5), 498–510 (2002)CrossRef
    57.Samedov, R.: An approach to the support of the fault-tolerance of the double redundant computer control systems. Math. Comput. Appl. 4(2), 175–184 (1999)
    58.Kim, H., Jeon, H.J., Lee, K., Lee, H.: The design and evaluation of all voting triple modular redundancy system. In: Proceedings. Annual Reliability and Maintainability Symposium, pp. 439–444. IEEE (2002)
    59.Smith, T.B.: Fault tolerant processor concepts and operation. In: Digest of Papers, FTCS-14, Kissimmee, USA, pp. 158–163 (1984)
    60.Laprie, J.C.: Dependable computing and fault-tolerance. In: Digest of Papers FTCS-15, pp. 2–11 (1985)
    61.Mamedli, È.M., Samedov, R.Y., Sobolev, N.: A method for localization of Byzantine and NonByzantine faults. J. Autom. Remote Control 53(5), 734–744 (1992)MATH
    62.Oh, N., Mitra, S., McCluskey, E.J.: ED4I: error detection by diverse data and duplicated instructions. IEEE Trans. Comput. 51(2), 180–199 (2002)CrossRef
    63.Siewiorek, D.P., Swarz, R.S.: Reliable Computer Systems: Design and Evaluation, 3rd edn. A.K. Peters Ltd., Natick (1998)MATH
    64.Samet, R.: Fault-tolerant procedures for redundant computer systems. Qual. Reliab. Eng. Int. 25(1), 41–68 (2009)CrossRef
    65.Hurst, S.L.: VLSI Testing: digital and mixed analogue/digital techniques, vol. 9. IET (1998)
    66.Lala, P.K.: Self-checking and fault-tolerant digital design. Morgan Kaufmann (2001)
    67.Powell, D.: Failure mode assumptions and assumption coverage. In: Randell, B., Laprie, J.C., Kopetz, H., Littlewood, B. (eds.) Predictably Dependable Computing Systems, pp. 123–140. Springer, Heidelberg (1995)CrossRef
    68.Laprie, J.C., Arlat, J., Blanquart, J., Costes, A., Crouzet, Y., Deswarte, Y., Fabre, J., Guillermain, H., Kaâniche, M., Kanoun, K., et al.: Guide de la sûreté de fonctionnement (dependability handbook). Cépaduès, Toulouse (1995)
    69.Ziv, A., Bruck, J.: An on-line algorithm for checkpoint placement. IEEE Trans. Comput. 46(9), 976–985 (1997)MathSciNet CrossRef
    70.Ling, Y., Mi, J., Lin, X.: A variational calculus approach to optimal checkpoint placement. IEEE Trans. Comput. 50(7), 699–708 (2001)CrossRef
    71.Lincoln, P., Rushby, J.: A formally verified algorithm for interactive consistency under a hybrid fault model. In: The Twenty-Third International Symposium on Fault-Tolerant Computing, FTCS-23, Digest of Papers, pp. 402–411. IEEE (1993)
    72.Meyer, F.J., Pradhan, D.K.: Consensus with dual failure modes. IEEE Trans. Parallel Distrib. Syst. 2(2), 214–222 (1991)CrossRef
    73.Thambidurai, P., Park, Y.K.: Interactive consistency with multiple failure modes. In: Proceedings, Seventh Symposium on Reliable Distributed Systems, pp. 93–100. IEEE (1988)
    74.Chor, B., Coan, B.A.: A simple and efficient randomized Byzantine agreement algorithm. IEEE Trans. Softw. Eng. 6, 531–539 (1985)MathSciNet CrossRef
    75.Kopetz, H.: Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer Science & Business Media, London (2011)CrossRef MATH
  • 作者单位:Refik Samet (15)
    Nermin Samet (16)

    15. Ankara University, Ankara, Turkey
    16. Middle East Technical University, Ankara, Turkey
  • 丛书名:Transactions on Computational Science XXIX
  • ISBN:978-3-662-54563-8
  • 卷排序:10220
文摘
This paper analyzes the reasons of appearance of non - Byzantine and Byzantine fault types in redundant computer systems. The proposed approach is based on analysis of the relationship between the modes of intercomputer communications and fault types. This analysis allows the users to design the redundant computer systems in such a way that Byzantine faults cannot appear. Consequently, designing the redundant computer systems, in which Byzantine faults cannot appear, allows the designers to increase the degree of reliability by preventing the masking of any forms of appearance of faults and by decreasing the time period of checkpoints. In addition, this approach decreases the cost of software and hardware involved in the execution of fault-tolerant procedures.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700