Experiences with software-based soft-error mitigation using AN codes

Authors: Martin Hoffmann; Peter Ulbrich; Christian Dietrich…
Keywords: Fault injection; Arithmetic code; Dependability
Journal: Software Quality Journal
Publication year: 2016
Publication date: March 2016
Volume: 24
Issue: 1
Pages: 87-113
Full-text size: 2,121 KB
Author affiliations: Martin Hoffmann (1)
Peter Ulbrich (1)
Christian Dietrich (1)
Horst Schirmeier (2)
Daniel Lohmann (1)
Wolfgang Schröder-Preikschat (1)

1. Chair of Distributed Systems and Operating Systems, Friedrich–Alexander University Erlangen–Nuremberg, 91058, Erlangen, Germany
2. Department of Computer Science 12, Technische Universität Dortmund, 44221, Dortmund, Germany
Journal subjects: Software Engineering/Programming and Operating Systems; Programming Languages, Compilers, Interpreters; Data Structures, Cryptology and Information Theory; Operating Systems
Publisher: Springer US
ISSN: 1573-1367

Abstract

Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is a complex area of mathematics, its practical implementation is comparatively simple. However, compliance with the theory is easily lost on the way to an actual implementation, which jeopardizes the intended fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
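
For readers unfamiliar with the technique, the basic idea of an AN code can be sketched in a few lines. This is only an illustration and not the authors' implementation: the constant `A`, the word widths, and all function names below are assumptions chosen for the example. A value is encoded by multiplying it with a constant A, a word is considered valid only if it is divisible by A, and the sum of two code words is again a code word.

```python
# Minimal AN-code sketch. All names and parameters are illustrative
# assumptions, not taken from the paper.
A = 58659        # assumed odd encoding constant
WORD_BITS = 32   # assumed width of an encoded word
DATA_BITS = 16   # assumed width of a plain (unencoded) value


def encode(value: int) -> int:
    """Encode a plain value as the code word A * value."""
    if not 0 <= value < (1 << DATA_BITS):
        raise ValueError("value does not fit into DATA_BITS")
    return value * A


def is_valid(word: int) -> bool:
    """A word is a valid code word only if it is a multiple of A."""
    return word % A == 0


def decode(word: int) -> int:
    """Check the word and recover the plain value."""
    if not is_valid(word):
        raise ValueError("corrupted code word detected")
    return word // A


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit soft error by inverting one bit."""
    return word ^ (1 << bit)


# Addition works directly on encoded values: A*x + A*y == A*(x + y).
x = encode(1200)
y = encode(34)
total = x + y

# A single bit flip changes the word by +/- 2**k. Because A is odd and
# greater than 1, 2**k is never a multiple of A, so the check fails.
detected = all(not is_valid(flip_bit(x, b)) for b in range(WORD_BITS))
```

Here `decode(total)` recovers the plain sum, and `detected` records whether every single-bit flip of `x` within the assumed 32-bit word fails the divisibility check. How well multi-bit errors are detected depends on the choice of A, which is one of the pitfalls the abstract alludes to.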
