Research on Software-Implemented Hardware Fault Detection Techniques in the Space Radiation Environment
Abstract
Space exploration is once again gaining momentum worldwide. As exploration activity intensifies, the adverse effect of the space radiation environment on the reliability of space probes is becoming more pronounced. The effects of space radiation on electronic devices fall into two classes: single event effects and total ionizing dose effects. Single event effects, and single event upsets in particular, are the main threat to the safety of on-board computers.
     Related research shows that, compared with hardware fault tolerance built on radiation-hardened devices, software-implemented hardware fault tolerance, which tolerates hardware faults in software running on commercial off-the-shelf (COTS) components, can deliver higher system performance while still guaranteeing system reliability. Software-implemented fault tolerance also offers low cost, low power consumption, and flexible configuration.
     Building on an analysis of existing software-implemented hardware fault tolerance techniques, this thesis studies fault detection algorithms and fault tolerance optimization in depth. First, it proposes a new fault detection technique based on hierarchical decomposition, the Fault Detection Technique by Program Hiberarchy (FDTPH). Unlike existing fault detection techniques, FDTPH divides the program structure into layers and applies a different fault detection algorithm at each layer. These algorithms cooperate and check for errors layer by layer, so FDTPH detects different kinds of faults and different types of program errors and thereby improves the reliability of program execution. Second, observing that different regions of a program usually respond differently in reliability and in performance once fault detection is applied, the thesis proposes the Configurable Fault Detection Algorithm (CFDA). CFDA builds analysis models of the reliability response and the performance response of the fault-tolerant program and uses their results to derive the most cost-effective fault tolerance configuration. Finally, following a compiler-based fault tolerance approach, the thesis implements FDTPH and CFDA and measures their fault detection capability and performance cost through fault injection experiments.
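The idea of validating a software detection scheme by fault injection can be illustrated with a small sketch. It is not FDTPH: it uses plain duplicate-and-compare on a single accumulator as a stand-in detector, models a single event upset as one inverted bit in a 32-bit word, and all names (`flip_bit`, `checked_sum`, `injection_campaign`) are hypothetical.

```python
import random

WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def flip_bit(value, bit):
    """Model a single event upset: invert one bit of a 32-bit word."""
    return (value ^ (1 << bit)) & MASK


def checked_sum(data, fault=None):
    """Sum `data` in two redundant accumulators and compare them.

    `fault` is an optional (step, bit) pair: right after element `step`
    is added, bit `bit` of the primary accumulator is flipped.
    Returns (result, detected).
    """
    primary = 0
    shadow = 0
    for step, x in enumerate(data):
        primary = (primary + x) & MASK
        shadow = (shadow + x) & MASK  # redundant copy of the computation
        if fault is not None and fault[0] == step:
            primary = flip_bit(primary, fault[1])
    # A mismatch between the two copies signals a detected fault.
    return primary, primary != shadow


def injection_campaign(data, trials, seed=0):
    """Inject one random bit flip per run and return the detection rate."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(WORD_BITS))
        _, hit = checked_sum(data, fault)
        detected += hit
    return detected / trials
```

In this toy model every injected flip lands in a duplicated value, so every fault is caught. It does not reproduce the thesis's experiments, where faults can also strike state that a given layer's check does not cover.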
     FDTPH detects 97.9%-99.1% of hardware faults. Compared with FDTPH, CFDA reduces the performance cost of fault tolerance by 12%-20% at the cost of a 0.5%-1.4% loss in fault detection rate.
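The trade-off behind a configurable scheme, giving up a little detection to save run time, can be sketched as a selection problem over program regions. The sketch below is an assumption-laden illustration, not the CFDA analysis models: the regions, their numbers, and the greedy gain-per-overhead heuristic are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    detection_gain: int  # hypothetical: faults caught (per 1000) if protected
    overhead: int        # hypothetical: added run time in percent, must be > 0


# Made-up example regions; a real tool would measure or model these values.
REGIONS = [
    Region("A", detection_gain=50, overhead=20),
    Region("B", detection_gain=30, overhead=30),
    Region("C", detection_gain=15, overhead=5),
    Region("D", detection_gain=5, overhead=25),
]


def choose_regions(regions, overhead_budget):
    """Greedily protect the regions with the best gain per unit of overhead.

    Returns (names of protected regions, total overhead spent).
    """
    ranked = sorted(
        regions,
        key=lambda r: r.detection_gain / r.overhead,
        reverse=True,
    )
    chosen, spent = [], 0
    for region in ranked:
        if spent + region.overhead <= overhead_budget:
            chosen.append(region.name)
            spent += region.overhead
    return chosen, spent
```

With a budget of 60, `choose_regions(REGIONS, 60)` protects C, A, and B and leaves D unprotected, since D adds 25 points of overhead for very little extra detection.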
