Research on Assembly-Language-Based Control Flow Error Detection Algorithms
Abstract
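As the feature size, supply voltage, and threshold voltage of integrated circuits shrink, processors become more sensitive to noise such as crosstalk, electromagnetic interference, and particle radiation, and the reliability problems that hardware transient faults cause in computer systems grow increasingly prominent. In radiation environments in particular, transient faults induced by particle radiation are a major factor affecting system reliability. Radiation-hardened devices can prevent such faults, but their low performance, high price, and high power consumption make them unsuitable for high-performance computing. Researchers have therefore begun to use high-performance, low-cost, low-power COTS (commercial off-the-shelf) devices in radiation environments, tolerating transient faults through software and hardware techniques to improve system reliability. Among hardware transient faults, the most harmful are those that cause erroneous control-flow jumps. To reduce their impact on system reliability, this dissertation studies algorithms for detecting erroneous control-flow jumps and methods for evaluating those algorithms.
     Control flow error detection algorithms target either high-level languages or assembly language. Because assembly-language-based algorithms are simple to implement and have lower performance overhead and a lower undetected-error rate than high-level-language-based ones, this dissertation focuses on them. These algorithms mainly rely on signature checking and must settle four issues: the checking granularity, the representation of signature information, the insertion points of the checking instructions, and the signature checking method. The dissertation studies these issues and proposes improved assembly-language-based control flow error detection algorithms. To analyze detection capability more accurately in theory, it also refines the existing model for verifying control flow error detection capability. In addition, most control flow error detection algorithms cannot recover from faults, so how to exploit the characteristics of the microprocessor architecture to tolerate control flow errors is also a question worth studying.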
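     The four issues above (granularity, signature representation, insertion points, and checking method) can be made concrete with a minimal Python sketch in the style of CFCSS, the signature scheme this abstract names. It is an illustration under stated assumptions, not the dissertation's algorithm: the four-block graph, the signature values, and the helper names `signature_difference`, `exit_adjustment`, and `check_trace` are all hypothetical. The granularity is one basic block, each block carries a unique signature, the check sits at block entry, and the check XORs a run-time register G with a compile-time difference (plus an adjusting register D at fan-in blocks).

```python
# Hypothetical four-block control-flow graph (not taken from the dissertation):
# B0 -> B1, B0 -> B2, B1 -> B3, B2 -> B3.
SIGNATURES = {"B0": 0b0001, "B1": 0b0010, "B2": 0b0011, "B3": 0b0100}
PREDECESSORS = {"B0": [], "B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}


def signature_difference(block):
    """Compile-time constant d = s(first predecessor) XOR s(block)."""
    preds = PREDECESSORS[block]
    return SIGNATURES[preds[0]] ^ SIGNATURES[block] if preds else 0


def exit_adjustment(block):
    """Value written to D when `block` runs, or None if D is left unchanged.

    Simplifying assumption: a block feeds at most one fan-in block.
    """
    for preds in PREDECESSORS.values():
        if len(preds) > 1 and block in preds:
            return SIGNATURES[preds[0]] ^ SIGNATURES[block]
    return None


def check_trace(trace):
    """Replay a sequence of executed blocks.

    Returns the name of the block whose entry check fails, or None if
    every check passes.
    """
    g = SIGNATURES[trace[0]]  # run-time signature register G
    d_reg = 0                 # run-time adjusting register D
    adjust = exit_adjustment(trace[0])
    if adjust is not None:
        d_reg = adjust
    for block in trace[1:]:
        g ^= signature_difference(block)
        if len(PREDECESSORS[block]) > 1:  # fan-in block: apply D as well
            g ^= d_reg
        if g != SIGNATURES[block]:
            return block                  # control flow error detected
        adjust = exit_adjustment(block)
        if adjust is not None:
            d_reg = adjust
    return None
```

     In this sketch the legal traces `["B0", "B1", "B3"]` and `["B0", "B2", "B3"]` pass every check, while the illegal jump in `["B0", "B3"]` fails at the entry of `B3`. Because the check runs only at block entry, a wrong jump that stays inside one block would leave G unchanged and go unnoticed.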
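     To address these problems, this dissertation studies the following:
     (1) Traditional models for verifying the detection capability of control flow error detection algorithms rarely consider how the added checking instructions themselves affect that capability. To analyze detection capability more accurately in theory, the dissertation studies a verification model for the detection capability of control flow error detection algorithms;
     (2) Because the assembly-language-based CFCSS algorithm is highly practical, the dissertation studies its problems of detection confusion and detection errors. To lower system power consumption and reduce the number of checking points, it also modifies the checking granularity and proposes a low-power control flow error detection algorithm that does not weaken detection capability. To improve detection capability and eliminate redundant dependences between basic blocks, it studies signature representation and signature checking methods and proposes the assembly-language-based DPNCFC algorithm;
     (3) Signature-based control flow error detection algorithms fix the checking locations at compile time, which delays fault discovery and lowers system reliability. Moreover, these algorithms use the basic block as the unit of checking, so without additional redundant checking instructions they cannot detect erroneous control-flow jumps within a basic block. To solve these two problems, the dissertation studies control flow error detection from a combined hardware/software perspective;
     (4) Because most control flow checking algorithms lack fault-tolerance capability, the dissertation studies, on the R80515 architecture and with a combined hardware/software approach, methods for tolerating control flow errors, so that detection and fault recovery are coupled more tightly.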
As the feature size, supply voltage, and threshold voltage of integrated circuits decrease, processors become more sensitive to noise disturbances such as crosstalk, electromagnetic interference (EMI), and particle radiation. The computer reliability problems caused by hardware transient faults are increasingly important. Especially in radiation environments, hardware transient faults produced by particle radiation are one of the most important factors influencing computer system reliability. Using radiation-hardened components in radiation environments can prevent hardware transient faults, but because of their high price, low performance, and high power consumption, they are not suitable for today's high-performance computing. COTS components offer high performance, low price, and low power consumption, and software and hardware techniques applied to them can tolerate hardware transient faults and improve system reliability, so COTS components can be used in radiation environments. Among hardware transient faults, the most damaging are those attributed to control flow jump errors. To reduce the influence of this kind of fault, the dissertation mainly discusses control flow error checking algorithms and their evaluation methods.
     Research on control flow error checking algorithms mainly covers high-level languages and assembly language. Assembly-language-based control flow error checking algorithms are easier to implement, and both their system performance overhead and their undetected error ratio are lower than those of high-level-language-based algorithms, so the dissertation focuses on assembly-language-based algorithms. Algorithms of this kind mainly use signature techniques and must solve four problems: the checking granularity, the representation of signature information, the location of the checking instructions, and the signature checking method. This dissertation studies these questions and presents improved algorithms. To obtain a theoretically more accurate analysis of the checking capability of control flow checking algorithms, the dissertation further improves the existing model for verifying control flow error checking ability. At the same time, most control flow error checking algorithms have no recovery ability, so how to use the characteristics of the microprocessor architecture to achieve control flow error recovery is a problem worth studying.
     In response to these problems, the main contents are as follows:
     (1) Because the traditional model seldom considers how the added checking instructions influence the checking ability of the algorithm, and in order to analyze the checking capability of control flow checking algorithms more accurately, this dissertation studies the model for verifying control flow error checking ability.
     (2) Since the assembly-language-based CFCSS algorithm is quite practical, this dissertation studies its checking confusion and checking error problems. At the same time, to reduce system power consumption and the number of checking points, the dissertation presents the LPICFCSS algorithm, which modifies the checking granularity without weakening the control flow error checking ability. To improve control flow error checking ability and eliminate redundant dependences among basic blocks, the dissertation studies the signature representation and checking method and presents the assembly-language-based DPNCFC algorithm.
     (3) Signature-based control flow error checking algorithms fix the checking locations at compile time, which delays the discovery of faults and reduces system reliability. At the same time, since algorithms of this kind treat the basic block as the basic unit of checking, control flow jump errors inside a basic block cannot be detected without redundant checking instructions. To address these two points, the dissertation studies control flow error checking algorithms from a combined software/hardware perspective.
     (4) Since most control flow checking algorithms have no fault tolerance capability, and in order to couple control flow error checking more closely with breakpoint recovery techniques, the dissertation uses a combined hardware/software method to study control flow error tolerance on the R80515 architecture.
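     Item (4) pairs error checking with recovery. The following Python sketch shows the general idea only: when a check at block entry fails, execution rolls back to the last saved state and resumes from there. Everything in it is an assumption made for illustration, including the function name `run_with_rollback`, the list-of-blocks program model, the per-block checkpoint, and the comparison of `pc` with `expected`, which merely stands in for a real signature check. It does not represent the dissertation's R80515-based method.

```python
import copy


def run_with_rollback(blocks, state, fault_at=None, max_retries=3):
    """Run `blocks` in order, rolling back when a wrong jump is detected.

    blocks:   list of functions, each mutating `state` (hypothetical model).
    fault_at: index of the block after which one transient fault makes
              execution skip the next block; None means no fault.
    Returns (final_state, number_of_rollbacks).
    """
    pc = 0                                   # block actually about to run
    expected = 0                             # block that should run next
    checkpoint = (0, copy.deepcopy(state))   # (resume index, saved state)
    retries = 0
    fault_pending = fault_at is not None
    while True:
        if pc != expected:                   # stand-in for a signature check
            if retries == max_retries:
                raise RuntimeError("unrecoverable control flow error")
            retries += 1
            pc, saved = checkpoint           # roll back to the last good point
            state = copy.deepcopy(saved)
            expected = pc
            continue
        if pc == len(blocks):                # reached the end legally
            return state, retries
        blocks[pc](state)
        checkpoint = (pc + 1, copy.deepcopy(state))
        expected = pc + 1
        if fault_pending and pc == fault_at:
            fault_pending = False            # the fault is transient: once only
            pc += 2                          # wrong jump: skips one block
        else:
            pc += 1


def make_block(tag):
    """Build a block that records its tag in state['log']."""
    return lambda state: state["log"].append(tag)


program = [make_block("a"), make_block("b"), make_block("c")]
clean_state, clean_retries = run_with_rollback(program, {"log": []})
faulty_state, faulty_retries = run_with_rollback(program, {"log": []}, fault_at=0)
```

     Saving a checkpoint after every block keeps the sketch short; a practical scheme would balance checkpoint frequency against its time and storage cost.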