Software Fault Tolerance for Hardware Faults
Abstract
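Space computers are the basic platform for information processing in space and are of great strategic importance. In the space environment, transient hardware faults pose a prominent reliability problem for space computers. Radiation-hardened components can improve the reliability of a space computer, but their performance is very low, their price is very high, and their power consumption is also high, so they are unsuitable for building high-performance space computers for scientific computing. COTS components offer high performance at low price and low power consumption. Tolerating transient hardware faults on COTS components through software techniques can therefore provide a highly reliable, high-performance, low-cost, and low-power solution for space computers.
     However, no model yet describes how software affects the propagation of transient hardware faults, how well software can tolerate such faults, or what effect that ability has on the system. Software redundancy also brings considerable overhead while tolerating hardware faults, and reducing the impact of that overhead is a further problem to be solved.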
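     This paper first establishes a computational data flow model and, on top of it, an error flow model. By distinguishing two different types of errors and introducing 6 error propagation rules and 2 error independence laws, we compute the probability that any datum in the error flow model is erroneous at any time. On this basis, following the essential meaning of fault tolerance, we give a probabilistic definition of the fault tolerance capability of a program, and we analyze how that capability affects the fault tolerance capability and performance of a software-implemented dual-redundancy fault-tolerant system. Taking the fault tolerance capability of a program as the optimization objective, we propose the concept and method of improving it through equivalent transformations based on error flow analysis. Also on the basis of error flow analysis, we propose two optimizations of fault tolerance algorithms that markedly increase performance and reduce power consumption.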
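     The main contributions of this paper are as follows:
     1. By introducing the concepts of atomic data and computational relations, we establish a computational data flow model that describes the connections in time and space that computation creates between storage units. By introducing an error probability function for atomic data and an error propagation probability function for computational relations, we build an error flow model on top of the computational data flow model. The error flow model gives a probabilistic description of how computational relations propagate hardware errors and yields the probability that any storage unit is erroneous at any time. Together these form a theoretical framework for error flow analysis.
     2. Based on error flow analysis, we propose the concept of the fault tolerance capability of a program, give a method for calculating it, and put forward the view that tolerating errors is an intrinsic property of a program. Taking this capability as the optimization objective, we propose a method that improves it through equivalent transformations based on error flow analysis alone, without any explicit redundancy. We also apply error flow analysis to describe how to build a dual-redundancy fault-tolerant system and analyze how improving the fault tolerance capability of a single software replica affects such a system.
     3. We propose the concept of the key subgraph of an error flow graph, which has a critical influence on the fault tolerance capability of a program, and, based on error flow analysis, give methods for generating the key subgraph from key nodes and from key paths. We further propose a partial redundancy fault tolerance algorithm that replicates only the key subgraph. Compared with the EDDI algorithm, the partial redundancy algorithm improves IPC by 10%, reduces execution time by 15%, and reduces energy consumption by 10%, at the cost of a small loss in error coverage.
     4. By analyzing the performance and power losses that the EDDI algorithm incurs through its inserted branch instructions, we propose an error flow compression algorithm that reduces the number of branch instructions by means of additional computation. Compared with the EDDI algorithm, the error flow compression algorithm improves performance by 12%, reduces execution time by 10%, and reduces energy consumption by 5%, at the cost of a small increase in error latency.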
Onboard computers are very important to information processing in space. In space environments, transient hardware faults have a severe impact on onboard computers. Radiation-hardened components can improve system reliability, but their performance lags several generations behind that of COTS components. They are also very expensive because of their limited availability, and they often consume more power, take up more space, and weigh more. They are therefore not suitable for building high-performance space computers. By comparison, COTS components offer much higher performance, lower prices, and lower power dissipation. Software-implemented hardware fault tolerance on COTS components can provide space computers with high reliability, high performance, low cost, and low power dissipation.
     But problems remain: how hardware faults propagate within software, how the fault tolerance capability of software can be measured, and what effect that capability has on system reliability. Using software to tolerate hardware faults also incurs considerable overhead, and how to minimize this overhead is still an open problem.
     In this paper, we first set up a computational data flow model, on which we build an error flow model. By categorizing errors into two kinds and introducing 6 error propagation rules and 2 error independence rules, we obtain the error probability of any datum at any time. Following the concept of fault tolerance, we define the fault tolerance capability of a program, and we analyze its consequences for the fault tolerance and performance of a system. Taking fault tolerance capability as the target, we propose improving it at compile time through equivalent transformations based on error flow analysis. Finally, we give two optimized fault tolerance algorithms that improve performance and reduce power dissipation at the same time.
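     The error probability calculation described above can be illustrated with a minimal sketch. It is not the model from this paper: the abstract does not state the 6 propagation rules or the 2 independence rules, so the sketch assumes one simplified rule in which all error sources feeding a datum are independent. The names `Node`, `own_error`, and `error_probabilities`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One datum in a (hypothetical) computational data flow graph."""
    name: str
    own_error: float  # assumed probability that a fault hits this datum directly
    # (source name, assumed probability that an error in the source reaches this datum)
    inputs: list = field(default_factory=list)


def error_probabilities(nodes):
    """Propagate error probabilities through nodes given in topological order.

    Simplifying assumption (not taken from the paper): the direct fault and
    every incoming propagation are independent events, so a datum is correct
    only if none of them occurs.
    """
    prob = {}
    for node in nodes:
        p_correct = 1.0 - node.own_error
        for src, q in node.inputs:
            p_correct *= 1.0 - q * prob[src]
        prob[node.name] = 1.0 - p_correct
    return prob


# Hypothetical example: c = a + b always propagates input errors (q = 1.0),
# while d masks half of the errors arriving from c (q = 0.5).
graph = [
    Node("a", 0.01),
    Node("b", 0.01),
    Node("c", 0.0, [("a", 1.0), ("b", 1.0)]),
    Node("d", 0.0, [("c", 0.5)]),
]
probabilities = error_probabilities(graph)
```

     With these made-up numbers, `c` is erroneous with probability of about 0.0199 and `d` with about 0.00995. The independence assumption becomes inexact when two inputs of a datum share a common ancestor, which is one reason a full model needs explicit independence rules.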
     Our major contributions are as follows:
     1. We define the concepts of atomic data and computational relations to describe the relations that computation in a program creates between registers or other storage units, and we set up the computational data flow model. We define an error probability function for atomic data and an error propagation probability function for computational relations, with which we build the error flow model on top of the computational data flow model. The error flow model describes probabilistically how errors propagate through computational relations. By analyzing the error flow model, we can compute the error probability of any register or other storage unit at any time. Together these form a theoretical framework for error flow analysis.
     2. To measure how well a program tolerates faults, we define the concept of fault tolerance capability based on error flow analysis and give a method for calculating the fault tolerance capability of any program. We propose a method that improves a program's fault tolerance capability through error flow analysis and equivalent transformation, without any explicit redundancy. Finally, we apply error flow analysis to describe how to build a dual-redundancy fault-tolerant system and to describe how such a system is affected when the fault tolerance capability of a single program replica is improved.
     3. We propose the concept of the key subgraph of an error flow graph, which has a critical effect on a program's fault tolerance capability, and give methods for generating the key subgraph from key nodes or key paths. We also propose a partial redundancy fault tolerance algorithm that replicates only the key subgraph instead of the whole error flow graph. Compared with EDDI, partial redundancy improves IPC by 10%, reduces execution time by 15%, and reduces power dissipation by 10%, at the cost of a very small loss of error coverage.
     4. Based on error flow analysis, we propose an error flow compression algorithm that reduces the branch instructions inserted by the EDDI algorithm, which have a large impact on performance and power dissipation. Compared with EDDI, the error flow compression algorithm improves IPC by 12%, reduces execution time by 10%, and reduces power dissipation by 5%, at the cost of a very small increase in error latency.
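     The trade-off behind the last contribution (fewer branch instructions in exchange for a slightly longer error latency) can be sketched as follows. This is an illustration under assumptions, not the algorithm from this paper: it assumes that the additional computation folds the comparisons between the original and the duplicated results into one accumulated difference that is tested once. The function names, the computation `v * 2 + 1`, and the simulated bit flip are all hypothetical.

```python
def run_eddi(values, fault_at=None):
    """EDDI-style check: duplicate each computation, compare before every store.

    Returns (stored values, number of check branches, index where an error
    was detected or None).
    """
    stored, branches = [], 0
    for i, v in enumerate(values):
        master = v * 2 + 1
        shadow = v * 2 + 1          # duplicated instruction
        if i == fault_at:
            master ^= 0x4           # simulated transient single-bit flip
        branches += 1               # one compare-and-branch per store
        if master != shadow:
            return stored, branches, i
        stored.append(master)
    return stored, branches, None


def run_compressed(values, fault_at=None):
    """Assumed compression: accumulate differences, branch once at the end.

    Returns (stored values, number of check branches, error detected flag).
    """
    stored, diff = [], 0
    for i, v in enumerate(values):
        master = v * 2 + 1
        shadow = v * 2 + 1
        if i == fault_at:
            master ^= 0x4
        diff |= master ^ shadow     # extra arithmetic replaces the branch
        stored.append(master)       # a corrupted value may be stored first
    return stored, 1, diff != 0     # single check after the loop
```

     In this sketch `run_eddi` executes one check branch per stored value and stops before a corrupted value is stored, while `run_compressed` executes a single check but only reports the error after the loop. That delay is the kind of added error latency the abstract mentions.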
引文
[1] http://science.ksc.nasa.gov/history/apollo/apollo.html
    
    [2] http://history.nasa.gov/apollo.html
    
    [3] http://spaceflight.nasa.gov/history/apollo/
    
    [4] http://military.people.com.cn/GB/1078/3765449.html
    
    [5] http://news.xinhuanet.com/st/2005-10/21 /content_3661334.htm
    
    [6] http://news.xinhuanet.com/photo/2003-06/20/content_928726.htm
    
    [7] http://www.people.com.cn/GB/keji/1059/2965140.html
    
    [8] http://glast.gsfc.nasa.gov/
    
    [9] Gehrels N, Michelson P. GLAST: The Next-Generation High-Energy,Gamma-Ray Astronomy Mission. Astroparticle Physics, 1999, 11(1-2):277-282
    [10] Michelson, Peter F. GLAST: A detector for high-energy gamma rays. In Proc. of SPIE, Volume 2806, Gamma-ray and cosmic-ray detectors, techniques, and missions, 5-7 Aug. 1996. Denver, CO, USA:SPIE. 31-40
    [11] http://www.nasa.gov/mission_pages/swift/main/index.html
    [12] Fox D B, Frail D A, P Price A, et al. The afterglow of GRB 050709 and the nature of the short-hard y -ray bursts. Nature, 2005,437(7060):845-850
    [13] Katz D L, Springer P L, Granat R, et al. Applications Development for a Parallel COTS Spaceborne Computer. In Proc. of 3rd High Performance Embedded Computing (HPEC'99), Lincoln Laboratory, MIT, Lexington, MA:IEEE CS,1999.
    
    [14] http://jwst.gsfc.nasa.gov/
    
    [15] Sebastien L, John J D, Brian R, James S. Dynamic testing of a subscale sunshield for the Next Generation Space Telescope (NGST), In Proc. Of 42~(nd) Structures,Structural Dynamics, and Materials Conference and Exhibit, 16-19 Apr. 2001.Seattle, WA, USA:IEEE CS, 2001.
    
    [16] Robert C, Jeremy A S, David R, et al. ESA-NGST Integral Field and Multiobject Spectrograph slicer system , In Proc. SPIE Vol. 4013, UV, Optical, and IR Space Telescopes and Instruments, 29-31 Mar. 2000. Munich, Germany:SPIE, 2000.851-860
    
    [17] Santisteban N, Hanisch M A, Offenberg R J, et al. On-Board Supercomputing for NGST and NASA's Remote Exploration and Experimentation Project, in ASP Conf. Vol. 216, Astronomical Data Analysis Software and Systems IX, eds. N.Manset C, Veillet D. Crabtree, San Francisco, USA: IEEE CS, 2000. 311
    [18] Crocker J, Atkinson C, Ebbets D, et al. TRW/Ball: Next Generation Space Telescope, NGST . In Proc. SPIE Vol. 4013, UV, Optical, and IR Space Telescopes and Instruments, 29-31 Mar. 2000. Munich, Germany:SPIE, 2000. 27-34
    
    [19] http://hubble.nasa.gov/index.php
    [20] Estlin T, Gaines D Chouinard C et al. Enabling autonomous rover science through dynamic planning and scheduling. In Proc. of Aerospace 2005, 5-12 March 2005,Big Sky, MT, USA:IEEE CS. 385-396
    [21] Weisbin C R, Rodriguez G, Schenker P S, et al. Autonomous rover technology for Mars Sample Return. In Proc. of the 5~(th) International Symposium on Artificial Intelligence, Robotics and Automation in Space, 1-3 June 1999. Noordwijk,Netherlands:IEEE CS, 1999. 1-10
    [22] Lacroix S, Mallet A, Bonnafous D, et al. Autonomous rover navigation on unknown terrains: functions and integration. International Journal of Robotics Research, Oct.-Nov. 2002, 21(10-11):917-942
    [23] BARES J, HEBERT M, KANADE T, et al. Ambler - An autonomous rover for planetary exploration. IEEE Computer, June 1989, 22(6): 18-26.
    [24] Castano R, Judd M, Estlin T, et al. Autonomous onboard traverse science system. In Proc. of Aerospace 2004, 6-13 March 2004. Big Sky, MT, USA:IEEE CS,2004. (1):167-167
    [25] Tompkins P, Stentz A, Wettergreen D. Global path planning for Mars rover exploration, In Proc. of Aerospace 2004, 6-13 March 2004. Big Sky, MT,USA:IEEE CS, 2004. (2):801-815
    [26] http://marsrovers.nasa.gov/home/
    
    [27] http://www.nasa.gov/mission_pages/stereo/main/mdex.html
    [28] http://phoenix.lpl.arizona.edu/
    [29] http://mars.jpl.nasa.gov/missions/future/phoenix.html
    [30] Ziegler J F. IBM experiments in soft fails in computer electronics (1978-1994).IBM Journal of Research and Development, 1996,40(1):3-18
    [31] Normand E. Single event upset at ground level. IEEE Trans. on Nuclear Science,December 1996, 43(6):2742-2750
    [32] Taber A, Normand E. Single event upset in avionics. IEEE Trans. on Nuclear Science, 1993,40(2):120-126
    [33] Sims A J, Dyer C S, Peerless C L, et al. The single event upset environment for avionics at high latitude. IEEE Trans. on Nuclear Science, Dec 1994,41(6):2361-2367
    [34] CAMPBELL A, MCDONALD P, RAY K. Single event upset rates in space.IEEE Trans. on Nuclear Science, Dec. 1992, 39(6):1828-1835
    [35] Pickel, James C. Single-event effects rate prediction. IEEE Trans. on Nuclear Science, 1996,43(2):483-495
    [36] Dodd P E. Device simulation of charge collection and single-event upset, IEEE Trans. on Nuclear Science, 1996, 43(2):561-575
    [37] Koga R, Penzin S H, Crawford K B, et al. Single event upset (SEU) sensitivity dependence of linearintegrated circuits (ICs) on bias conditions. IEEE Trans. on Nuclear Science, Dec 1997,44(6):2325-2332
    [38] Woodruff R L, Rudeck P J. Three-dimensional numerical simulation of single event upset of an SRAM cell. IEEE Trans, on Nuclear Science, 1993,40(6):1795-1803
    [39] Irom F, Farmanesh F F, Johnston A H, et al. Single-event upset in commercial silicon-on-insulator PowerPC microprocessors. IEEE Trans. on Nuclear Science,Dec 2002,49(6):3148-3155
    [40] Swift G M, Fannanesh F F, Guertin S M, et al. Single-event upset in the PowerPC750 microprocessor. IEEE Trans. on Nuclear Science, Dec 2001,48(6): 1822-1827
    [41] Zoutendyk J A, Smith L S, Soli G A, et al. Experimental evidence for a new single-event upset (SEU) mode in a CMOS SRAM obtained from model verification. IEEE Trans. on Nuclear Science, 1987,34(6): 1292-1299
    [42] Schwartz H R, Nichols D K, Johnston A H. Single-event upset in flash memories,IEEE Trans. on Nuclear Science, Dec 1997,44(6):2315-2324
    [43] Hiemstra D M, Baril A. Single event upset characterization of the Pentium(R) MMX andPentium(R) II microprocessors using proton irradiation, IEEE Trans.on Nuclear Science, Dec 1999,46(6): 1453-1460
    [44] Adolphsen J, Barth J L, Stassinopoulos E G, et al. Single event upset rates on 1 Mbit and 256 Kbit memories: CRUXexperiment on APEX. IEEE Trans. on Nuclear Science, Dec 1995,42(6): 1964-1974
    [45] Johansson K, Dyreklev P, Granbom O, et al. In-flight and ground testing of single event upset sensitivity instatic RAMs. IEEE Trans. on Nuclear Science, Jun 1998,45(3): 1628-1632
    [46] Shoga M, Jobe K, Glasgow M, et al. Single event upset at gigahertz frequencies,IEEE Trans. on Nuclear Science, Dec 1994,41(6):2252-2258
    [47] Massengill L W, Baranski A E, Van Nort, et al. Analysis of single-event effects in combinational logic simulation of the AM2901 bitslice processor. IEEE Trans. on Nuclear Science, Dec. 2000,47(6):2609~2615
    [48] Irom F, Farmanesh F H, Swift G M, et al. Single-event upset in evolving commercial silicon-on-insulator microprocessor technologies. IEEE Trans. on Nuclear Science, Dec. 2003, 50(6):2107-2112
    [49] Dodd P E, Massengill L W. Basic mechanisms and modeling of single-event upset in digital microelectronics. IEEE Trans. on Nuclear Science, June 2003,50(3):583-602
    [50] Martin R C, Ghoniem N M, Song Y, et al. The size effect of ion charge tracks on single event multiple-bit upset. IEEE Trans, on Nuclear Science, 1987, 34(6):1305-1309
    [51] Makihara A, Shindou H, Nemoto N, et al. Analysis of single-ion multiple-bit upset in high-density DRAMs. IEEE Trans. on Nuclear Science, Dec. 2000,47(6):2400-2404
    [52] Neuberger G, Lima F, Cairo L, et al. A multiple bit upset tolerant SRAM memory.ACM Trans. on Design Automation of Electronic Systems, Oct. 2003, 8(4): 577 -590
    [53] Dyer C S, Comber C, Truscott P R, et al. Microdosimetry code simulation of charge-deposition spectra,single-event upsets and multiple-bit upsets. IEEE Trans.on Nuclear Science, Dec 1999,46(6): 1486-1493
    [54] Swift G M, Guertin S M. In-flight observations of multiple-bit upset in DRAMs.IEEE Trans. on Nuclear Science, Dec. 2000,47(6):2386-2391
    [55] Musseau O, Gardic F, Roche P, et al. Analysis of multiple bit upsets (MBU) in CMOS SRAM, IEEE Trans. on Nuclear Science, Dec 1996,43(6):2879-2888
    [56] Koga R, Pinkerton S D, Lie T J, et al. Single-word multiple-bit upsets in static random access devices. IEEE Trans. on Nuclear Science, Dec. 1993,40(6):1941-1946
    [57] Buchner S, Campbell A, Reed R, et al. Angular dependence of multiple-bit upsets induced by protons in a 16 mbit DRAM. IEEE Trans. on Nuclear Science, Dec.2004,51(6):3270-3277
    [58] Ziegler J F. Terrestrial cosmic rays. IBM Journal of Research and Development,1996,40(1): 19-39
    [59] Shirvani P. Fault Tolerant Computing for Radiation Environment[Ph.D. Thesis].Stanford, Calif. Stanford Univ., 2001.
    [60] Karapetian A V, Some R R, Beahan J J. Radiation fault modeling and fault rate estimation for a COTS based space-borne supercomputer. In Proc. of Aerospace Conference, 2002. Big Sky, MT, USA:IEEE CS, 2002. (5)2121-(5)2131
    [61] Ziegler J F, Muhlfeld H P, Montrose C J, et al. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development, 1996,40(1):51-72
    [62] Srinivasan G R. Modeling the cosmic-ray-induced soft-error rate in integrated circuits: An overview. IBM Journal of Research and Development, 1996,40(1):77-90
    [63] Shivakumar P, Kistler M, Keckler S W, et al. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In Proc. of the 2002 International Conference on Dependable Systems and Networks (DSN 2002).Bethesda, MD, USA:IEEE CS, 2002. 389-398
    [64] Tang H H K. Nuclear physics of cosmic ray interaction with semiconductor materials: Particle-induced soft errors from a physicist's perspective. IBM Journal of Research and Development, 1996,40(1):91-108
    [65] Freeman L B. Critical charge calculations for a bipolar SRAM array, IBM Journal of Research and Development, 1996,40(1): 119-130
    
    [66] George M. NASA's Reverse Thrust. Scientific American, June 2006,7-8
    [67] http://news.xinhuanet.com/ziliao/2005-10/10/content_3601540.htm
    [68] Some R R, Ngo D C. REE: a COTS-based fault tolerant parallel processing supercomputer for spacecraft onboard scientific data analysis. In Proc. of 18~(th) Digital Avionics Systems Conference, Oct. 1999, St Louis, MO, USA:IEEE CS,1999.Vol.2:7.B.3-l~7.B.3-12
    [69] http://www.webopedia.com
    [70] Oh N. Software Implemented Hardware Fault Tolerance[Ph.D. Thesis]. Stanford,Calif.:Stanford Univ., 2000.
    [71] Oh N, Subhasish M, McCluskey E J. ED~4I: Error detection by diverse data and duplicated instructions. IEEE Trans. on Computers, Feb. 2002, 51(2):180~199
    [72] Weaver C, Austin T. A fault tolerant approach to microprocessor design. In Proc.of International Conference on Dependable Systems and Networks(DSN 2001),July 2001, Goteborg, Sweden:IEEE CS, 2001. 411-420
    [73] Rashid F, Saluja K K, Ramanathan P. Fault tolerance through re-execution in multiscalar architecture. In Proc. of International Conference on Dependable Systems and Networks (DSN 2000), June 2000, New York, NY, USA:IEEE CS,2000. 482-491
    [74] Mendelson A, Suri N. Designing high-performance and reliable superscalar architectures: the out of order reliable superscalar (O3RS) approach. In Proc. of International Conference on Dependable Systems and Networks (DSN 2000),June 2000, New York, NY, USA:IEEE CS, 2000. 473-481
    [75] Gaisler J. Evaluation of a 32-bit microprocessor with built-in concurrent error detection. In Proc. of 27~(th) Annual International Symposium on Fault-Tolerant Computing (FTCS-27), June 1997, Seattle, WA, USA:IEEE CS, 1997. 42-46
    [76] Gaisler J. Concurrent error-detection and modular fault-tolerance in a 32-bitprocessing core for embedded space flight applications. In Proc. of 24~(th) Annual International Symposium on Fault-Tolerant Computing (FTCS-24), Jun 1994, Austin, TX, USA:IEEE CS, 1994. 128-130
    [77] Gaisler J. A portable fault-tolerant microprocessor based on the SPARC V8 architecture, In Proc. of the Conference on Data systems in aerospace (DASIA 99),May 1999, Lisbon, Portugal; NETHERLANDS:IEEE CS, 1999. 173-177
    [78] Gaisler J. A portable and fault-tolerant microprocessor based on the SPARC v8 architecture, In Proc. of International Conference on Dependable Systems and Networks (DSN 2002), June 2002, Bethesda, MD, USA:IEEE CS, 2002. 409-415
    
    [79] Gerke R D, Shapiro A A. Use of commercial off-the-shelf (COTS) for space applications. In Proc. of Aerospace Conference, Mar. 2003, Big Sky, MT,USA:IEEE CS,2003. 230
    
    [80] Sandor M, Agarwal S, Peters D. COTS actives initiative for space applications. In Proc. of Space Parts Working Group, Apr. 2002, Torrence, CA, USA:NASA JPL,2002.
    
    [81] Nikora A, Schneidewind N. Issues and Methods for Assessing COTS Reliability,Maintainability, and Availability, In Proc. of COTS Workshop/International Conference on S/W Engineering, May 1999, Los Angeles, CA, USA:NASA JPL,1999.
    
    [82] Sokol J. COTS at JPL. In Proc. of JEDEC/G12 Meeting, DSA Group Meeting,Jan 2005, San Antonio, Texas Pasadena, CA:NASA JPL, 2005.
    
    [83] Equils D J. Method for enhancing the process of software tool evaluation and selection: COTS, heritage, and custom software reviewed. In Proc. of SpaceOps 2002, May 2004, Montreal, Quebec, Canada:NASA JPL, 2004
    [84] Le C, Hensley S. Using COTS Components for Real-Time Processing of SAR Systems. In Proc. of FY'98 Trade Study, USA:NASA PL, 1998
    
    [85] Kayali S. Utilization of COTS electronics in space application, reliability challenges and reality, In Proc. of Commercialization of Military and Space Electronics Conference, Feb 2002, Los Angeles, CA, USA:NASA JPL, 2002.
    [86] Alkalai L, Tai A, Chau S. COTS-Based Fault Tolerance in Deep Space:Qualitative and Quantitative Analyses of A Bus Network Architecture. In Proc. of 4~(th) IEEE International Symposium on High Assurance (HASE 99), Nov 1999,U.S.A:IEEE CS, 1999.
    
    [87] Ramesham R, Ghaffarian R, Kim N. Reliability Issues of COTS MEMS for Aerospace Applications. In Proc. of SPIE 1999, Micromachining and Micro Fabrication, Bellingham, WA, USA:SPIE, 1999.
    [88] Chau S N, Alkalai L, Tai A T, et al. Design of a fault-tolerant COTS-based bus architecture. IEEE Transactions on Reliability, Dec 1999, 48(4):351-359
    [89] Equils D J. Method for enhancing the process of software tool evaluation and selection: COTS, heritage, and custom software. In Proc. of SpaceOps 2004, May 2004, Montreal, Canada:NASA JPL, May 2004.
    
    [90] Chau S N, Tai A T, Smith J. A design-diversity based fault-tolerant COTS avionics bus network. In Proc. of 2001 Pacific Rim International Symposium on Dependable Computing (PRDC), Dec. 2001, Seoul, Korea:IEEE CS, 2001.
    [91] Katz D, Springer P. A Spaceborne Embedded COTS Cluster for Computational Optics, In Proc. of Workshop on Computational Optic and Imaging for Space Applications , May 2000, Greenbelt, Maryland, USA:NASA JPL, 2000.
    [92] Lieneweg U. Comparison of Electrical Failure Mechanisims in COTS Parts and Their Scaling with Supply Voltage - An Overview. In Proc. of 2nd Annual Microelectronics Reliability and Qualification Workshop, Pasadena, Oct. 1999,CA, USA:NASA JPL, 1999.
    
    [93] Graves R. Application of COTS for Early, Low Cost, Avionics System Testbed. In Proc. of 2000 IEEE Aerospace Conference, Mar. 2000, Big Sky, MT,USA:NASA JPL, 2000.
    
    [94] Ramesham R, Ghaffarian R, Kim N. Reliability Assessment of COTS MEMS Components for Aerospace Environment, In Proc. of SPIE, Sep. 1999, Santa Clara,CA,USA:SPIE, 1999.
    
    [95] Ramesham, Rajeshuni, Ghaffarian, et al. Reliability issues of COTS MEMS for aerospace applications. In Proc. of SPIE Vol. 3880, p. 83-88, MEMS Reliability for Critical and Space Applications, Russell A L, William M M, Gisela L,Rajeshuni R eds., Aug. 1999.
    
    [96] Ko A. Mission adaptation reusability using COTS based system. In Proc. of 5~(th) International Symposium on Reducing the Cost of Spacecraft Ground Systems and Operations, Jul. 2003, Pasadena, CA, USA:NASA JPL, 2003.
    
    [97] Chau S, Alkalai L, Tai A. The Analysis of Multi-Layer Fault-Tolerance Methodology for Applying COTS in Deep Space Missions, In Proc. of Symposium on Application-Specific Systems and Software Engineering and Technology, Mar. 2000, Richardson, TX, USA:NASA JPL, 2000.
    
    [98] Whisnant K, Iyer R K, Jones P, et al. An experimental evaluation of the REE SIFT environment for spaceborne applications, In Proc. of International Conference on Dependable Systems and Networks (DSN 2002), 2002, Bethesda,MD, USA:IEEE CS, 2002. 585- 594
    
    [99] Whisnant K, Iyer R K, Kalbarczyk Z T, et al. The Effects of an ARMOR-based SIFT environment on the performance and dependability of user applications.IEEE Trans. on Software Engineering, April 2004, 30(4):257- 277
    [100] Chen F, Craymer L, Deifik J, et al. Demonstration of the Remote Exploration and Experimentation (REE) Fault-Tolerant Parallel-Processing Supercomputer for Spacecraft Onboard Scientific Data Processing, In Proc. of International Conference on Dependable Systems and Networks (DSN 2000), June 2000, New York, NY, USA:IEEE CS, 2000. 367-372
    
    [101] Chen D, Dharmaraja S, Chen D, et al. Reliability and availability analysis for the JPL Remote Exploration and Experimentation System. In Proc. of International Conference on Dependable Systems and Networks (DSN 2002), June 2002,Bethesda, MD, USA:IEEE CS, 2002. 337- 342
    
    [102] Beahan J, Edmonds L, Ferraro R D, et al. Detailed radiation fault modeling of the Remote Exploration and Experimentation (REE) first generation testbed architecture. In Proc. of Aerospace Conference Mar. 2000, Big Sky, MT,USA:IEEE CS, 2000. (5)279-281
    [103] Madeira H, Some R R, Moreira F, et al. Experimental evaluation of a COTS system for space applications. In Proc. of International Conference on Dependable Systems and Networks (DSN 2002), June 2002, Bethesda, MD, USA:IEEE CS,2002. 325- 330
    [104] Some R R, Kim W S, Khanoyan G, et al. Fault injection experiment results in space borne parallel application programs. In Proc. of Aerospace Conference,2002, Big Sky, MT, USA:IEEE CS, 2002. Vol. 5:2133-2147
    [105] Karapetian A V, Some R R, Beahan J J. Radiation fault modeling and fault rate estimation for a COTS based space-borne supercomputer. In Proc. of Aerospace Conference, 2002, Big Sky, MT, USA:IEEE CS, 2002. Vol. 5:2121-2131
    [106] Raphael R, Some W S, Kim G K, et al. A Software-Implemented Fault Injection Methodology for Design and Validation of System Fault Tolerance. In Proc. of The International Conference on Dependable Systems and Networks (DSN 2001),July 2001, Goteborg, Sweden:IEEE CS, 2001. 501-506
    [107] http://beowulf.gsfc.nasa.gov/ESS/brochures/2000/intro.htm
    [108] Mitra S. Diversity Techniques for Concurrent Error Detection[Ph.D. Thesis].Stanford, Calif.Stanford Univ., 2000.
    [109] Oh N, Shirvani P P, McCluskey E J. Control-flow checking by software signatures. IEEE Trans. on Reliability, Mar. 2002, 51(1): 111-122
    [110] Oh N, Shirvani P P, McCluskey E J. Error detection by duplicated instructions in super scalar processors, IEEE Trans. on Reliability, Mar. 2002, 51(1):63-75
    [111] Oh N, Mitra S, McCluskey E J. ED~4I: Error detection by diverse data and duplicated instructions, IEEE Trans. on Computers, Feb. 2002, 51(2): 180-199
    [112] Shirvani P P, Edward J M. Fault-Tolerant Systems in A Space Environment: The CRC ARGOS Project. CRC Technical Report No. 98-2, Stanford Univ., Stanford,California, USA:Center for Reliable Computing, Dec. 1998.
    [113] Oh N, McCluskey E J. Error detection by selective procedure call duplication for low energy consumption. IEEE Trans. on Reliability, Dec 2002, 51(4):392- 402
    [114] Shirvani P P, Saxena N R, McCluskey E J. Software-implemented EDAC protection against SEUs. IEEE Trans. on Reliability, Sept. 2000, 49(3):273-284
    [115] Reis G A, Chang J, Vachharajani N, et al. SWIFT: software implemented fault tolerance. In Proc. of International Symposium on Code Generation and Optimization (CGO 2005), Mar. 2005, San Jose, CA, USA;IEEE CS, 2005. 243-254
    [116] Mitra S, Saxena N R, McCluskey E J. A design diversity metric and analysis of redundant systems. IEEE Trans. on Computers, May 2002. 51(5): 498-510
    [117] Mitra S, Saxena N R, McCluskey E J. A design diversity metric and reliability analysis for redundant systems. In Proc. of International Test Conference, 1999,Atlantic City, NJ, USA:IEEE CS, 1999. 662-671
    [118]Al-Yamani A A,Oh N,McCluskey E J.Performance evaluation of checksum based ABFT.In Proc.of International Symposium on Defect and Fault Tolerance in VLSI Systems,2001,San Francisco,CA,USA:IEEE CS,2001.461-466
    [119]Shirvani P P.Software-Implemented Hardware Fault Tolerance Experiments:COTS in Space.In Proc.of International Conference on Dependable Systems and Networks(DSN 2000),Fast Abstracts,June 2000,New York,NY:IEEE CS,2000.6-7
    [120]Golombek M P.Overview of the Mars Pathfinder mission:launch through landing,surface operations,data sets,and science results.Journal of Geophysical Research American Geophy,Apr.1999,Union 25,104(E4):8523-8553
    [121]Matijevic J.The mission and operation of the Mars Pathfinder microrover.CONTROL ENGINEERING PRACTICE Elsevier,June 1997,5(6):827-835
    [122]Titus J L.First Observations of Enhanced Low Dose Rate Sensitivity in Space:One part of the MPTB Experiment.IEEE Trans.on Nuclear Science,Dec.1998,45(6):2673-2680
    [123]Avizienis A.Design of Fault-Tolerant Computers.In Proc.of AFIPS Fall Joint Computer Conferenc,1967,Vol.31,Washington,D.C.,USA:Thompson Books,1967.733-743.
    [124]徐拾义.可信计算系统设计和分析.第1版.北京:清华大学出版社,2006年.
    [125]Clark J A,Pradhan D K.Fault injection:a method for validating computer-system dependability,IEEE Computer,Jun 1995,28(6):47-56
    [126]Avizienis A.Toward Systematic Design of Fault-Tolerant Systems,IEEE Computer,1997,30(4):51-58
    [127]崔林,吴鹤龄.IEEE计算机先驱奖.第1版.北京:高等教育出版社,2003年.
    [128]http://heasarc.gsfc.nasa.gov/docs/copernicus/copemicus.html
    [129]http://www.avizienis.info/index-en.html
    [130]http://voyager.jpl.nasa.gov/
    [131]http://www-03.ibm.com/ibm/history/exhibits/space/space_saturn.html
    [132]ONeill P M,Badhwar G D.Single even upsets for Space Shuttle flights of new general purpose computer memory devices,IEEE Trans.on Nuclear Science,Oct.1994,41(5):1755-1764
    [133]Karl N L,Peter G N,Recent SRI work in verification,ACM SIGSOFT Software Engineering Notes,July 1981,6(3):27-35
    [134]Goldberg J,Kauts W H,Melliar-Smith P M,et al.Development and analysis of the software implemented fault-tolerance(SIFT) computer.The fault-tolerant multiprocessor computer(A87-13197 03-62),Park Ridge,NJ,USA:Noyes Publications,1986.507-731
    [135]Wensley J H,Lamport L,Goldberg J,et al.SIFT:Design and analysis of a fault-tolerant computer for aircraft control. Proceedings of the IEEE, Oct. 1978,66(10): 1240- 1255
    
    [136] Hopkins A L, Smith T B, Lala J H. FTMP: A highly reliable Fault-Tolerant Multiprocessor for aircraft, Proceedings of the IEEE, Oct. 1978, 66(10):1221-1239
    
    [137] Finelli G. Characterization of fault recovery through fault injection on FTMP.IEEE Trans. on Reliability, June 1987, R-36:164-170
    
    [138] Smith T B, Lala J H. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer, The fault-tolerant multiprocessor computer (A87-13197 03-62), Park Ridge, NJ, USA:Noyes Publications, 1986. 1-506
    
    [139] http://www.sri.com/
    
    [140] Harper R E, Lala J H, Deyst J J. Fault Tolerant Parallel Processor Architecture Overview. In Proc. of 25~(th) International Symposium on Fault-Tolerant Computing (FTCS-25), Highlights from Twenty-Five Years, Jun. 1995, Pasadena, Calif.USA:IEEE CS, 1995. 62
    
    [141] Harper R E, Lala J H, Deyst J J. Fault-Tolerant Parallel Processor Architectural Overview. In Proc. of the 18~(st) International Symposium on Fault-Tolerant Computing (FTCS-18), June 1988, Tokyo:IEEE CS, 1988. 252-257
    
    [142] Harper R E, Lala J H. Fault-tolerant parallel processor, Journal of Guidance,Control, and Dynamics. May-June 1991, 14:554-563
    
    [143] Avizienis A. Fault-tolerance: The survival attribute of digital systems, Proceedings of the IEEE, Oct. 1978, 66(10): 1109- 1125
    
    [144] Gray J. A census of Tandem system availability between 1985 and 1990, IEEE Trans.on Reliability, Oct 1990, 39(4): 409-418
    
    [145] Joel F B. A NonStop kernel. ACM SIGOPS Operating Systems Review, Dec.1981, 15(5):22-29
    
    [146] http://www.stratus.com/
    
    [147] Webber S, Beirne J. The Stratus Architecture. In Proc. of the 21~(st) International Symposium on Fault-Tolerant Computing (FTCS-21), June 1991, Montreal, Que.,Canada:IEEE CS, 1991.79-85
    
    [148] http://www.hp.com
    
    [149] Vogels W, Dumitriu D, Birman K, et al. The Design and Architecture of the Microsoft Cluster Service: A Practical Approach to High-Availability and Scalability. In Proc. of the 28~(th) Annual International Symposium on Fault-Tolerant Computing (FTCS-28), 1998, Munich, Germany:IEEE CS, 1998. 422
    
    [150] Neumann J V. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Automata Studies, C.E. Shannon and J. McCarthy, eds.,Annals of Math Studies, 1956, Princeton, N.J. USA:Princeton Univ. Press, 1956.34:43-98
    [151] Lyons R E, Vanderkulk W. The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 1962,6(2):200-209
    [152] Pradhan D K. Fault-Tolerant Computer System Design. Upper Saddle River, NJ, USA:Prentice Hall, 1996.
    [153] Rotenberg E. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. In Proc. of 29th Annual International Symposium on Fault-Tolerant Computing (FTCS-29), June 1999, Madison, WI, USA:IEEE CS, 1999. 84-91
    [154] Mukherjee S S, Kontz M, Reinhardt S K. Detailed design and evaluation of redundant multithreading alternatives. In Proc. of the 29th annual international symposium on Computer architecture (ISCA 2002), May 2002, Anchorage, AK, USA:IEEE CS, 2002. 99-110
    [155] Gomaa M, Scarbrough C, Vijaykumar T N, et al. Transient-fault recovery for chip multiprocessors. In Proc. of the 30th annual international symposium on Computer architecture (ISCA 2003), June 2003, San Diego, California, USA:IEEE CS, 2003. 98-109
    [156] Reinhardt S K, Mukherjee S S. Transient fault detection via simultaneous multithreading. In Proc. of the 27th annual international symposium on Computer architecture (ISCA 2000), June 2000, Vancouver, BC, Canada:IEEE CS, 2000. 25-36
    [157] Vijaykumar T, Pomeranz I, Cheng K. Transient-fault recovery using simultaneous multithreading. In Proc. of the 29th Annual International Symposium on Computer Architecture (ISCA 2002), May 2002, Anchorage, AK, USA:IEEE CS, 2002. 87-98
    [158] Christmansson J, Kalbarczyk A, Torin J. Dependable Flight Control System Using Data Diversity with Error Recovery. Computer Systems Science and Eng., Apr. 1994, 9(2): 142-150
    [159] Ersoz A, Andrews D M, McCluskey E J. The Watchdog Task: Concurrent Error Detection Using Assertions. CRC Technical Report TR 85-8, Stanford Univ., Stanford, California, USA:Center for Reliable Computing, 1985.
    [160] Rao T R N, Fujiwara E. Error-Control Coding for Computer Systems, Upper Saddle River, NJ, USA:Prentice Hall, 1989.
    [161] Wicker S B. Error Control Systems for Digital Communications and Storage. Upper Saddle River, NJ, USA:Prentice Hall, 1995.
    [162] Chen C L, Hsiao M Y. Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review. IBM Journal of Research and Development, Mar. 1984, 28(2): 124-134
    [163] Liu Pin. Fundamentals of Reliability Engineering. Revised edition. Beijing:Metrology Press, 2002 (in Chinese).
    [164] Avizienis A, Kelly J P J. Fault Tolerance by Design Diversity: Concepts and Experiments. IEEE Computer, Aug. 1984, 17(8): 67-80. Avizienis A. The N-Version Approach to Fault-Tolerant Software. IEEE Trans. on Software Engineering, 1985, SE-11(12): 1491-1501
    [165] Avizienis A, Chen L. On the Implementation of N-Version Programming for Software Fault-Tolerance During Program Execution. In Proc. of Int'l Computer Software and Application Conf (COMPSAC 77), Nov. 1977, Chicago, Illinois, USA:IEEE CS, 1977. 145-155
    [166] Chen L, Avizienis A. N Version Programming: A Fault Tolerance Approach to Reliability of Software Operation. In Proc. of 8th Int'l Symp. on Fault-Tolerant Computing (FTCS-8), June 1978, Toulouse, France:IEEE CS, 1978. 3-9
    [167] Laprie J C, Béounes C, Kanoun K. Definition and Analysis of Hardware- and Software-Fault-Tolerant Architectures. IEEE Computer, July 1990, 23(7): 39-51
    [168] Scott R K, McAllister D F. Cost modeling of N-version fault-tolerant software systems for large N. IEEE Transactions on Reliability, Jun. 1996, 45(2): 297-302
    [169] Anderson T, Kerr R. Recovery blocks in action: A system supporting high reliability. In Proc. of the 2nd international conference on Software engineering, 1976, Los Alamitos, CA, USA:IEEE CS, 1976. 447-457
    [170] Scott R K, Gault J W, McAllister D F. Fault-tolerant software reliability modeling. IEEE Trans. on Software Engineering, May 1987, 13(5): 582-592
    [171] Kim K H, Kavianpour A. distributed recovery block approach to fault-tolerant execution of application tasks in hypercubes. IEEE Transactions on Parallel and Distributed Systems, Jan. 1993, 4(1): 104-111
    [172] Ammann P E, Knight J C. Data Diversity: An Approach to Software Fault Tolerance. IEEE Trans. on Computers, April 1988, 37(4): 418-425
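    [173] Xu Renzuo, Xie Min, Zheng Renjie. Software Reliability Models and Applications. 1st edition. Beijing:Tsinghua University Press, 1994 (in Chinese).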
    [174] Huang K H, Abraham J A. Algorithm-based fault tolerance for matrix operations. IEEE Trans. on Computers, June 1984, 33(6): 518-528
    [175] Banerjee P, Abraham J A. Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems. IEEE Trans. on Computers, April 1986, 35(4): 296-306
    [176] Prata P, Silva J. Algorithm Based Fault Tolerance Versus Result-Checking for Matrix Computations. In Proc. of 29th International Symposium on Fault Tolerant Computing (FTCS-29), June 1999, Madison, WI, USA:IEEE CS, 1999. 4-11
    [177] Anfinson C, Luk F. A Linear Algebraic Model of Algorithm Based Fault Tolerance. IEEE Trans. on Computers, Dec. 1988, 37(12): 1599-1604
    [178] Reddy A L N, Banerjee P. Algorithm-Based Fault Detection for Signal Processing Applications. IEEE Trans. on Computers, Oct. 1990, 39(10): 1304-1308
    [179] Banerjee P, Rahmeh J T, Stunkel C, et al. Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Trans. on Computers, Sep. 1990, 39(9): 1132-1145
    [180] Wang S J, Jha N K. Algorithm-based fault tolerance for FFT networks. IEEE Trans. on Computers, July 1994, 43(7): 849-854
    [181] Redinbo G R. Generalized Algorithm-Based Fault Tolerance: Error Correction via Kalman Estimation. IEEE Trans. on Computers, June 1998, 47(6): 639-655
    [182] Rexford J, Jha N K. Partitioned encoding schemes for algorithm-based fault tolerance in massively parallel systems. IEEE Trans. on Parallel and Distributed Systems, June 1994, 5(6): 649-653
    [183] Rosenkrantz D J, Ravi S S. Improved Bounds for Algorithm-Based Fault Tolerance. IEEE Trans. on Computers, May 1993,42(5): 630-635
    [184] Gu D, Rosenkrantz D J, Ravi S S. Construction of check sets for algorithm-based fault tolerance. IEEE Trans. on Computers, June 1994,43(6): 641-650
    [185] Silva J G, Prata P, Madeira H. Practical issues in the use of ABFT and a new failure model. In Proc. of 28th Annual International Symposium on Fault-Tolerant Computing (FTCS-28), June 1998, Munich, Germany:IEEE CS, 1998. 26-35
    [186] Luk F, Park H. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Trans. on Computers, Nov. 1988, 37(11): 1434-1438
    [187] Feng G L, Rao T R N, Kolluru M S. Error correcting codes over Z_{2^m} for algorithm-based fault tolerance. IEEE Trans. on Computers, Mar. 1994, 43(3): 370-374
    [188] Vinnakota B, Jha N K. Design of algorithm-based fault-tolerant multiprocessor systems for concurrent error detection and fault diagnosis. IEEE Trans. on Parallel and Distributed Systems, Oct. 1994, 5(10): 1099-1106
    [189] Rebaudengo M, Sonza Reorda M, Violante M, et al. A Source-to-Source Compiler for Generating Dependable Software. In Proc. of 1st International Workshop on Source Code Analysis and Manipulation, Nov. 2001, Florence, Italy:IEEE CS, 2001. 33-42
    [190] Patel J H, Fung L T. Concurrent error detection in ALU's by recomputing with shifted operands. IEEE Trans. on Computers, July 1982, 31(7):589-595
    [191] Brown D T. Error detecting and correcting binary codes for arithmetic operations. IRE Trans. on Electronic Computers, Sep. 1960, EC-9: 333-337
    [192] Engel H. Data flow transformations to detect results which are corrupted by hardware faults. In Proc. of High-Assurance Systems Engineering Workshop, 1996, Oct. 1996, Niagara on the Lake, Ont, Canada:IEEE CS, 1996. 279-285
    [193] Lu D J. Watchdog Processor and Structural Integrity Checking. IEEE Trans. on Computers, July 1982, C-31(7): 681-685
    [194] Namjoo M. Techniques for Concurrent Testing of VLSI Processor Operation. In Proc. of International Test Conference, Nov. 1982, Philadelphia, PA: IEEE CS,1982. 461-468
    [195] Shen J P, Schuette M A. On-line Self-Monitoring Using Signatured Instruction Streams. In Proc. of 13th Int'l Test Conference, Oct. 1983, Philadelphia, PA: IEEE CS, 1983. 275-282
    [196] Eifert J B, Shen J P. Processor Monitoring Using Asynchronous Signatured Instruction Streams. In Proc. of 14th Annual Int'l Conf. on Fault-Tolerant Computing (FTCS-14), June 1984, Silver Spring, Md, USA:IEEE CS, 1984. 394-399
    [197] Saxena N R, McCluskey E J. Control-Flow Checking Using Watchdog Assists and Extended-Precision Checksums. IEEE Trans. on Computers, April 1990,39(4): 554-559
    [198] Wilken K, Shen J P. Concurrent Error Detection Using Signature Monitoring and Encryption: Low-Cost Concurrent-Detection of Processor Control Errors. In Proc. of the 1st IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-1), Aug. 1989, Santa Barbara, Calif., USA:Springer-Verlag, 1989. 4: 365-384
    [199] Wilken K, Shen J P. Continuous Signature Monitoring: Low-Cost Concurrent-Detection of Processor Control Errors. IEEE Trans. on Computer Aided Design, June 1990, 9(6): 629-641
    [200] Sosnowski J. Detection of control flow errors using signature and checking instructions. In Proc. of International Test Conference, Sep. 1988, Washington,DC, USA:IEEE CS, 1988. 81-88
    [201] Miremadi G, Ohlsson J T, Rimen M J, et al. Use of Time, Location and Instruction Signatures for Control Flow checking. In Proc. of the 5th IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-5), Sep. 1995, Urbana-Champaign, Illinois, USA: Springer-Verlag.
    [202] Miremadi G, Karlsson J, Gunneflo J U, et al. Two Software Techniques for On-line Error Detection. In Proc. of 22nd Annual Int'l Symp. on Fault-Tolerant Computing (FTCS-22), July 1992, Boston, Massachusetts, USA:IEEE CS, 1992. 328-335
    [203] Tian J. Integrating Time Domain and Input Domain Analyses of Software Reliability Using Tree-Based Models. IEEE Trans. on Software Engineering, Dec. 1995, 21(12): 945-958.
    [204] Huang C Y, Lyu M R. A Unified Scheme of Some Nonhomogenous Poisson Process Models for Software Reliability Estimation. IEEE Trans. on Software Engineering, Mar. 2003, 29(3): 261-269.
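    [205] Chen Huowang, Qian Jiahua, Sun Yongqiang. Principles of Compilers. 2nd edition. Beijing:National Defense Industry Press, 1999 (in Chinese).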
    [206] Burger D C, Austin T M. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, New York, NY, USA:ACM Press, 1997. 25(3): 13-25
    [207] Young C, Smith M D. Static correlated branch prediction. ACM Trans. on Programming Languages and Systems, May 1999, 21(5): 1028-1075
    [208] Wu Y, Larus J R. Static branch frequency and program profile analysis. In Proc. of the 27th annual international symposium on Microarchitecture (Micro-27), 1994, San Jose, California, USA:ACM Press, 1994. 1-11
    [209] Patterson J R C. Accurate static branch prediction by value range propagation. In Proc. of the ACM conference on Programming language design and implementation (SIGPLAN 1995), 1995, La Jolla, California, USA:ACM Press, 1995. 67-78.
    [210] Yang Dongping, Li Angsheng. Computability Theory. 1st edition. Beijing:Science Press, 1999 (in Chinese).
    [211] Patterson D A, Hennessy J L. Computer Architecture: A Quantitative Approach. 2nd Edition. San Francisco, Calif., USA:Morgan Kaufmann Publishers Inc., 1990.
    [212] Parikh D, Skadron K, Zhang Y, et al. Power-Aware Branch Prediction: Characterization and Design. IEEE Trans. on Computers, Feb. 2004, 53(2): 168-186
    [213] Parikh D, Skadron K, Zhang Y, et al. Power Issues Related to Branch Prediction. In Proc. of the 8th International Symposium on High Performance Computer Architecture (HPCA 02), Boston, MA, USA:IEEE CS, 2002. 233-246
    [214] Hartstein A, Puzak T R. Optimum Power/Performance Pipeline Depth. In Proc. of the 36th International Symposium on Microarchitecture (MICRO-36), Dec. 2003, San Diego, CA, USA:IEEE CS, 2003. 117-128
    [215] Srinivasan V, Brooks D, Gschwind M, et al. Optimizing Pipelines for Power and Performance. In Proc. of the 35th annual ACM/IEEE international symposium on Microarchitecture (Micro-35), Nov. 2002, Istanbul, Turkey:IEEE CS, 2002. 333-344
    [216] Heo S, Asanovic K. Power-Optimal Pipelining in Deep Submicron Technology. In Proc. of the 2004 International Symposium on Low Power Electronics and Design, Aug. 2004, Newport Beach, CA:ACM Press, 2004. 218-223
    [217] Aragón J L, González J, González A. Power-Aware Control Speculation through Selective Throttling. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA 03), Feb. 2003, Anaheim, California, USA:IEEE CS, 2003. 103-112
    [218]http://cag.csail.mit.edu/streamit
    [219] Hsueh M C, Tsai T K, Iyer R K. Fault injection techniques and tools. IEEE Computer, Apr. 1997, 30(4): 75-82
    [220] Arlat J, Aguera M, Amat L, et al. Fault injection for dependability validation: a methodology and some applications. IEEE Trans. on Software Engineering, Feb. 1990, 16(2): 166-182
    [221] Clark J A, Pradhan D K. Fault injection: a method for validating computer system dependability. IEEE Computer, Jun. 1995, 28(6): 47-56
    
    [222] Kanawati G A, Kanawati N A, Abraham J A. FERRARI: a flexible software-based fault and error injection system. IEEE Trans. on Computers, Feb. 1995, 44(2): 248-260
    [223] Segall Z, Vrsalovic D, Siewiorek D, et al. FIAT: fault injection based automated testing environment. In Proc. of 18th International Symposium on Fault-Tolerant Computing (FTCS-18), Jun. 1988, Tokyo, Japan:IEEE CS, 1988. 102-107
    [224] Barton J H, Czeck E W, Segall Z Z, et al. Fault injection experiments using FIAT. IEEE Trans. on Computers, Apr. 1990, 39(4): 575-582
    [225] Carreira J, Madeira H, Silva J G. Xception: Software Fault Injection and Monitoring in Processor Functional Units. IEEE Trans. on Software Engineering, Feb. 1998, 24(2): 1-25
    [226] Carreira J, Madeira H, Silva J G. Xception: a technique for the experimental evaluation of dependability in modern computers. IEEE Trans. on Software Engineering, Feb. 1998, 24(2): 125-136
    [227] Sieh V, Tschache O, Balbach F. VERIFY: evaluation of reliability using VHDL-models with embedded fault descriptions. In Proc. of 27th Annual International Symposium on Fault-Tolerant Computing (FTCS-27), Jun. 1997, Seattle, WA, USA:IEEE CS, 1997. 32-36
    [228] Han S, Shin K G, Rosenberg H A. DOCTOR: an integrated software fault injection environment for distributed real-time systems. In Proc. of International Computer Performance and Dependability Symposium (IPDS'95), April 1995, Erlangen, Germany:IEEE CS, 1995. 204-213
    [229] Kao W L, Iyer R K, Tang D. FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults. IEEE Trans. on Software Engineering, Nov. 1993, 19(11): 1105-1118
    [230] Choi G S, Iyer R K. FOCUS: an experimental environment for fault sensitivity analysis. IEEE Trans. on Computers, Dec 1992, 41(12): 1515-1526
    [231] Goswami K K. DEPEND: a simulation-based environment for system level dependability analysis. IEEE Trans. on Computers, Jan 1997, 46(1): 60-74
    [232] Adiga N R, Almasi G, Almasi G S, et al. An Overview of the BlueGene/L Supercomputer. In Proc. of ACM/IEEE 2002 Conference on Supercomputing, Nov. 2002, Baltimore, MD, USA:IEEE CS, 2002. 1-22
    [233] Oliner A J, Sahoo R K, Moreira J E, et al. Fault-aware job scheduling for BlueGene/L systems. In Proc. of 18th International Parallel and Distributed Processing Symposium (IPDPS'04), April 2004, Santa Fe, New Mexico, USA:IEEE CS, 2004. 64-73
    [234] Giampapa M E. Blue Gene/L advanced diagnostics environment. IBM Journal of Research and Development, 2005, 49(2): 319-331
    [235] Sancho J C, Petrini F, Davis K, et al. Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance. In Proc. of 19th International Parallel and Distributed Processing Symposium (IPDPS'05), April 2005, Denver, CO, USA:IEEE CS, 2005. 300
    [236] Sankaran S, Squyres J M, Barrett B, et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. International Journal of High Performance Computing Applications, 2005, 19(4): 479-493
    [237] Zhang Y, Xue R, Wong D, et al. A Checkpointing/Recovery System for MPI Applications on Cluster of IA-64 Computers. In Proc. of 2005 International Conference on Parallel Processing Workshops (ICPPW'05), June 2005, Oslo, Norway:IEEE CS, 2005. 320-327
    [238] Cao J, Li Y, Guo M. Process Migration for MPI Applications based on Coordinated Checkpoint. In Proc. of 11th International Conference on Parallel and Distributed Systems (ICPADS'05), July 2005, Fukuoka, Japan:IEEE CS, 2005. 306-312
    [239] Gao Q, Yu W, Huang W, et al. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. In Proc. of 2006 International Conference on Parallel Processing (ICPP'06), Aug. 2006, Columbus, Ohio, USA:IEEE CS, 2006.471-478
    [240] Gioiosa R, Sancho J C, Song J, et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers. In Proc. of the 2005 ACM/IEEE conference on Supercomputing, March 2005, Seattle, WA, USA:IEEE CS, 2005. 9
    [241] Brooks D, Tiwari V, Martonosi M. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. of the 27th Annual International Symposium on Computer Architecture (ISCA 2000), June 2000, Vancouver, BC, Canada:IEEE CS, 2000. 83-94
    [242] Freescale Semiconductor Inc. MPC7447A RISC Microprocessor Hardware Specifications. Technical Data, Rev. 3, 08/2005, Chandler, Arizona, USA:Freescale Semiconductor Inc., 2005.
    [243] Biro L L, Jackson D B, Gowan M K. Power Considerations in the Design of the Alpha 21264 Microprocessor. In Proc. of 35th Design Automation Conference (DAC'98), June 1998, San Francisco, California, USA:ACM Press, 1998. 726-731
    
    [244] Matson M, Bailey D, Bell S, et al. Circuit Implementation of a 600MHz Superscalar RISC Microprocessor. In Proc. of International Conference on Computer Design (ICCD'98), Oct. 1998, Austin, TX, USA:IEEE CS, 1998.104-110
