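Research on Fault-Tolerance Techniques of Shipborne Distributed Component-Based Systems

Original title: 舰载分布式构件系统的容错技术研究
Author: Chen Yunlin (陈昀林)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: component-based systems; fault tolerance; rollback recovery; distributed storage
Degree year: 2011
Advisor: Cao Wanhua (曹万华)
Discipline code: 081202
Degree-granting institution: China Ship Research and Development Academy (中国舰船研究院)
Thesis submission date: 2011-04-01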

Abstract

The shipborne combat command system is the core of a shipborne combat system and a typical distributed real-time embedded application. It faces complex underlying computing platforms, diverse system functions, and changing user requirements. As combat requirements evolve, the software of the shipborne combat command system keeps growing, and portability, software reusability, and ease of integration become increasingly important. Traditional structured software development methods are hard to adapt to the development of new-generation shipborne combat command systems, and component-based software development (CBSD) is an effective way to address these problems. Building redundancy and fault tolerance into component development is one way to ensure system reliability. In the traditional approach, the component developer writes dedicated fault-tolerance management code for the required redundancy and fault-tolerance control scheme, which increases the development workload and reduces component reusability. Research on fault-tolerance techniques suited to component systems is therefore worthwhile.
     This work is part of the "Eleventh Five-Year Plan" national defense advance research project "Research on Service Integration Technology for Sea-Battlefield Integrated Electronic Information Systems". It studies fault-tolerance techniques for shipborne distributed component systems and their implementation and, following the project's specific development requirements, designs and implements a fault-tolerance module for such systems.
     The main work of this thesis is as follows:
     (1) Drawing on the project's research content and background, the state of the art and development trends of fault-tolerance technology in China and abroad are analyzed, and component-based shipborne command systems and their characteristics are discussed.
     (2) The overall design of the fault-tolerance module for shipborne distributed component systems is completed and each of its submodules is implemented. The design and implementation take full account of the real-time and reliability requirements of the shipborne computing environment, and address that environment's availability requirements while ensuring reliability.
     (3) A checkpoint-based backward (rollback) recovery mechanism suited to distributed component systems is presented. Tailored to the system's application environment, the mechanism simplifies the failure-detection and error-diagnosis submodules and separates the storage submodule from the system, which reduces runtime overhead and makes the mechanism suitable for embedded platforms with limited resources.
     (4) Experimental results for the fault-tolerance module are given; its basic functions, error recovery time, and checkpoint storage time were tested. The results show that the module achieves good error recovery time and checkpoint saving speed.
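
The checkpoint-based rollback recovery described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `CheckpointStore`, `Counter`, and `run_with_recovery` are hypothetical, the store is a plain in-memory dictionary standing in for a separate storage submodule, and failure detection is reduced to catching an exception.

```python
import copy


class CheckpointStore:
    """Hypothetical stand-in for a storage submodule kept apart from the component."""

    def __init__(self):
        self._data = {}

    def save(self, component_id, state):
        # Deep-copy so later changes to the live state cannot alter the checkpoint.
        self._data[component_id] = copy.deepcopy(state)

    def load(self, component_id):
        return copy.deepcopy(self._data[component_id])


class Counter:
    """Toy stateful component (an assumption for illustration only)."""

    def __init__(self, component_id, store):
        self.component_id = component_id
        self.store = store
        self.state = {"processed": 0, "total": 0}

    def handle(self, value):
        self.state["processed"] += 1
        if value < 0:
            # Simulated fault: the state is now half-updated and inconsistent.
            raise ValueError("negative input")
        self.state["total"] += value

    def checkpoint(self):
        self.store.save(self.component_id, self.state)

    def rollback(self):
        # Backward recovery: discard the live state, restore the last checkpoint.
        self.state = self.store.load(self.component_id)


def run_with_recovery(component, inputs, checkpoint_every=2):
    """Process inputs; on a fault, roll back and skip the offending input."""
    component.checkpoint()  # initial checkpoint so a rollback is always possible
    recoveries = 0
    for i, value in enumerate(inputs, start=1):
        try:
            component.handle(value)
        except ValueError:
            component.rollback()
            recoveries += 1
            continue
        if i % checkpoint_every == 0:
            component.checkpoint()
    return recoveries


store = CheckpointStore()
counter = Counter("c1", store)
recovered = run_with_recovery(counter, [1, 2, 3, -1, 5], checkpoint_every=2)
print(recovered, counter.state)
```

With these inputs a single rollback occurs, and the value 3, processed after the last checkpoint, is lost along with the faulty update. This is the usual trade-off in backward recovery: a shorter checkpoint interval loses less work but spends more time saving state.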

