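High Availability Management in Cluster Operating Systems

Author: Liu Jianhua (刘建华)
Thesis level: Master's
Discipline: Computer System Architecture
Keywords (translated from Chinese): cluster ; cluster operating system ; component ; high availability ; modeling
Keywords (English): cluster ; high availability ; component ; model
Degree year: 2004
Advisor: Meng Dan (孟丹)
Discipline code: 081201
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology)
Thesis submission date: 2004-05-01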

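Abstract

The strength of cluster systems is good scalability, but as a cluster grows and the number of nodes increases, the overall reliability of the system decreases accordingly. Software that improves cluster availability will therefore become an indispensable part of a cluster operating system. Fault recovery mechanisms are especially important for large-scale systems and long-running applications. In addition, having every subsystem, sub-service, and third-party application in the cluster operating system maintain its own high availability independently adds system complexity, wastes resources at run time, and wastes staff effort and creates difficulty during development and maintenance. For these reasons a cluster operating system needs independent high availability management software that maintains the high availability of the other subsystems and applications.
     The Dawning 4000 (曙光4000) cluster operating system is an integrated, unified cluster middleware system. The high availability management software, the HA trigger, is an important part of this middleware; we call it an important "service" of the cluster operating system. It is one of the shareable services extracted from the earlier cluster system software, and it is responsible for high availability management of small-scale applications and services. The HA trigger is designed around service-based and integrated-component ideas and is implemented as CORBA-based distributed components; it offers good scalability, high availability, and compatibility.
     With the aim of improving the availability of applications and services in cluster systems, and with the Dawning 4000 cluster operating system as its engineering background, this thesis discusses the key problems faced in designing and implementing high availability management software for a cluster operating system, together with their solutions. The thesis first introduces the project background, the goals of high availability research, and basic high availability theory. It then describes the high availability design of the Dawning 4000 cluster operating system and identifies the key problems that high availability management faces within it. Next, it designs and implements the cluster high availability management software, the HA trigger, around these problems. Finally, it builds a quantitative model to analyze the effect of high availability management on the availability of applications and services.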
The advantage of a cluster system is its scalability. But as the cluster grows and the number of nodes increases, the overall dependability of the cluster decreases. High availability software will therefore be an indispensable part of a cluster operating system. Recovery methods are especially important for large-scale systems and long-running applications. In addition, letting every subsystem, sub-service, and third-party application in a cluster operating system maintain its own high availability makes the system complex and wastes system resources at run time and labor during development and maintenance. Independent management software is therefore needed to maintain the high availability of the other subsystems and applications.
    The Dawning 4000 cluster operating system is an integrated middleware system. The HA trigger is an important part of it, and we call it a "service". Extracted from the earlier cluster system software, this service can be shared by different system components. The HA trigger manages the high availability of small-scale applications and services. Its design adopts service-based and component-based ideas, and it is implemented as CORBA-based distributed components. The HA trigger offers high availability, scalability, and compatibility.
    The purpose of this thesis is to increase the availability of applications and services in cluster systems, with the Dawning 4000 cluster operating system as the project context. It discusses the key problems in designing and developing high availability management software for cluster operating systems and the methods that solve them. The thesis first introduces the background, the purpose of the research, and basic high availability theory. It then describes the high availability design of the Dawning 4000 cluster operating system and puts forward the key problems of high availability administration in a cluster operating system. Next, we design and implement the HA trigger around these problems. Finally, we use a mathematical model to analyze quantitatively the effect of the HA trigger on service availability.
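
The abstract's opening claim, that overall reliability falls as the node count grows, can be illustrated with a standard steady-state availability calculation. The sketch below is not the thesis's model. It assumes independent node failures, assumes the system is up only when every node is up, and uses hypothetical MTTF and MTTR figures chosen purely for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(node_availability: float, nodes: int) -> float:
    """Availability of a system that needs all `nodes` independent nodes up."""
    return node_availability ** nodes


if __name__ == "__main__":
    # Hypothetical figures (assumptions, not data from the thesis):
    # each node fails on average every 1000 hours.
    slow_recovery = availability(mttf_hours=1000.0, mttr_hours=10.0)
    fast_recovery = availability(mttf_hours=1000.0, mttr_hours=1.0)

    for nodes in (1, 16, 128, 1024):
        slow = series_availability(slow_recovery, nodes)
        fast = series_availability(fast_recovery, nodes)
        print(f"{nodes:5d} nodes: slow recovery {slow:.4f}, fast recovery {fast:.4f}")
```

Under these assumptions, availability shrinks geometrically with the node count, and shortening the recovery time raises it at every scale, which is consistent with the abstract's emphasis on fault recovery for large systems.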

