Design and Implementation of High-Availability Management Software for Multi-Node Cluster Systems
Abstract
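Research on highly available computer systems has long been an important topic in computer science and engineering. As commercial services are increasingly delivered over the Internet, this research has become even more important, because the availability of a service system has a major effect on the service provider's commercial interests. At the same time, as the content and scope of services delivered through computer service systems keep expanding, the systems themselves must keep growing in scale, and existing small-scale high-availability systems can no longer meet the high-availability needs of such large systems. Research on scalable high-availability cluster systems is therefore very important.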
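     One of the main topics of this thesis is the key problems, and their solutions, that arise in designing and implementing high-availability management software for multi-node high-availability cluster systems. We first study the relationship between the architectural design of the high-availability management software and system scalability, and analyze and compare two typical architectures: the "peer-peer" architecture and the "structural" architecture. We then study the design of the interface between the high-availability management software and applications, comparing three strategies: the "black box" strategy, the "cluster-aware application" strategy, and the "virtual cluster-aware application" strategy.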
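     The concept of server consolidation has received growing attention in recent years, and a cluster system with a single login point is an architecture well suited to implementing it. The other aim of this thesis is to present and evaluate the design and implementation of the high-availability management software of the Dawning Server Consolidation (DSC) system, which is built on the Dawning2000 cluster system. This software implements the basic functions of high-availability management software for a multi-node cluster system.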
In recent years, research on highly available computer systems has been active. With the rapid increase in commercial services delivered through the Internet, this field has become more important, mainly because the availability of computer services greatly affects the profits of service vendors. Another effect of this trend is that larger-scale highly available computer systems are in great demand. As a result, research on scalable highly available computer clusters has become necessary, and this thesis addresses that issue.
    One major part of this thesis concerns the essential issues in designing HA (high availability) management software for multi-node clusters. We first focus on the relationship between the architecture of HA management software and the scalability of the cluster, and analyze and compare two typical architectures, the "peer-peer architecture" and the "structural architecture". We then turn to the interface between HA management software and applications, and analyze and compare three interface strategies: "black box", "cluster-aware applications", and "virtual cluster-aware applications".
    The other major part of this thesis is the design and implementation of DSC's HA management software. DSC (Dawning Server Consolidation) is a server consolidation system built on the Dawning2000 computer cluster. DSC's HA management software implements the functions essential to HA management software for a multi-node HA cluster.
