Research on Fault-Tolerant Middleware Technology for Cluster Systems

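Author: Huang Wei (黄伟)
Degree level: Doctoral
Discipline: Computer Application Technology
Chinese keywords: 机群 (cluster) ; 容错 (fault tolerance) ; 中间件框架 (middleware framework) ; 分区机制 (partition mechanism) ; 组服务 (group service) ; 相关失效 (correlated failure) ; 随机回报网 (stochastic reward net)
English keywords: cluster ; fault-tolerant ; middleware framework ; partition ; group service ; correlated failure ; stochastic reward Petri net
Degree year: 2005
Advisor: Fan Jianping (樊建平)
Discipline code: 081203
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology)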

Abstract

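In high-performance computing research, ensuring system availability and application reliability has always been one of the primary concerns. Thanks to their high cost-effectiveness and scalability, clusters have become a mainstream way to build high-performance computers, and the loosely coupled structure among nodes also makes it easier for a cluster system to maintain availability. As cluster systems grow in scale, however, new problems arise, such as more frequent component failures and limits on the scalability of the software architecture. These problems pose further challenges to ensuring system availability. Cluster fault-tolerant middleware combines cluster, fault-tolerance, and middleware technologies; it is an approach implemented at the cluster system software layer that can ensure both system availability and application reliability.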
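     Based on the design and implementation of the high-availability kernel of Phoenix, the cluster operating system of the Dawning4000A system, this dissertation explores the key technologies of cluster fault-tolerant middleware, focusing on: 1) a framework and architecture for fault-tolerant middleware suited to large-scale cluster systems; 2) fault-tolerance mechanisms within the middleware that are suited to large-scale clusters; 3) evaluation of cluster system availability and application reliability when fault-tolerant middleware is used. The research results of this dissertation are as follows: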
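     1. After identifying and analyzing the new challenges that growing cluster scale poses to system availability, the dissertation proposes DCFT-Kernel, a fault-tolerant middleware framework for large-scale cluster systems. The framework adopts partitioned management and an architecture that combines "peer-style" and "structured" organization. Compared with current cluster high-availability software, it effectively addresses the problems that large scale brings to system scalability, software architecture scalability, and the scalability of the fault-tolerance mechanism. DCFT-Kernel consists of a group service, a fault management service, a configuration service, an event service, and a user interface, and provides complete error detection, error recovery, and error notification.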
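     2. After analyzing the theoretical problems that must be solved to apply fault-tolerance techniques to cluster systems, the dissertation proposes group service technology as the key technique for implementing the core fault-tolerance mechanism of cluster fault-tolerant middleware. The middleware can only work if it is itself highly reliable; by using a group structure and a membership protocol, the group service guarantees strict consistency and high reliability of the middleware at run time. The cluster fault-tolerance mechanism built on the group service takes the characteristics of cluster systems and parallel applications fully into account, provides hierarchical fault detection and handling, and can effectively handle most system and application faults.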
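     3. A working cluster fault-tolerant middleware system, DCFTM, was implemented on the Dawning4000A system. DCFTM sits at the core of the cluster operating system, provides high-availability support for the operating system's service components, and can also expose a programming interface directly to upper-layer applications to support their fault-tolerant execution. Performance analysis of DCFTM in actual operation shows that: 1) DCFTM keeps the various services of the cluster operating system highly available, responds quickly when handling faults, detects and repairs faults promptly, and issues notifications of these events. 2) DCFTM incurs very little system overhead; as long as the heartbeat interval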
Keeping systems highly available and applications reliable has long been one of the most important concerns in high-performance computing research. The cluster is the mainstream architecture for high-performance computing because of its low cost and good scalability, and the loose coupling between nodes makes high availability easier to achieve in a cluster than in a centralized system. But as cluster systems grow larger, new problems arise, such as more frequent failures of system components and limits on the scalability of the software architecture. These problems bring new challenges to the design and implementation of highly available system software. Cluster fault-tolerant middleware combines fault-tolerance, middleware, and cluster technologies to implement integrated system software for fault tolerance in cluster systems. It is a new approach to keeping cluster systems available and cluster applications reliable at low cost and with high scalability.
    Drawing on the design and implementation of the fault-tolerant kernel of the cluster operating system in Dawning4000A, this dissertation discusses in depth the key issues of cluster fault-tolerant middleware, focusing on (1) a scalable fault-tolerant middleware framework for large-scale cluster systems; (2) adaptive and reliable fault-tolerance mechanisms for large-scale cluster systems; and (3) evaluating the effect of cluster fault-tolerant middleware by modeling and analyzing system availability and application reliability. The contributions of this dissertation include:
    1. Current high-availability system software cannot meet the demands for scalability and performance when the system scale becomes very large. To solve this problem, this dissertation proposes a new fault-tolerant middleware framework named DCFT-Kernel. DCFT-Kernel adopts partitioning and a hybrid architecture that combines master/slave and peer-to-peer organization to resolve the scalability problems of the system, the software architecture, and the fault-tolerance mechanism. Furthermore, DCFT-Kernel consists of a group service, an event service, a configuration service, and programming APIs, which enable it to provide integrated fault-tolerance functions for error detection, error recovery, and error notification.
    2. Implementing fault-tolerant services in an asynchronous distributed system requires taking the fundamental consensus problem into account. In addition, fault-tolerant middleware can only work if it is itself reliable. The group service, which integrates a group of cooperating processes to provide common fault-tolerant services, is proposed in Chapter 4 to solve these two problems. Through a group membership protocol and reliable multicast

