Research on a Fault-Tolerant Master-Slave Coupling Distributed Parallel Processing System Architecture for Instruments

Author: Liu Feng (刘峰)
Degree level: Doctoral
Discipline: Biomedical Engineering and Instrumentation
Keywords: fault-tolerant master-slave coupling distributed architecture; architecture fusion; Distributed Active Mailbox Messaging; message logging and message passing; fault tolerance and system recovery; system simulation
Degree year: 2003
Advisor: Ge Jiguang (葛霁光)
Discipline code: 0831
Degree-granting institution: Zhejiang University
Submission date: 2003-06-01

Abstract

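The material foundation of the information society is the technology of information acquisition, processing, display, storage, transmission, and interaction, and instrument systems are among its most important components. Modern medical and scientific instrument technology inherits and extends traditional instrument technology: on the basis of information acquisition, processing, and control, it seeks to understand the intrinsic, essential laws, functions, and structure of the object under study, then to support human-machine interaction, and ultimately to put that understanding to effective use. Compared with traditional systems, modern instrument systems place greater emphasis on information-processing capability. Information acquisition requires real-time, dynamic observation of multi-parameter, multi-level information about the object; information processing requires fast, real-time, efficient, and high-quality processing of the acquired information. The tight combination of the two enables modern instrument systems to accomplish information fusion and system feature modeling. To build modern medical and scientific instrument systems with a high performance-to-price ratio, this dissertation studies in some depth a fault-tolerant master-slave coupled distributed parallel processing system architecture for modern instruments.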
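     This dissertation outlines the development, current state, and future direction of information-based instrument systems, and reviews the main research on master-slave coupled and distributed parallel processing system architectures.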
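     It first discusses the methodology for studying and building system architectures, and proposes key principles suited to the design and construction of modern instrument systems, together with a life-cycle model for systematic design and construction. Building on a study of master-slave coupled and general-purpose distributed parallel processing architectures, and guided by the key characteristics of modern information-based instrument systems, it proposes a novel, moderately centralized, fault-tolerant master-slave coupled distributed parallel processing system architecture based on the fusion of the communication architecture and the fault-tolerance architecture.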
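     Designing and building a master-slave coupled distributed parallel processing system is a large systems-engineering undertaking, characterized by high investment, long development cycles, a broad range of technical fields, and high complexity. Guided by the methodology for studying and building system architectures, and based on Petri net theory and discrete-event simulation, this dissertation therefore builds a modular, hierarchical performance evaluation system. The evaluation system supports layered performance modeling, simulation, and bottleneck analysis, yields the system's performance characteristics under different workloads, and provides quantitative performance metrics and guidance for designing and implementing the instrument system architecture.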
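As a rough illustration of the Petri net modeling mentioned above, the following minimal sketch plays the basic token game: a transition fires only when each of its input places holds enough tokens. The net itself (requests competing for a single shared bus) and all names in it (`waiting`, `bus_free`, `START`, `FINISH`) are hypothetical and are not taken from the dissertation.

```python
def enabled(marking, pre):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= need for place, need in pre.items())


def fire(marking, pre, post):
    """Return the new marking after firing: consume input tokens, produce output tokens."""
    if not enabled(marking, pre):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, need in pre.items():
        new_marking[place] = new_marking.get(place, 0) - need
    for place, gain in post.items():
        new_marking[place] = new_marking.get(place, 0) + gain
    return new_marking


# Hypothetical net: each transition is (input places, output places).
START = ({"waiting": 1, "bus_free": 1}, {"transferring": 1})
FINISH = ({"transferring": 1}, {"bus_free": 1, "done": 1})

if __name__ == "__main__":
    m0 = {"waiting": 2, "bus_free": 1}
    m1 = fire(m0, *START)          # one request takes the bus
    print(m1, enabled(m1, START[0]))
    m2 = fire(m1, *FINISH)         # the bus is released
    print(m2, enabled(m2, START[0]))
```

While one transfer holds the only `bus_free` token, the second request cannot start. Contention of this kind is what a bottleneck analysis would look for.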
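     The design and construction of the communication architecture is the focus of this dissertation. A master-slave coupled parallel processing system is a collection of two or more processing elements (Process Elements, PEs) that communicate with one another to cooperatively solve a given complex causal problem through computation and modeling. Efficient integration of the communication architecture with the processing elements is the basis for solving all problems in such a system.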
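     Parallelism in processing improves the performance of a master-slave coupled parallel system, but so does the temporal and spatial locality of processing. Balancing the two, that is, choosing the degree of parallelism, is an important issue in building such a system. Fine-grained processing, which offers more parallelism, needs strong support from a high-performance synchronization model and the mechanisms it provides between processing elements. This dissertation proposes an efficient hardware synchronization scheme, the fully dynamic barrier synchronization model, and gives the corresponding programming primitives.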
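The abstract does not spell out the barrier primitives, so the following is only a software analogy of barrier synchronization, using Python's standard `threading.Barrier`. It is not the hardware scheme proposed in the dissertation. The function name `run_phases` and the two-phase workload are assumptions made for illustration.

```python
import threading


def run_phases(num_workers=4):
    """Each worker writes a partial result, waits at the barrier, then reads all partials."""
    barrier = threading.Barrier(num_workers)
    partial = [0] * num_workers
    totals = [0] * num_workers

    def worker(rank):
        partial[rank] = (rank + 1) ** 2   # phase 1: local computation
        barrier.wait()                    # nobody proceeds until every worker has arrived
        totals[rank] = sum(partial)       # phase 2: every partial is now safe to read

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


if __name__ == "__main__":
    print(run_phases())
```

Without the `barrier.wait()` call, a fast worker could sum `partial` before slower workers had written their entries. The finer the grain of each phase, the more the cost of this wait dominates.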
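     In the proposed master-slave coupled distributed parallel processing architecture, moderately centralized parallel processing nodes are interconnected through a multi-channel shared-bus topology to form the high-performance critical processing stage. The shared parallel bus is a key resource for high-performance communication between processing nodes. To improve its utilization, this dissertation proposes an arbitration scheme based on time priority with an arbitration transaction buffering mechanism.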
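One possible reading of "time priority with a transaction buffer" is that requests are queued with their arrival time and the bus goes to the request that has waited longest. The sketch below implements only that reading, which is an assumption; the class name `TimePriorityArbiter`, its methods, and the tie-breaking rule (lower node number wins) are hypothetical and not taken from the dissertation.

```python
import heapq


class TimePriorityArbiter:
    """Buffers bus requests and grants the bus to the oldest pending one."""

    def __init__(self):
        self._pending = []  # heap of (arrival_time, node_id)

    def request(self, arrival_time, node_id):
        """Buffer a request instead of rejecting it while the bus is busy."""
        heapq.heappush(self._pending, (arrival_time, node_id))

    def grant(self):
        """Return the node that wins the bus, or None when nothing is pending."""
        if not self._pending:
            return None
        _, node_id = heapq.heappop(self._pending)
        return node_id


if __name__ == "__main__":
    arbiter = TimePriorityArbiter()
    arbiter.request(5, 1)
    arbiter.request(2, 2)
    arbiter.request(2, 0)
    print([arbiter.grant() for _ in range(4)])
```

Ordering by arrival time keeps any node from being starved by a busier neighbor, which a fixed-priority arbiter cannot guarantee.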
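     Master-slave coupled parallel processing systems for modern instruments need more flexible, higher-performance inter-element communication services. To this end, this dissertation studies and builds the Distributed Active Mailbox Messaging (DAMM) communication subsystem. DAMM provides the necessary abstraction of the high-performance communication network functions of the master-slave coupled parallel system and reduces the number of protocol processing layers, so that users can directly exploit the performance characteristics of the communication network, which meets their need for real-time event handling. DAMM also manages shared core resources (interrupts, time, protocols, buffers, and so on), which raises the abstraction level of the communication service primitives and simplifies both application programming and kernel implementation.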
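The abstract gives no interface for DAMM, so the following is a generic sketch of the "active mailbox" idea rather than DAMM itself: a message whose kind has a registered handler is processed on arrival, and anything else is queued for later pickup. Every name here (`ActiveMailbox`, `register`, `deliver`, `receive`, `log`) is hypothetical.

```python
from collections import deque


class ActiveMailbox:
    """Mailbox that runs a handler on delivery when one is registered."""

    def __init__(self):
        self._handlers = {}       # message kind -> callable
        self._pending = deque()   # messages that had no handler on arrival
        self.log = []             # every delivered message, in arrival order

    def register(self, kind, handler):
        self._handlers[kind] = handler

    def deliver(self, kind, payload):
        """Log the message, then handle it immediately or queue it. Returns True if handled."""
        self.log.append((kind, payload))
        handler = self._handlers.get(kind)
        if handler is not None:
            handler(payload)
            return True
        self._pending.append((kind, payload))
        return False

    def receive(self):
        """Pop the oldest queued message, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


if __name__ == "__main__":
    samples = []
    box = ActiveMailbox()
    box.register("sample", samples.append)
    box.deliver("sample", 42)      # handled on arrival
    box.deliver("status", "ok")    # queued, no handler registered
    print(samples, box.receive(), box.log)
```

Running the handler at delivery time avoids a polling layer for time-critical events, and the delivery log could later feed a message-replay recovery step.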
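     Operational reliability requires that a master-slave coupled distributed parallel processing system have dependable fault tolerance. A complex master-slave coupled parallel system raises problems of master-slave coupled and moderately distributed fault-tolerance management, and modern instrument systems are characterized by large data volumes and complex processing models. Together these require the system to ensure its own reliability with as few redundant resources as possible. In principle, a local fault must not be allowed to cause failure of the whole system. The problems to be handled therefore center on mechanisms for detecting system failures and fault events (hardware and software, local and system-wide), periodic saving of system state, and reasonably complete recovery of system state, all of which are research priorities. The key is the automatic master-slave switchover mechanism.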
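     To this end, this dissertation proposes a hierarchical (hardware and software layers), multi-mechanism (state-space monitoring and timeout monitoring) scheme for detecting system errors and failures. Building on integration with the communication system, it also proposes a checkpointing scheme that combines lightweight and heavyweight checkpoints for periodic saving of system state, together with a recovery scheme.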
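How the lightweight and heavyweight checkpoints are combined is not described in the abstract. The sketch below shows one common pairing, assumed here purely for illustration: an occasional full copy of the state (heavyweight) plus a log of the messages handled since that copy (lightweight), with recovery done by restoring the copy and replaying the log. The class `CheckpointedProcess` and its toy state are hypothetical.

```python
import copy


class CheckpointedProcess:
    """Toy process whose state can be rebuilt from a full checkpoint plus a message log."""

    def __init__(self):
        self.state = {"count": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)  # heavyweight: full state copy
        self._log = []                                # lightweight: messages since the copy

    def _apply(self, value):
        self.state["count"] += 1
        self.state["total"] += value

    def handle(self, value):
        """Log the message first, then apply it, so a replay can redo the work."""
        self._log.append(value)
        self._apply(value)

    def checkpoint(self):
        """Take a full checkpoint; the log can then be discarded."""
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        """Restore the last full checkpoint and replay the logged messages in order."""
        self.state = copy.deepcopy(self._checkpoint)
        for value in self._log:
            self._apply(value)


if __name__ == "__main__":
    p = CheckpointedProcess()
    p.handle(1)
    p.handle(2)
    p.checkpoint()
    p.handle(3)
    p.state = {}      # simulate losing the in-memory state
    p.recover()
    print(p.state)
```

Replay reproduces the lost state only if message handling is deterministic, which is the usual precondition for recovery based on message logging.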
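     Research on a fault-tolerant master-slave coupled distributed parallel processing architecture for modern instruments covers a broad range of complex issues, and the architecture is difficult to build. The dissertation therefore studies a simulation system and its integration, so that in-depth and extensive experiments can be carried out on the simulation system before the real system is built, and it describes the overall structure of the simulation system. It concludes with directions for future research.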

English Abstract

The foundation of the information society rests on technologies for information acquisition, processing, display, storage, communication, and interaction, and instrument systems occupy one of the most important positions among them. Built on information acquisition, processing, and control, modern medical and scientific instrument technology has evolved from traditional instrument technology; it helps us recognize the intrinsic and essential regularities of a target object, interact with it, and eventually make use of it. Modern instrument systems place much more emphasis on information-processing capability than traditional systems. Information acquisition must detect multi-parameter, multi-level information about the target object dynamically and in real time, while information processing must handle the acquired information in real time with high efficiency and quality. By combining these two phases efficiently, a modern instrument system can accomplish information fusion and system modeling. To construct modern medical and scientific instrument systems with a high performance-to-price ratio, this dissertation investigates in some depth a fault-tolerant master-slave coupling distributed parallel processing system architecture for modern instruments.
    This dissertation summarizes the evolution, state of the art, and direction of information-based instruments, and reviews the main research on master-slave coupling and distributed parallel processing system architectures.
    First, this dissertation discusses the methodology for researching, designing, and constructing system architectures, and presents fundamental principles suited to designing modern instrument systems along with an evolutionary life cycle for systematic design. Building on a study of master-slave coupling and generic distributed parallel processing architectures, and in view of the important characteristics of modern information systems, it proposes a novel, properly centralized, fault-tolerant master-slave coupling distributed parallel processing system architecture.
    Designing and constructing a master-slave coupling distributed parallel processing system is a large systems-engineering project characterized by high investment, long duration, a broad range of techniques, and high complexity. For these reasons, and following the methodology for researching and constructing system architectures, this dissertation uses Petri net theory and discrete-event simulation to construct a modular, layered performance evaluation system that models system performance, analyzes bottlenecks at different levels, and determines performance characteristics under different system loads. The results of the discrete-event simulation provide important quantitative performance indices and guidance for designing and implementing the instrument system architecture.
    Research on designing and constructing the communication architecture is at the core of this dissertation. A master-slave coupling distributed parallel processing system is a set of two or more Process Elements (PEs) that communicate with each other to solve a specific complex causal problem through computing, modeling, and processing. Efficient cooperation between the communication architecture and the Process Elements is the foundation for solving all problems in a master-slave coupling distributed parallel processing system.
    Parallelism of processing improves the performance of a master-slave coupling parallel processing system, and so does the locality of processing in time and space. How to balance these factors when constructing such a system is therefore a crucial problem, referred to as the degree of parallelism. Fine-grained processing with greater parallelism depends on high-performance inter-PE synchronization m

引文

[1] ACM, Special Section on Digital Multimedia System. Communication of the ACM, Vol.32, No.7, 1989.
    [2] Anderson T. E, Bershad B. N. , Lazowska E. D., and Levy H. M. Scheduler activation: Effective kernel support for the user-level management of parallelism. In proceedings of the 13rd Workshop on Workstation Operating Systems, pages 92-94. IEEE Computer Society, IEEE Computer Society Press, April 1992.
    [3] Assenmacher H., Breitbach T., Buhler E, Huebsch V., and Schwarz R.. The PANDA system architecture - a pico-kernel approach, In Proceedings of the 4th Workshop on Future Trends of Distributed Computing Systems, pages 470-476. IEEE Computer Society Press, September 1993.
    [4] Peatman J B. Microcomputer-based design. McGraw-Hill Company, 1977
    [5] 葛霁光，沈公羽．多微处理器系统的设计方法．浙江大学学报，1987，21(1)：6-12
    [6] 刘谋用，吴越，葛霁光．现代仪器用实时分布式操作系统．计算机学报，1999，22(6)：608-614
    [7] Liu F and Ge JG, Multi-parameters multi-levels information-fusion architecture in modern biomedical instrument system. IEEE-EMBS Asia Pacific Conf. on Biomedical Engineering - Proceedings., PTS 1 & 2, 813-814, 2000
    [8] 庄天戈，计算机在生物医学中的应用，东南大学出版社，1991
    [9] 陈光禹，VXI总线测试平台技术，电子科技大学出版社，1996
    [10] Hwang K, Xu Z W. Scalable parallel computers for real-time signal processing, IEEE Signal Processing. 1996, 13 (4): 50-66
    [11] Alan D G, Lois W H. Microprocessor-based parallel architecture for reliable digital signal processing system. CRC Press, 1995
    [12] Factor M, et al. Real-Time Data Fusion In the Intensive Care Unit, Computer, 24(11): 45--53, 1991
    [13] Thakkar S S, Dubois M, Sohi G S. New directions in scalable shared-memory multiprocessor architectures. Computer, 1990, 23 (6): 71-74
    [14] G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, Redwood, CA, 1989
    [15] 侯勤业，张箐，分布式嵌入式实时操作系统QNX，宇航出版社，1999
    [16] [Tabak90] Tabak D. Multiprocessors. Prentice-Hall, 1990
    [17] [Bell97] Bell G, Gray J N. The revolution yet to happen, beyond calculation: the next fifty years of computing. Spring-Verlag, 1997
    [18] [王鼎兴95]王鼎兴，庄伟强．高性能计算的核心技术——并行处理．模式识别与人工智能，1995(8)：32-47
    [19] Slater M. The microprocessor today. IEEE Micro, 1996, 16 (6): 32-45
    [20] [Perterson94] Perterson D D. To Multiprocess or Not to Multiprocess, EDN 1994 (6) : 64-71
    [21] [Culler98] Culler D J, Singh J P, Gupta A. Parallel computer architecture: A hardware/software approach. Morgan Kaufman, 1998


    [22] [Adve93] Adve S V, Hill M D. A unfied formalization of four shared-memory models, IEEE Transaction on Parallel and Distributed Systems, 1993, 4 (6) : 13-24
    [23] [窦勇 94] 窦勇，周兴铭，分布式的共享主存多机系统综述，计算机工程 . 1994,20(5) : 37-43
    [24] [孙自余 98] 孙自余.细解多处理器技术.中国计算机杂志 , 1998 (6) : 43-44
    [25] [Gordon94] Gordon B . Scalable , parallel computers : alternatives , issues , and challenges. International Journal of Parallel Programming, 1994, 22 (1) : 3-44
    [26] [Nitzberg91] Nitzberg B, Virginia J. Distributed shared memory: a survey of issues and algorithms. Computer, 1991, 24 (8) : 52-60
    [27] [Gajski85] Gajski D D, Peir J K. Essential issues in multiprocessor systems. Computer, 1985, 18 (6) : 9-27
    [28] [Yew91] Yew P. C, Wah B. W, (Eds.). Special issue on shared-memory multiprocessors, Journal of Parallel and Distributed Computing, 1991, 12 (6) : 87-92．
    [29] [Hwang98] Hwang K, Xu Z. W. Scalable, Parallel, Computing Technology, Architecture, Programming. McGraw-Hill, 1998
    [30] David E. Culler, J. P. Singh and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Second Edition. Morgan Kaufmann Publishers, Inc. 1999
    [31] [Lenoski95] Lenoski D E, Weber WD. Scalable shared-memory multiprocessing. Morgan Kaufmann, 1995
    [32] Stenstrom P., A survey of Cache coherence schemes for multiprocessing. Computer, 6: 12-24, 1990
    [33] IEEE standard for Futurebus+-logical layer and profile specification, IEEE Std 896． 2-1991, 24 April 1992
    [34] 熊煦，计算机数字总线标准手册，北京希望电脑公司 ,1994
    [35] Borrill PL., Micro-standards special feature: A comparison of 32-bit buses, IEEE Micro, 5(6) : 71-79, 1985
    [36] Hennessy J.L. et al., Computer Architecture a Qualitative Approach, Morgan Kaufman, 1996
    [37] Thakkar S, Gifford F P. The Balance multiprocessor system, IEEE Micro 1988, 8 (2) : 57-69
    [38] [Lovett88] Lovett. T, Shreekant S T. The Symmetry multiprocessor system. Proceeding of the International Conference on Parallel Processing, 1988, 12 (11) : 303-310
    [39] [Intel-XA-MP91] Intel-XA-MP Architecture Specification ver3． 0． 1991
    [40] [Daniel92a] Lenoski D, Laudon J, Joe T. Nakahira D, Stevens L, Gupta A, Hennessy J. The DASH prototype: implementation and performance. 19 Annual International Symposium on Computer Architecture, May 1992: 92-103
    [41] [Jeffrery98] Jeffrery S K. The Stanford FLASH multiprocessor. Proceedings of the 25th International Symposium on Computers Architecture, 1998: 95-98
    [42] [Laudon97] Laudon J, Lenoski D. The SGI Origin: A ccNUMA Highly Scalable Server, Proceedings of the 24th Annual Symposium on Computers Architecture, 1997: 241-251
    [43] [Bisiani90] Bisiani R , Ravishankar M . PLUS : A Distributed shared-memory system. Proceedings of 17th International Symposium on Computer Architecture, 1990: 115-124
    [44] [Agerwala95] AgerwalaT, Martin J L, MirzaJL, Sadler DC, Dias D M, Snir M. SP2 System Architecture. IBM System Journal, 1995, 34 (2) : 263-272


    [45] [李国杰 94] 李国杰，陈国安，樊建平，刘金水．曙光机一号并行计算机．计算机学报，1994，17(12)：882-889
    [46] [韩承德 95] 韩承德，薛一波．BJ—1并行计算机的设计与实现．计算机学报，1995，18(12)：901-907
    [47] [Guibaly89] Guibaly F E. Design and Analysis of Arbitration Protocols, IEEE Transaction on Computers, 1989, 38 (2): 161-171
    [48] [Lee97] Lee C S. High-Fair Arbiter for Multiprocessor. IEICE Transaction on Information and System, 1997 (1): 94-97
    [49] [Farber81 ] Farber G. A Decentralized Fair Bus-Arbiter, Microprocessor and Microsystems, 1981 (7): 32-36
    [50] [Taub84] Taub D M. Arbitration and Control Acquisition in the Proposed IEEE 896 Futurebus. IEEE Micro, 1984, 4 (8): 28-41
    [51] [Mcbryan94] Mcbryan O. An overview of message passing environments. Parallel Computing, 1994, 20 (4): 417-444
    [52] [张亮97] 张亮，胡波等，一种实时多处理机系统网络设计．计算机学报，1997，20(5)：391—395
    [53] [China96] China C. Y., Hwang K. Packet switching networks four multiprocessors and dataflowcomputers. Computer, 1996, 29 (12): 57-64
    [54] Hwang K．，高等计算机系统结构：并行性可扩展性可编程性，清华大学出版社，1995
    [55] Raghuramireddy D. and Unbehauen R., A new realization for multiprocessor implementation of 2-D denominator-separable digital filters for real-time processing, IEEE Transactions on Signal Processing, 40(9):2349-2353 Sep 1992
    [56] Fink G.A. Jungclaus N. Ritter, H. and Sagerer, G.,A communication framework for heterogeneous distributed pattern analysis, Algorithms and Architectures for Parallel Processing, IEEE First Int'l Conf. On, (ICAPP 95, Australia) vol. 2: 881-890, Apr 1995
    [57] Newburn C.J. and Shen J.P.,Automatic partitioning of signal processing programs for symmetric multiprocessors, Parallel Architectures and Compilation Techniques, Proc. of the Conf. On, pp:269-280, Oct 1996
    [58] Jyh-Biau Chang Tsai Y.J. Shieh C.K. and Chung P.C., An efficient thread architecture for a distributed shared memory on symmetric multiprocessor clusters, Parallel and Distributed Systems, Proc., Int'l Conf. On, pp:816-823, Dec 1998
    [59] Muir S. and Smith J., AsyMOS-an asymmetric multiprocessor operating system, IEEE Open Architectures and Network Programming, pp:25-34, Apr. 1998
    [60] Greenberg A.G. and Wright P.E., Design and analysis of master/slave multiprocessors, IEEE Transactions on Computers, Vol. 40(8):963-976, Aug. 1991
    [61] Sahni S., Scheduling master-slave multiprocessor systems, IEEE Transactions on Computers, Vol.45(10): 1195-1199, Oct 1996
    [62] Kuo-Chan Huang and Feng-Jian Wang, Design patterns for parallel computations of master-slave model, Information, Communications and Signal Processing,. (ICICS. 1997), Proceedings of Int'l Conf. On, vol.3:1508-1512, Sep 1997
    [63] Andrew S. Tanenbaum, Distributed Operation Systems, Prentice-Hall, 1995
    [64] Wu Jie, Distributed System Design, CRC Press LLC, 1999


    [65] LeLann G., Motivation, Objectives, and Characterization of Distributed Systems, in Distributed Systems-Architecture and Implementation, Lampson, L., et al., eds., LNCS 105, Springer Verlag, 1981
    [66] Bovven i. P. and T. 1． Gleeson, Distributed Operating Systems, in Distributed Computer Systems, H. S. M. Zedan, ed., Butterworths, 1990
    [67] [Zaleswski95] Zaleswski ], (Eds. ). Advanced Multiprocessor Bus Architecture, IEEE Computer Society Press, 1995
    [68] Carriero N and D.Gelernter, How to Write Parallel Programms: A First Course, MIT Press, 1990
    [69] Lan Y. A. Esfahanian, and L. M. Ni, Multicast in Hypercube Multiprocessors, Journal of Parallel and Distributed Computing, 8(1) :30-41, Jan. 1990
    [70] Bal H. E., J. G. Steiner and A. S. Tanenbaum, Programming Languages for Distributed Computing Systems, ACM Computing Surveys, 21(3) :262-322, Sep. 1989
    [71] Burns A., Programming in Occam 2, Addison-Wesley Publishing Company, 1988
    [72] Hansen P. B., Distributed Processes: A Concurrent Programming Concept, Communication of the ACM,pp:934-941,Nov. 1978
    [73] Gehani N. H., Broadcasting Sequential Processes(BSP), IEEE Transaction on Software Engineering, 10(4) :343-351, Jul. 1984
    [74] Birrell A. and B. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, 2(l):39-59, Feb. 1984
    [75] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May, 1994
    [76] Message Passing Interface Forum. MPI-2: Extension to the Message-Passing Interface, Jul, 1997
    [77] A. Geist et. al. PVM3 User's Guide and Reference Manual, Sep. 1994
    [78] T. von Eichen, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A Mechanism for Integrated Communication and Computation, In Proc. ISCA'92, Gold Coast, Australia, May 1992
    [79] Arbenz P., et al., SCIDDLE: A Tool for Large Scale Distributed Computing, Concurrency: Practice and Experience, 7(2) : 121-146
    [80] Liskov B., and R. Scheifler, Promise: Linguistic Support for Efficient Asynchronous Procedure Calls in Distributed System, Proc. Of the ACM SIGPLAN'88 Conf. On Programming Language Design and Implementation, pp:260-267, 1988
    [81] Bershad B. N., T. E. Anderson, E. D. Lazowska, and H. M. Levy, Lightweight Remote Procedure Call, ACM Transactions on Computer Systems, 8(l):37-55, 1990
    [82] Sinha P. K., Distributed Operating Systems: Concepts and Design, IEEE Computer Society Press, 1997
    [83] Stamos J., and D. Gifford, Implementing Remote Evaluation, IEEE Transaction on Sofrware Engineering, 16(7) :710-722, Jul. 1990
    [84] Wu J., V. Rancov, and A. Stoichev, Parallel Computations on LAN-connected Workstations, Proc. Of the 1996 Int'l Conf. On Parallel and Distributed Computing and Systems, pp: 193-197, 1996
    [85] H. F. Jordan, A Special Purpose Architecture for Finite Element Analysis, Proc. Of 1978 Int'l Conf. On Parallel Processing, 1978


    [86] M. T. O'Keefe and H. G. Dietz, Hardware barrier synchronization: static barrier MIMD (SBM), Proc. of 1990 Int'l Conf. on Parallel Processing, St. Charles, IL, pp. I 35-42, August 1990
    [87] W. E. Cohen, H. G. Dietz, and J. B. Sponaugle, Dynamic Barrier Architecture For Multi-Mode Fine-Grain Parallelism Using Conventional Processors; Part Ⅰ: Barrier Architecture, Int'l Conf. on Parallel Processing, 1994
    [88] H. G. Dietz, T. Schwederski, M. T. O'Keefe, and A. Zaafrani, Extending Static Synchronization Beyond VLIW," Supercomputing 89, pp. 416-425, Reno, NV, Nov. 1989
    [89] R. Hoare, Processor Synchronization Using A Hardware Synchronization Matrix, in Proc. Of the IASTED Int'l Conf. On Parallel and Distributed Computing and Systems, Cambridge, MA. Nov. 1999
    [90] G. Almasi and A. Gottlieb, Highly Parallel Computing, Second Edition. Redwood City, CA: The Benjamin/Cummings Publishing Company, Inc., 1994
    [91] R. Hoare, ClusterNet: An Object-Oriented Cluster Network, Int'l Parallel and Distributed Processing Symp., Workshop on Personal Computers based Networks of Workstations, Cancun, Mexico, May 2000
    [92] GCiaccio, Optimal Communication Performance on Fast Ethernet with GAMMA, in Proc. Workshop PC-NOW, IPPS/SPDP'98, LNCS No. 1388, Springer, pp:534-548, Orlando, FL, Apr. 1998
    [93] R. D. Russel and P. J. Hatcher, Efficient Kernel Support for Reliable Communication. In Proc. 1998 ACM symp. On Applied Computing, Atlanta, GA, Feb. 1998
    [94] S. Donaldson, J. M. D. Hill, and D. B. Skillicorn, BSP Clusters: High Performance, Reliable and Very Low Cost, Tech. Rep. PRG-TR-5-98, Oxford University Computing Laboratory, Programming Research Group, 1998
    [95] L. Prylli and B. Tourancheau, BIP: A New Protocol Designed for High Performance Networking on Myrinet, in Proc. Workshop PC-NOW, IPPS/SPDP'98, LNCS No. 1388, Springer, pp:472-485, Orlando, FL, Apr. 1998
    [96] [闽应骅 95] 闽应骅，容错计算二十五年，计算机学报, 18(12) :932-943, 1995
    [97] [Dhiraj96] Dhiraj K. P., Fault-tolerant computer system design, Prentice-Hall Press, 1996
    [98] [Nelson90] Nelson V. P., Fault-Tolerant Computing, Fundamental Concepts. Computer, Vol.23:19-25, 1990
    [99] [ 杨孝宗 93] 杨孝宗，容错技术与 STRATUS 容错计算机，哈尔滨工业大学出版社 , 1993
    [100] [Gary91] Gary J. and Siewiorek D. P., High-availability computer multiprocessor systems, Computer, Vol.9:39-48, 1991
    [101] [Abraham95] Abraham J. A., Challenge in fault detection, In Special Issue FTCS-25, pp:96-114, 1995
    [102] [Douglas93] Douglas M. B., Diagnosis and repair in multiprocessor system, IEEE Transaction on Computers, 42(2) :205-210, 1993
    [103] [Morin96] Morin C., Gefflaut A., Banatre K.. COMA: An opportunity for building, fault-tolerant scalable shared-memory multiprocessors, Proc. of the 23rd Annual Symposium on Computers Architecture, pp:56-65, 1996
    [104] Zaleswski J., (Eds.). Advanced Multiprocessor Bus Architecture, IEEE Computer Society Press, 1995


    [105] [Barlett92] Barlett J., et. al. Fault Tolerance in Tandem Computer Systems, in Reliable Compute System: Design and Evaluation. Digital Press, 1992
    [106] Liu F. and Ge J. G, Hardware support flexible low overhead fault tolerance scheme in scalable shared-memory multiprocessors, IEEE-EMBS ASIA PACIFIC CONFERENCE ON BIOMEDICAL ENGINEERING-PROCEEDINGS, PTS 1 & 2, 821-822, 2000
    [107] M. Dal Cin, H. Hessenauer and W. Hohl, The Modular Expandable Multiprocessor System MEMSY, Computer Systems Science & Engineering, Vol.ll(4) :211-219, 1996
    [108] M. Dal Cin, W. Hohl, E. Michel and A. Pataricza, Error Detection Mechanism for Massively Parallel Multiprocessors, in Proc. EUROMICRO Workshop on Parallel and Distributed Processing, pp:401-408, 1993．
    [109] W. Hohl, E. Michel and A. Patricza, Hardware Support for Error Detection in Multiprocessor Systems-A Case Study, Microprocessors and Microsystems, Vol. 17(4) :201-206, 1993
    [110] Manivannan D. and Singhal M., Quasi-synchronous checkpointing: Models, characterization, and classification, IEEE Transactions on Parallel and Distributed Systems, Vol. 10(7) : 703-713, Jul 1999
    [111] Guohong Cao and Singhal M., On Coordinated Checkpointing in Distributed Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 9(12) : 1213-1225, Dec 1998
    [112] Ziv A. and Bruck J., Performance Optimization of Checkpointing Schemes with Task Duplication, IEEE Transactions on Computers, Vol. 46(12) : 1381-1386, Dec 1997
    [113] D. Manivannan, R. H. B. Netzer and M. Singhal, Finding Consistent Global Checkpoints in a Distributed Computation, IEEE Trans. Parallel and Distributed Systems, Vol. 8(6) : 623-627, Jun 1997
    [114] Xu J. and R.H.B. Netzer, Adaptive Independent Checkpointing for Reducing Rollback Propagation, Proc. Fifth IEEE Symp. Parallel and Distributed Processing, Dec. 1993
    [115] Alvisi L. and Marzullo K., Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, Vol. 24(2) : 149-159, Feb 1998
    [116] Elnozahy E.N. and Zwaenepoel W., On the use and implementation of message logging, Twenty-Fourth International Symposium on Fault-Tolerant Computing, (FTCS-24) , pp: 298-307, Jun 1994
    [117] Kwang-Sik Chung, Ki-Bom Kim, et al., Hybrid checkpointing protocol based on selective-sender-based message logging, Proc. Of International Conf. On Parallel and Distributed Systems(1997) , pp: 788-793, Dec 1997
    [118] Jian Xu, Netzer R.H.B. and Mackey M., Sender-based message logging for reducing rollback propagation, Proc. Seventh IEEE Symp. On Parallel and Distributed Processing, pp: 602-609, Oct 1995
    [119] Beedubail G, Karmarkar A., Gurijala A., Marti W. and Pooch U., An algorithm for supporting fault tolerant objects in distributed object oriented operating systems, Fourth International Workshop on Object-Orientation in Operating Systems, 1995, pp: 142-148, Aug 1995
    [120] M. H. Theimer and B. Hayes, Heterogeneous process migration by recompilation, In Proc of the 11th IEEE International Conf. On Distributed Computing Systems, pp: 18-25 Jun 1991


    [121] Litskow M., T. Tannenbaum, J.Basney and M. Livny, Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, Technical Report #1346, University of Wisconsin-Madison Computer Sciences, Apr. 1997
    [122] Dimitrov B., and Vernon Rego, Arache: A Portable Threads System Supporting Migrant Thread on Heterogeneous Network Farms, IEEE Transactions on Parallel and Distributed Systems, Vol. 9(5) :459-469, May 1998
    [123] Holtkamp M., Thread Migration with Active Threads, International Computer Science Institute, Technical Report, TR-97-038．
    [124] [Wahbe93] Wahbe R. and Lucco S. A., Efficient software-based fault isolation, Proc. of 14th ACM Symposium on Operation System Principle, pp.203-216, 1993
    [125] 朱正 , STRATUS 容错计算机的技术特点，抗恶劣环境计算机 , 7(1) : 1-6, 1993
    [126] Laprie J. C., Dependable Computing and Fault Tolerance: Concepts and Terminology, Proc. Of the 15th Int'l Symp. On Fault-Tolerant Computing, pp: 2-11,1985
    [127] Heimann D.I., N. Mittal and K..S. Trivedi, Availability and Reliability Modeling for Computer Systems, Advances in Computers, 31, M. C. Yovits, ed., Academic Press, Inc., pp: 176-235, 1990
    [128] Laprie J.C., Dependability: a Unifying Concept for Reliable Computing and Fault Tolerance, Dependability of Resilient Computers, T. Anderson, ed., BSP Professional Books, pp: 1-28, 1989
    [129] Jalote P., Fault Tolerance in Distributed Systems, Prentice Hall, Inc., 1994
    [130] Preparata F.P., G. Metze and R.T. Chien, On the Connection Assignment Problem of Diagnosable Systems, IEEE Transactions on Computers, Vol. 16(6) : 848-854, Dec. 1967
    [131] Lomet D.B., Process Structuring, Synchronization, and Recovery Using Atomic Actions, Proc. Of ACM Conf. On Language Design for Reliable Software, SIGPLAN Notices, 12(3) : 128-137, 1977
    [132] Allchin J.E., An Architecture for Reliable Decentralized Systems, Ph. D. Dissertation, School of Info. And Computer Science, Georgia Inst. Of Tech., Sep. 1983
    [133] Shrivastava S.K., Structuring Distributed Systems for Recoverability and Crash Resistance, IEEE Transaction on software Engineering, 7(7) : 436-447, Jul. 1981
    [134] Svobodova L., Resilient Distributed Computing, IEEE Transaction on software Engineering, 10(5) : 257-268, May 1984
    [135] Gregory S.T and J.C. Knight, Concurrent System Recovery, Dependability of Resilient Computers, T. Anderson, ed., BSP Professional Books, pp: 167-190, 1989
    [136] Jalote P., Fault Tolerant Processes, Distributed Computing, 3: 187-195, 1989
    [137] Mancini L.V. and S.K. Shrivastava, Replication within Atomic Actions and Conversations: A Case Study in Fault-Tolerance Duality, Proc. Of the 19th Int'l Symp. On Fault-Tolerant Computing, pp: 454-461, 1989
    [138] Dolev, D.N., The Byzantine Generals Strike Again, Journal of Algorithms, 3: 14-30, Jan. 1989
    [139] Best E. and F. Cristian, Systematic Detection of Exception Occurrence, Sci. Computer Program, 1(1) : 115-144, 1981
    [140] Horning J.J., H.C. Lauer, et al., A Program Structure for Error Detection and Recovery, Proc. Of the Conf. On Operating Systems: Theoretical an Practical Aspects, pp: 177-193, 1974


    [141] Avizienis A., The N-version Approach to Fault-Tolerant Software, IEEE Transaction on Software Engineering, 13(12) : 1491-1510, Dec. 1985
    [142] Schlichting R.D. and F.B. Schneider, Fault-stop Processor: An Approach to Designing Fault-Tolerant Computing Systems, ACM Transactions on Computing Systems, 1(3) : 222-238, 1983
    [143] Lamport L., R. Shostak and M. Pease, The Byzantine Generals Problems, ACM Transactions on Programming Languages and Systems, 4(2) : 382-401, 1982
    [144] Anderson T. and P. A. Lee, Fault Tolerance-Principles and Practice, Prentice-Hall, Inc., 1981
    [145] Liskov B. and R. Scheifler, Guardians and Actions: Linguistic Support for robust, Distributed Program, Proc. Of the 9th Annual Symp. On Principles of Programming Languages, pp:7-19, 1982
    [146] Altmann 3． T. Bartha and A. Pataricza, On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers, in Proc. IPDS'95, IEEE Int'l Computer Performance and Dependability Symp. pp: 154-164, 1995
    [147] Cin M. Dal, W. Hohl, 3． Honig and A. Pataricza, MEMSY-A Modular Expandable Multiprocessor System with Fault Tolerance, in Proc. Parallel Systems Fair of the 8th IEEE Int. Parallel Processing Symp., Cancun, pp: 21-28, 1994
    [148] DEconinck G, J. Vounckx, et al., A Scalable Implementation of Fault Tolerance for Massively Parallel Systems, in Proc. MPCS'96, 2nd Int'l Conf. On Massively Parallel Computing Systems, pp: 214-221, 1996
    [149] Gryfier A. and M. Dal Cin, A Stable Storage Unit for Multiprocessors, in Proc. Pacific Rim Int'l Symp. On Fault-Tolerant Systems, Newport Beach, Ca, pp: 158-163, 1995
    [150] Majzik I., W. Hohl, A. Pataricza and V. Sieh, Multiprocessor Checking using Watchdog Processors, Computer Systems Science & Engineering, Vol. 5: 301-310, 1996
    [151] Lala P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice Hall, NJ 1985
    [152] Siweiorek D.P. and R.S. Swarz. The Theory and Practice of Reliable System Design, Digital Press, Digital Equipment Corporation, ISBN 0-932376-13-4, 1983
    [153] Rela M., H. Madeira and J.G. Silva, Experimental Evaluation of the Fail-Silent Behavior in Programs with Consistency Checks, in Proc. FTCS-26, pp: 394-403, 1996
    [154] Paraticza A., I. Majzik, W. Hohl and J. Honig, Watchdog Processors in Parallel Systems, Microprocessing and Microprogramming, Vol. 39: 69-74, 1993
    [155] Schmid M., R. Trapp, A. Davidoff and G Masson, Upset Exposure by Means of Abstraction Verification, in Proc. FTCS-12, pp: 237-244, 1982
    [156] Bartlett J.F., A 'Non-Stop' Operating System, in Proc. Hawaii Int'l Conf. Of System Sciences, 1978
    [157] Bianchini R. and R. Buskens, An Adaptive Distributed System-Level Diagnosis Algorithm and its Implementation, in Proc. FTCS-21, pp: 222-229, 1991
    [158] Cin M. Dal and F. Florian, Analysis of Fault-Tolerant Distributed Diagnosis Algorithm, in Proc. FTCS-15,pp: 159-164, 1985
    [159] Kuhl J.G. and S.M. Reddy, Fault-diagnosis in Fully Distributed Systems, in Proc. FTCS-11, pp: 100-105, 1981
    [160] Meyer F.J. and G Masson, An Efficient Fault Diagnosis Algorithm for Symmetric Multiprocessor Architecture, IEEE ToC, Vol.EC-27: 1059-1063, 1978


    [161] Long J., W.K. Fuchs and J. A. Abraham, Forward Recovery Using Checkpointing in Parallel Systems, Proc. Of Int'l Conf. On Parallel Processing, pp: I 272-I 275, 1992
    [162] Pradhan D.K. and N.H. Vaidya, Roll-Forward checkpointing Scheme: Concurrent retry with Nondedicated Spares, Proc. Of Workshop Fault-Tolerant Parallel and Distributed Systems, pp: 166-174, 1992
    [163] Huang K., J. Wu and E.B. Fernandez, A Generalized Forward Recovery Checkpointing Scheme, in Parallel and Distributed Processing, J. Rolim, ed., LNCS 1388, Springer-Verlag, pp: 623-643, 1998
    [164] Bowen N.S. and D.K. Pradhan, Processor-and Memory-based Checkpoint and Rollback Recovery, EEEE Computers, 26(2) : 22-30, 1993
    [165] Birman K.P. an T.A. Joseph, Reliable Communication in the Presence of Failure, ACM Transactions on Computer Systems, 5(1) : 47-76, Feb. 1987
    [166] Zhang C. and C.Q. Yang, Analytical Analysis of Reliability for Executing Remote Programs on Idling Workstations, Proc. Of IEEE 9th Annual Int'l Phoenix Conf. On Comp. And Comm., pp: 10-16,1990
    [167] Koo R. and S. Toueg, Checkpointing and Rollback Recovery for Distributed Systems, IEEE Transactions on Software Engineering, 13(1) : 23-31, Jan. 1987
    [168] Hilary J.M., A. Mostefaoui, R.H.B. Netzer and M. Raynal, Preventing Useless Checkpoints in Distributed Computation, Proc. Of the 17th Int'l Symp. on Fault-Tolerant Computing Systems, pp: 68-77, Jun. 1997
    [169] Wang Y.-M, Y. Huang, K.-P. Vo, P.Y. Chuang and C. Kintala, Checkpointing and its Applications, Proc. Of the 25th Int'l Symp. on Fault-Tolerant Computing, pp: 22-30,1995
    [170] Elnozahy E.N., D.B. Johnson, and W. Zwaenepoel, The performance of Consistent Checkpointing, IEEE Symp. on Reliable and Distributed Systems, pp:39-47, 1992
    [171] Plank J.S., M. Beck, G. Kingsley and K. Li, Libckpt: Transparent checkpointing under Unix, Proc. Of USENIX Winter 1995 Technical Conf., pp: 213-223, 1995
    [172] Ramkumar B. and V. Strumpen, Portable Checkpointing for Heterogeneous Architectures, Proc. Of the 27th Int'l Symp. on Fault-Tolerant Computing, pp: 58-67, 1997
    [173] Torng H.C. and N.C. Wilhelm, The Optimal Interconnection of Circuit Modules in Microprocessor and Digital System Design, IEEE Transactions on Computers, Vol. 26(5) : 450-457, May 1977
    [174] Mudge T.N., D.C. Winsor and J.P. Hayes, Multiple Bus Architectures, Computer, Vol.20: 42-48, Jun. 1987
    [175] Pradhan D.K., Fault-Tolerant Multiprocessor Link and Bus Network Architectures, IEEE Transactions on Computers, Vol. 34(1) : 33-45, 1985
    [176] Ward L., A Fault Tolerance Multiple Bus Interconnection Network, PhD dissertation, Dept. of Computer Science, Florida State Univ., 1994
    [177] Ku H.K. and J.P. Hayes, Connectivity and Fault Tolerance of Multiple-Bus Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 8(6) : 574-586, Jun 1997
    [178] Ku H.K. and J.P. Hayes, Systematic Design of Fault-Tolerant Multiprocessors with Shared Buses, IEEE Transactions on Computers, Vol. 46(4) : 439-455,1997
    [179] Tu Huan-Yu and Lois W. Hawkes, Families of Optimal Fault-Tolerant Multiple-Bus Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 12(1) , Jan. 2001


    [180] Gaughan P.T. and S. Yalamanchili, A family of Fault-Tolerant Routing Protocols for Direct Multiprocessor Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 6(5) : 482-495, May 1995
    [181] Duato J., A Theory of Fault-Tolerant Routing in Wormhole Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 8(8) : 790-802, Aug. 1997
    [182] Almohammad B. and B. Bose, Fault-Tolerant Broadcasting in Toroidal Networks, in Parallel and Distributed Processing, J. Rolim, ed., LNCS 1388, pp: 681-692
    [183] Wu J., Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multicomputers Using Extended Safety Levels, Proc. Of the 18th Int'l Conf. On Distributed Computing Systems, pp: 428-435, 1998
    [184] Ramanathan P. and K.G. Shin, Reliable Broadcast in Hypercube Multicomputers, IEEE Transactions on Computers, Vol. 37(12) : 1654-1657, Dec. 1988
    [185] Lin X. and L.M. Ni, Multicast in Multicomputer Networks, Proc. Of the 1990 Int'l Conf. On Parallel Processing, 3: 114-118, 1990
    [186] Lan Y., A.H. Esfahanian and L.M. Ni, Multicast in Hypercube Multiprocessors, Journal of Parallel and Distributed Computing, 8: 30-41
    [187] Kesavan R. and D.K. Panda, Minimizing Node Contention in Multiple Multicast on Wormhole k-ary n-cube Networks, Proc. Of the 1996 Int'l Conf. On Parallel Processing, 1188-1195, 1996
    [188] Lan Y., Fault-Tolerant Multi-Destination Routing in Hypercube Multicomputers, Proc. Of the 12th Int'l Conf. On Distributed Computing Systems, pp: 632-639, 1992
    [189] Sheu J.P. and M.Y. Su, A Multicast Algorithm for Hypercube Multiprocessors, Proc. Of the 1992 Int'l Conf. On Parallel Processing, (Ⅲ)18-22, 1992
    [190] Liang A.C., S. Bhattacharya and W.T. Tsai, Fault-Tolerant Multicasting in Hypercubes, Journal of Parallel and Distributed Computing, 23(3) : 418-428, Dec 1994
    [191] Melliar-Smith P.M., L.E. Moser and V. Agrawala, Broadcast Protocols for Distributed Systems, IEEE Transactions on Parallel and Distributed Systems, 1(1) : 17-25, Jan 1990
    [192] Birman K.P. and T.A. Joseph, Reliable Communication in the Presence of Failure, ACM Transactions on Computer Systems, 5(1) : 47-76, Feb. 1987
    [193] Amdahl G.M., Validity of Single-Processor Approach to Achieving Large-Scale Computing Capability, Proc. Of AFIPS Conf., pp. 483-485, Reston, VA., 1967
    [194] Gustafson J.L., Reevaluating Amdahl's Law, Communications of the ACM, 31(5) : 532-533, May 1988
    [195] Sun X.H. and L.M. Ni, Scalable Problems and Memory-Bound Speedup, Journal of Parallel and Distributed Computing, 1993
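    [196] Cai Xiyao (蔡希尧), Logical Analysis and Design of Multiprocessor Systems, Northwest Institute of Telecommunications Press, 1987 (in Chinese)
    [197] Lin Chuang (林闯), Stochastic Petri Nets and System Performance Evaluation, Tsinghua University Press, 2000 (in Chinese)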
    [198] Tsuei T.F. and Vernon M.K., A multiprocessor bus design model validated by system measurement. IEEE Transactions on Parallel and Distributed Systems, 3(6) : 712-27, 1992
    [199] Peterson J.L., Petri Net Theory and the Modeling of Systems, Prentice-Hall, 1981
    [200] Silva M., Colom J.M., Petri nets applied to the modeling and analysis of computer architecture problem, Microprocessing and Microprogramming, 38:1-11, 1993
    [201] Bhuyan L.N., Zhang X., Multiprocessor performance measurement and evaluation, IEEE Computer Society Press, 1995


    [202] Ni L.M. and K. Hwang, Optimal load balancing in a multiple processor system with many job classes, IEEE Transactions on Software Engineering, 21(5) : 491-496, 1995
    [203] Ferrari D. Computer system performance. Prentice-Hall, 1978
    [204] Jain R. The art of computer system performance analysis. John Wiley and Sons, 1991
    [205] Kung H. T. How to Move Parallel Processing into the Mainstream, Proc. First Workshop on Parallel Processing, Taiwan, China, Dec. 1990
    [206] Eager D.L., J. Zahorjan and E.D. Lazowska, Speedup Versus Efficiency in Parallel Systems, IEEE Transactions on Computers, 38(3) : 408-432, Mar. 1989
    [207] Chiola G., C. Dutheillet, G. Franceschinis and S. Haddad, Stochastic Well-Formed Colored Nets and Symmetric Modeling Applications, IEEE Transactions on Computers, Vol. 42(11) : 1343-1360, 1993
    [208] Dutheillet C. and S. Haddad, Regular Stochastic Petri Net, in Proc. 10th European Workshop on Applications and Theory of Petri Nets, 1989
    [209] Chiola G., G. Bruno and T. Demaria, Introducing a Color Formalism into Generalized Stochastic Petri Nets, in Proc. 9th European Workshop on Applications and Theory of Petri Nets, pp. 202-215, Venezia, Italy, 1988
    [210] Marsan M.A. and M. Gerla, Markov models for multiple bus multiprocessor systems, IEEE Transactions on Computers, 36(1) : 76-85, 1987
    [211] Chong Y.Y. and K. Hwang, Evaluation of four consistency models for shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 6(10) : 1085-1099, 1995
    [212] Siegel H.J. et al., Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing, J. Parallel and Distributed Computing, Vol. 16(3) : 199-211, 1992
    [213] Valiant L.G., A Bridging Model for Parallel Computation, Communications of the ACM, Vol. 33(8) : 103-111, 1990
    [214] Tom Shanley, Don Anderson, PCI System Architecture-Fourth Edition, Addison Wesley Longman Press, 1999
    [215] IEEE standard for a versatile backplane bus: VMEbus, ANSI/IEEE Std 1014-1987, 1987
    [216] Haas Peter J. and G.S. Shedler, Stochastic Petri Net Representation of Discrete Event Simulations, IEEE Transactions on Software Engineering, Vol. 15(4) : 381-393, Apr. 1989
    [217] Marsan M.A., G. Conte and G. Balbo, A Class of Generalized Stochastic Petri Nets for the Performance Evaluation of Multiprocessor Systems, ACM Transactions on Computer Systems, 2(2) : 93-122, 1984
    [218] Ciardo G., A. Blakemore, et al., Automated Generation and Analysis of Markov Reward Models Using Stochastic Reward Nets, Linear Algebra, Markov Chains, and Queueing Models, IMA Volumes in Mathematics and its Applications, C. Meyer, R.J. Plemmons (ed.). Springer-Verlag, 48: 145-191, 1993
    [219] Zuberek W.N., Performance Evaluation Using Unbounded Timed Petri Nets, Proc. Of the 3rd International Workshop on Petri Nets and Performance Models, Kyoto, Japan, pp. 180-186, 1989
    [220] Chiola G., C. Dutheillet, G. Franceschinis and S. Haddad, Stochastic Well-formed Coloured Nets and Multiprocessor Modeling Applications, in High-Level Petri Nets: Theory and Application, K. Jensen and G. Rozenberg, (ed.). New York: Springer-Verlag, 1991


    [221] Rossano Gaeta, Efficient Discrete-Event Simulation of Color Petri Nets, IEEE Transactions on Software Engineering, Vol. 22(9) : 629-639, Sep. 1996
    [222] L. Sha, R. Rajkumar and J. P. Lehoczky, Priority Inheritance Protocols: An Approach to Real-Time Synchronization, IEEE Trans. on Computers, 39(9), pp. 1175-1185, Sep. 1990
    [223] L. Sha, R. Rajkumar and J. P. Lehoczky, Real-Time Computing with IEEE Futurebus+, IEEE Micro, pp. 30-33, 95-100, June 1991
    [224] Vernon M.K. and U. Manber, Distributed round-robin and first-come first-serve protocols and their application to multiprocessor bus arbitration, the 15th Annual International Symposium on Computer Architecture, pp. 269-277, 1988
    [225] J. Chao, A Novel Architecture for Queue Management in the ATM Network, IEEE J. Selected Areas in Comm., 9(7), pp. 1110-1118, Sept. 1991
    [226] J. Chao and N. Uzun, A VLSI Sequencer Chip for ATM Traffic Shaper and Queue Management, IEEE J. Solid-State Circuits, 27(11) : 1634-1643, Nov. 1992
    [227] A. S. Tanenbaum, S. J. Mullender, and R. van Renesse, Using Sparse Capabilities in a Distributed Operating System, Technical Report, Department of Mathematics and Computer Science, Vrije Universiteit
