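Research on Survivability Enhancing Techniques of Grid Applications

Chinese title: 网格应用生存性增强技术研究
Author: Wang Shupeng (王树鹏)
Thesis level: Doctoral
Discipline: Computer Architecture
Keywords: Grid system (网格系统); Grid application (网格应用); Survivability (生存性); Scheduling (调度); Failure detection (失效检测); Replication protocol (复制协议)
Degree year: 2007
Advisor: Yun Xiaochun (云晓春)
Discipline code: 081201
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2007-06-01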

Abstract

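The emergence and development of grid technology gives people access to large amounts of computing resources for running large-scale applications, which is a major opportunity. In a dynamic and complex grid system, however, malicious attacks or hardware faults can cause grid resources to fail, and the failure rate is far higher than in traditional distributed environments. This poses a serious challenge for running applications on a grid: tasks assigned to remote sites may be unable to execute normally because grid resources fail. Large-scale tasks are especially exposed, since they occupy many resources and run for a long time, so resource failures may prevent them from ever finishing. This dissertation therefore studies how to keep grid applications executing without interruption in a complex, dynamic grid environment. It applies the idea of system survivability to the grid, proposes the concept of grid application survivability, studies survivability analysis methods and a survivability life-cycle model for grid applications, and analyzes the key technologies that support the survivability model. This work has practical significance for advancing grid technology and making it usable in practice. Taking the survivability of grid applications as its research goal, the dissertation examines the following aspects in depth: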
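     The dissertation first introduces the background of grid computing environments and grid applications and analyzes the challenges that grid applications face when executing in a grid environment. It then reviews the state of research on grid security and system survivability. Finally, it clarifies the significance and necessity of studying grid application survivability.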
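     On this basis, the dissertation presents the system models used in the study, namely the grid model, the failure model, and the application model, and defines grid application survivability. It studies survivability analysis methods for grid applications, presents a survivability life-cycle model based on offline prevention and online reconfiguration, and analyzes the key issues in this model that affect grid application survivability, laying the foundation for the work that follows.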
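     To realize the offline prevention mechanism of the survivability model, the dissertation proposes a survivability scheduling objective for grid applications and implements a local cost function that considers both application survivability and makespan. It then designs scheduling algorithms that consider both objectives, one for grid applications made up of independent tasks and one for grid workflow applications. These algorithms avoid scheduling grid applications onto computing resources with high failure rates, which improves application survivability to some extent.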
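The abstract does not give the form of the local cost function, so the following is only a minimal sketch of the idea for independent tasks. It assumes exponentially distributed failures, a weighted sum of normalized finish time and failure risk, and a longest-task-first greedy order; the names `survival_probability`, `local_cost`, `schedule`, and the `weight` parameter are hypothetical and are not taken from the dissertation.

```python
import math


def survival_probability(failure_rate, duration):
    """Chance that a resource stays up for `duration` (assumes exponential failures)."""
    return math.exp(-failure_rate * duration)


def local_cost(finish_time, survival, horizon, weight=0.5):
    """Hypothetical cost: weighted mix of normalized finish time and failure risk."""
    return weight * (finish_time / horizon) + (1.0 - weight) * (1.0 - survival)


def schedule(task_lengths, resources, weight=0.5):
    """Greedy list scheduling: each task goes to the resource with the lowest cost.

    task_lengths: work units per task.
    resources: dict mapping name -> (speed, failure_rate).
    Returns (dict task_index -> resource name, makespan).
    """
    ready = {name: 0.0 for name in resources}  # time at which each resource is free
    # Normalization horizon: all work executed on the slowest resource.
    slowest = min(speed for speed, _ in resources.values())
    horizon = sum(task_lengths) / slowest
    assignment = {}
    # Longest tasks first, a common list-scheduling order (an assumption here).
    for idx in sorted(range(len(task_lengths)), key=lambda i: -task_lengths[i]):
        best_name, best_cost, best_finish = None, float("inf"), 0.0
        for name, (speed, rate) in resources.items():
            run_time = task_lengths[idx] / speed
            finish = ready[name] + run_time
            cost = local_cost(finish, survival_probability(rate, run_time), horizon, weight)
            if cost < best_cost:
                best_name, best_cost, best_finish = name, cost, finish
        assignment[idx] = best_name
        ready[best_name] = best_finish
    return assignment, max(ready.values())
```

With `weight=1.0` the sketch optimizes finish time only and prefers fast resources even when they fail often; lowering `weight` shifts tasks toward resources with low failure rates at the price of a longer makespan.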
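     To strengthen state detection in the online response mechanism, reducing false alarms and missed detections and shortening detection time, the dissertation studies failure detection in grid environments. Current failure detection algorithms use adaptive prediction to track changes in heartbeat transmission delay and so avoid false alarms caused by delay variation, but none of them accounts for heartbeat loss, and lost heartbeats give these algorithms a very high false alarm rate. The dissertation therefore proposes a failure detection algorithm based on PUSH and PULL. Built on an unreliable, semi-synchronous distributed system model, the algorithm resolves the excessive false alarm rate caused by heartbeat loss and shortens detection time.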
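The abstract names the PUSH-and-PULL approach without specifying it, so the sketch below shows only one plausible reading: the monitored node pushes heartbeats, and when one is overdue the monitor sends an explicit PULL probe and suspects a failure only if that probe also goes unanswered. The class `PushPullDetector`, its fixed timeouts, and the `probe` callback are assumptions for illustration; the adaptive delay prediction mentioned in the text is left out.

```python
class PushPullDetector:
    """Toy monitor: relies on PUSH heartbeats, sends a PULL probe before suspecting."""

    def __init__(self, heartbeat_timeout, probe_timeout, probe):
        self.heartbeat_timeout = heartbeat_timeout
        self.probe_timeout = probe_timeout
        self.probe = probe          # hypothetical hook that sends a PULL request
        self.last_heard = 0.0       # time of the last heartbeat or probe reply
        self.probe_sent_at = None   # None while no probe is outstanding
        self.suspected = False

    def on_message(self, now):
        """Call when a heartbeat (PUSH) or a probe reply (PULL) arrives."""
        self.last_heard = now
        self.probe_sent_at = None
        self.suspected = False

    def tick(self, now):
        """Call periodically; returns True while the target is suspected."""
        if self.probe_sent_at is None:
            if now - self.last_heard > self.heartbeat_timeout:
                # A heartbeat is overdue. It may simply have been lost,
                # so ask the target explicitly instead of suspecting it.
                self.probe(now)
                self.probe_sent_at = now
        elif now - self.probe_sent_at > self.probe_timeout:
            # Neither a heartbeat nor a probe reply arrived in time.
            self.suspected = True
        return self.suspected
```

In this reading a single lost heartbeat costs one extra round trip rather than a false alarm, which is the effect the paragraph above attributes to the algorithm.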
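     Finally, the failure response function for grid applications is implemented on top of a replication mechanism. Because existing replication mechanisms offer poor application transparency and generality, the dissertation studies transparent, general-purpose replication. It proposes a message proxy mechanism that operates at the level of network data flows together with a flexible configuration mechanism, presents an asynchronous active replication protocol and a failure response protocol, and implements a transparent, general-purpose replication proxy. The proxy keeps the state of the replicas in a replication group synchronized and performs response and recovery after the primary replica fails.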
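The protocols themselves are not described in the abstract. As a rough illustration of active replication behind a proxy, the sketch below forwards every request to all replicas, answers with the primary's reply, and promotes the next surviving replica when the primary fails. `Replica` and `ReplicationProxy` are hypothetical names, the replicas are in-process objects rather than network services, and the loop is sequential for simplicity, whereas an asynchronous protocol would not wait for the backups before replying.

```python
class Replica:
    """A deterministic state machine; here it just keeps a running total."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.alive = True

    def handle(self, amount):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.total += amount
        return self.total


class ReplicationProxy:
    """Forwards each request to all replicas and answers with the primary's reply."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # index 0 is treated as the primary

    def request(self, amount):
        replies = []
        survivors = []
        for replica in self.replicas:
            try:
                replies.append(replica.handle(amount))
                survivors.append(replica)
            except ConnectionError:
                # Drop the failed replica; the first survivor becomes the primary.
                pass
        if not survivors:
            raise RuntimeError("all replicas failed")
        self.replicas = survivors
        return replies[0]
```

Because every replica applies the same requests in the same order, a backup already holds the primary's state when it takes over, so the client sees no gap in the results.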
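     Building on this work and on the grid application survivability model that combines offline prevention with online reconfiguration, the dissertation designs and implements a grid application scheduling and management system that makes effective use of the supporting techniques proposed above. Finally, a running example of a real grid application demonstrates the effectiveness of the proposed survivability-enhancing techniques.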

English abstract

The emergence and development of the Grid provides large numbers of computing resources for large-scale applications. However, the dynamic and complex nature of Grid systems causes a higher failure rate of Grid resources than in traditional distributed systems. This poses great challenges for the execution of Grid applications. Tasks allocated to Grid resources may be halted by resource failures. Large-scale applications in particular, which require large numbers of resources and long execution times, may be unable to complete. This dissertation therefore focuses on how to keep applications executing normally in a complex Grid system. Survivability theory is applied to the Grid system, and the concept of Grid application survivability is proposed. This research on survivable Grid applications is significant for the development and application of Grid technologies. The main research topics of this dissertation include the following aspects:
     The first part of this dissertation introduces the research background of Grid systems and Grid applications, analyzes the challenges faced in executing Grid applications, and clarifies the significance of research on Grid application survivability. It then reviews the state of research on Grid security and system survivability. Current research on Grid security adopts traditional security theory, and current research on system survivability focuses on traditional distributed information systems. There is no systematic research on the survivability of Grid applications.
     On that basis, this dissertation introduces the system model, including the Grid model, the failure model, and the application model, and defines the survivability of Grid applications. It then proposes a survivability analysis method and a survivability life-cycle model for Grid applications, and introduces the key technologies that support survivable Grid applications.
     To enable Grid applications to guard against the failure of Grid resources, the dissertation proposes a survivability scheduling objective and a cost function that considers survivability and makespan at the same time. Scheduling algorithms that consider both objectives are then proposed for Grid independent-task applications and Grid workflow applications respectively. These scheduling algorithms prevent Grid tasks from being scheduled onto Grid resources with higher failure rates.
     To improve failure detection capability and to reduce both the error rate of failure detection and the detection time, the failure detection mechanism in the Grid environment is studied. Current failure detection algorithms adapt to variation in transmission delay through an adaptive mechanism and so reduce the detection errors caused by that variation. However, these algorithms do not consider the loss of detection packets, which causes a high error rate. To solve this problem, a PUSH-and-PULL based failure detection algorithm is proposed. This algorithm is based on the semi-synchronous distributed system model and effectively reduces the high error rate of failure detection.
     Finally, the failure response capability is implemented by a transparent replication mechanism. A message agent mechanism at the level of network message flow and a flexible configuration mechanism are proposed, an asynchronous active replication protocol and a failure response protocol are presented, and a transparent, general-purpose replication agent is implemented. This agent synchronizes the state of the replicas in the replica group and provides failure recovery after the primary replica fails.
     Based on the above studies, a Grid application scheduling and management system is designed and implemented using the off-line defense and on-line reconfiguration techniques. This system makes effective use of the survivability-enhancing techniques proposed in the previous chapters. Finally, the effectiveness of these Grid survivability-enhancing techniques is demonstrated by the execution of a real application.

