Research on Self-Healing Regulation Technology for Distributed Mission-Critical Systems

Title (English): Research on Self-Healing Regulation Technology for Distributed Mission-Critical Systems
Author: Lu Xu (卢旭)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Autonomic Computing; Self-Healing Regulation; Self-Healing Strategy Generation; Failure Prediction; DAG Task Migration
Degree year: 2011
Supervisor: Wang Huiqiang (王慧强)
Discipline code: 081203
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Submission date: 2011-04-01

Abstract

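The heterogeneity and complexity of distributed mission-critical systems, together with their dynamically changing operating environments, inevitably lead to system failures, task deviation or even interruption, crashes and hangs, with severe consequences such as major economic losses and even casualties. They also make it increasingly difficult to manage and recover such systems manually and to keep missions running without interruption. Against this background, autonomic computing, which takes self-management capability as its core research goal, has gradually attracted wide attention and has been studied and applied in depth in many fields. Self-healing regulation is one of the fundamental key technologies of autonomic computing. For distributed mission-critical systems it implements basic system design functions such as failure monitoring and prediction, self-healing regulation strategy generation, and critical task scheduling, and it plays an important role in safeguarding the reliability and sustainability of critical task execution. Addressing the mission-continuity requirements of mission-critical systems, this dissertation studies the key technologies of self-healing regulation for distributed mission-critical systems and their application.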
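     Starting from a discussion of the overall design principles of self-healing regulation, the dissertation first identifies the basic principles that the overall design must consider and gives a comprehensive evaluation metric system for the self-healing regulation design process. On this basis it proposes an overall self-healing regulation architecture and describes in detail the design rationale and key implementation techniques. For the formal modeling of critical task execution, the state π-calculus is used to describe the semantics of critical task execution and switching, and the critical task execution logic is verified, which provides a theoretical guarantee of feasibility and soundness for the subsequent research on key self-healing regulation techniques.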
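     Dynamic generation of self-healing regulation strategies is the core of self-healing regulation research for distributed mission-critical systems. A policy-based self-healing regulation pattern is proposed, the basic representation of self-healing regulation strategies is described, and the steps for classifying and simplifying strategies in dynamic strategy management are given. Because failure detection mechanisms have low accuracy and faults are hard to locate, a strategy update algorithm based on Partially Observable Markov Decision Processes (POMDP) is proposed; the POMDP policy is solved by approximate iteration, and a theoretical analysis of the convergence of the iteration is given. The simulation experiments first compile statistics on the effect of recovery strategies in the LANL (Los Alamos National Laboratory) failure data, then measure the iteration and convergence speed of policy solving and compare the recovery effect of several types of self-healing strategy. The results show that, compared with fixed strategies, the POMDP strategy iterates faster and achieves shorter recovery time under inaccurate failure detection.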
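     Data analysis and prediction for self-healing regulation is a necessary condition for failure self-healing in distributed mission-critical systems. For nonlinearly correlated failure data, which is high-dimensional and sparse, a co-clustering algorithm for nonlinearly correlated failure events is first proposed; it uses the difference in mutual-information entropy loss as its metric, and its convergence in a finite number of iterations is analyzed theoretically. Then, for numerical failure data, a supervised locally linear embedding algorithm is used for dimensionality reduction, and failures are anticipated through failure pattern recognition. The experiments first compare the clustering quality and convergence speed of different algorithms on failure data sets, then collect system state metrics in faulty and normal states and analyze prediction performance. The results show that the proposed analysis method for nonlinearly correlated failure data can effectively cluster failure data objects, and that failure predictions based on locally linear embedding can inform decisions on proactive recovery operations.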
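     A self-healing scheduling mechanism for critical tasks is an important guarantee for the design and implementation of self-healing regulation in distributed mission-critical systems. Considering the randomness of failure occurrence and the continuity required of critical task execution, and following the principle of "schedule first, optimize later", a critical task scheduling scheme based on DAG task reconstruction and migration is proposed. The directed acyclic graph (DAG) of correlated tasks is first regenerated: a DAG dynamic reconstruction algorithm converts the correlated tasks into a layered DAG. The migration paths of critical tasks are then computed, a theoretical analysis of deadlock avoidance for migratable tasks is given, and migrated tasks are scheduled ahead of time onto currently idle resources, which shortens task execution time. Simulation experiments measure the speedup of the task migration scheme against a wait-for-recovery scheme under three types of injected fault. The results show that the task migration scheme achieves good scheduling quality under elastic load and unknown faults, providing a reasonable and feasible technical approach to uninterrupted operation of mission-critical systems.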
The complexity and heterogeneity of Distributed Mission-Critical Systems (DMCS), together with their dynamic application environments, inevitably lead to system failure, mission suspension, interrupted operation and even system crashes, causing heavy economic losses, loss of life and other serious consequences. DMCS failures also make manual management and recovery more difficult. Therefore, autonomic computing, whose core goal is self-management, has been studied in various fields. Self-Healing Regulation Technology (SHRT) is one of the critical technologies of autonomic computing. DMCS-oriented self-healing regulation provides fundamental functions such as failure prediction, self-healing policy generation and critical task scheduling, which significantly affect the dependability and sustainability of critical mission execution. Aiming at these dependability and sustainability requirements, this dissertation systematically studies self-healing regulation technology and its application.
     The research starts from a discussion of overall design principles. First, the critical problems in the overall design of SHRT are analyzed and a comprehensive evaluation metric system is proposed. Second, the SHRT architecture is proposed and its critical implementation techniques are analyzed. For formal validation of the critical task execution flow, the state π-calculus is applied to describe the semantics of task execution and switching. The critical task execution logic is then validated, which provides assurance of theoretical feasibility and soundness.
     Dynamic generation of self-healing regulation policies is the central research topic of DMCS-oriented self-healing regulation. Based on the self-healing regulation architecture, a policy-based self-healing regulation pattern is proposed, and the basic expression format and logic syntax of policies are discussed. In addition, an approach to simplifying and classifying policies is proposed for dynamic policy management. To address inaccurate failure detection and diagnosis, a self-healing policy regeneration algorithm based on Partially Observable Markov Decision Processes (POMDP) is proposed, and the convergence of the policy is analyzed theoretically. The experiments first use Los Alamos National Laboratory (LANL) failure data to quantify the real-world effect of recovery policies, which shows the necessity of self-healing regulation; simulation experiments then measure the iteration count and convergence speed of policy solving and compare the performance of different types of self-healing policy. The results indicate directions for generating and optimizing self-healing policies.
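The effect of inaccurate failure detection can be illustrated with the Bayesian belief update that underlies any POMDP controller. The sketch below is not the dissertation's policy regeneration algorithm: the two hidden states, the actions, all probabilities and the name `belief_update` are assumptions chosen only for illustration.

```python
# Hypothetical two-state model: the system is "ok" or "failed", and an
# imperfect detector reports "alarm" or "quiet".
STATES = ("ok", "failed")

# Assumed transition model: T[action][s][s2] = P(s2 | s, action).
T = {
    "wait":    {"ok": {"ok": 0.95, "failed": 0.05},
                "failed": {"ok": 0.0, "failed": 1.0}},
    "restart": {"ok": {"ok": 1.0, "failed": 0.0},
                "failed": {"ok": 0.9, "failed": 0.1}},
}

# Assumed observation model: O[s2][o] = P(o | s2). The detector is inaccurate,
# so it raises false alarms and misses some failures.
O = {
    "ok":     {"alarm": 0.2, "quiet": 0.8},
    "failed": {"alarm": 0.7, "quiet": 0.3},
}

def belief_update(belief, action, observation):
    """Posterior b'(s2), proportional to O(o|s2) * sum_s T(s2|s,a) * b(s)."""
    unnormalized = {}
    for s2 in STATES:
        predicted = sum(T[action][s][s2] * belief[s] for s in STATES)
        unnormalized[s2] = O[s2][observation] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}

belief = {"ok": 0.9, "failed": 0.1}
after_alarm = belief_update(belief, "wait", "alarm")
```

A single alarm raises the probability of failure without making it certain, which is why a policy defined over beliefs can choose differently from a fixed policy that trusts the detector outright.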
     Data analysis and prediction for self-healing regulation is a necessary condition for DMCS self-healing. To handle the high dimensionality and sparsity of nonlinearly correlated failure data from high-performance computer systems, an information-theoretic co-clustering algorithm is proposed. The algorithm uses mutual information entropy as its measure, and its convergence and local optimality are proved theoretically. Next, the manifold learning algorithm supervised locally linear embedding (SLLE) is applied for feature extraction. The experiments first compare the clustering quality of different methods on LANL data, then collect system performance metrics under fault injection and in the normal state and compare failure prediction performance. On labeled failure data, the co-clustering algorithm outperforms the other clustering algorithms and is sound and effective at discovering nonlinearly correlated failure patterns. The failure analysis and SLLE-based prediction results show that the method can help predict underlying failures.
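The mutual-information measure behind information-theoretic co-clustering can be sketched as the loss in mutual information when rows and columns of a co-occurrence table are merged into clusters. This is a generic illustration under stated assumptions, not the dissertation's algorithm: the 4 x 4 table of nodes against failure event types is invented, and `mutual_information`, `aggregate` and `information_loss` are hypothetical helper names.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * log2(p / (row_sums[i] * col_sums[j]))
    return mi

def aggregate(joint, row_labels, col_labels):
    """Sum the joint distribution over row clusters and column clusters."""
    k = max(row_labels) + 1
    l = max(col_labels) + 1
    merged = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            merged[row_labels[i]][col_labels[j]] += p
    return merged

def information_loss(joint, row_labels, col_labels):
    """Mutual information lost by replacing rows and columns with clusters."""
    clustered = aggregate(joint, row_labels, col_labels)
    return mutual_information(joint) - mutual_information(clustered)

# Invented co-occurrence counts: 4 nodes (rows) x 4 failure event types (columns).
counts = [[5, 5, 0, 0],
          [4, 6, 0, 0],
          [0, 0, 6, 4],
          [0, 0, 5, 5]]
total = sum(map(sum, counts))
joint = [[c / total for c in row] for row in counts]

# A clustering that follows the block structure loses far less information
# than one that mixes the two blocks.
good = information_loss(joint, [0, 0, 1, 1], [0, 0, 1, 1])
bad = information_loss(joint, [0, 1, 0, 1], [0, 1, 0, 1])
```

A co-clustering procedure of this kind would search over the row and column labels to keep the loss small; the search itself is omitted here.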
     Critical task scheduling for self-healing regulation in DMCS is an important assurance for SHRT design and implementation. Taking into account the randomness of failures and the continuity required of critical task execution, and in order to schedule failed tasks rationally, a critical task scheduling method based on Directed Acyclic Graph (DAG) task reconstruction and migration is proposed, following the principle of scheduling first and optimizing afterwards. First, the DAG of correlated tasks is regenerated by the proposed DAG dynamic reconstruction algorithm, which transforms the correlated tasks into a layered DAG. Then the critical task migration route is computed and a deadlock avoidance analysis for migratable tasks is provided. Migrating critical tasks to currently idle resources markedly reduces task execution time. Simulation experiments compare the speedup of the task migration method and the waiting-recovery method under three kinds of injected fault. The results show that the task migration method achieves better scheduling quality under flexible load and unknown fault injection.
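One common way to obtain a layered DAG is to peel off tasks level by level with Kahn's topological ordering, so that each task lands one layer after its deepest predecessor. The sketch below shows that generic technique only; the abstract does not specify the dissertation's reconstruction algorithm, and the task names and the function `layer_dag` are hypothetical.

```python
from collections import defaultdict

def layer_dag(edges, tasks):
    """Group tasks into layers so every task comes after all its predecessors."""
    successors = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    layers = []
    placed = 0
    # Tasks with no unfinished predecessors form the next layer.
    ready = sorted(t for t in tasks if indegree[t] == 0)
    while ready:
        layers.append(ready)
        placed += len(ready)
        next_ready = []
        for task in ready:
            for nxt in successors[task]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    next_ready.append(nxt)
        ready = sorted(next_ready)
    if placed != len(tasks):
        raise ValueError("task graph contains a cycle")
    return layers

# Invented example: t3 needs t1 and t2, t4 needs t3, t5 needs t2.
tasks = ["t1", "t2", "t3", "t4", "t5"]
edges = [("t1", "t3"), ("t2", "t3"), ("t3", "t4"), ("t2", "t5")]
layers = layer_dag(edges, tasks)
```

Tasks within one layer have no dependencies on each other, so in a migration scheme of the kind described they could in principle be moved to idle resources independently once the earlier layers have finished.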
