Probabilistic Fault Localization for Value-Added Services
Abstract
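With the evolution, maturation, and deployment of communication technologies such as IN (Intelligent Network), 3G/IMS (Third Generation/IP Multimedia Subsystem), and NGN (Next Generation Network), the service provisioning capability of communication networks keeps improving, and more and more value-added services have appeared. These services place higher demands on network management and on operation and maintenance. Faults such as service unavailability and degraded quality of service not only cause operators economic losses but also reduce user loyalty and can even drive customers away. Fault diagnosis is a key technology for ensuring the high availability, high reliability, and quality of service of value-added services. Within it, fault localization largely determines the efficiency and effectiveness of fault diagnosis and is its core technique. An in-depth study of fault diagnosis techniques suited to value-added services, fault localization in particular, therefore has significant practical and research value.
     Traditional fault diagnosis concentrates on detecting and localizing faults in resources such as devices and networks; it is concerned with the running state of devices and the connectivity of the network. The dependencies between these resources and services, the scope of their impact, and the strength of their association fall outside its scope. Service faults differ considerably from traditional network faults: (1) Service faults are hard to model. Compared with resource fault modeling in traditional fault management, the diversity, dynamics, abstractness, dependencies, and multi-domain nature of services make fault modeling more complex. (2) The causes of service faults are complex. Besides network, platform, and software causes, there may be human causes, and there are also inter-domain faults. (3) The scope of service faults is wider. Because telecom services must be highly reliable, highly available, and operable, service faults include not only functional faults and performance faults but also service support faults. (4) Service faults are non-deterministic and hard to identify and localize. The running state of a service can often be determined only by weighing the surrounding context, especially when quality of service degrades gradually. (5) Service faults affect users far more directly than resource faults do, and users are highly sensitive to them, which places higher demands on the efficiency and effectiveness of service fault localization.
     This dissertation takes the rapidly developing telecom value-added services as its research object. Its concrete goals are to reduce the complexity of fault localization for value-added services, raise the detection rate, lower the false positive rate, shorten the localization time, and improve the efficiency and effectiveness of fault localization. The research centers on the key techniques for runtime fault localization of value-added services. The main innovations obtained are described in detail in the dissertation and are summarized as follows: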
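     (1) Traditional fault models seldom model the association between resources and services, and they ignore the dependencies, impact scope, and association strength between the two. Two fault modeling methods are therefore proposed: a shallow-knowledge method based on statistics and data mining (Statistics and Data Mining, SDM), and a method that borrows the idea of overlay networks and combines it with end-to-end service provisioning (Overlay Network and End-to-End Service Provisioning, ONEE). ONEE comprises horizontal fault modeling among service components and vertical fault modeling between service components and resource components. SDM and ONEE bridge the gap between value-added services and the fault diagnosis system, so that fault models for value-added services can be built accurately and conveniently.
     (2) Optimal probabilistic fault localization has been proven NP-hard and is difficult to apply to large-scale, real-time value-added services. To meet the fault localization needs of value-added services, an efficient heuristic probabilistic fault localization algorithm, BSD (Bayesian Suspect Degree), is proposed. It uses a probability-weighted bipartite graph as the fault propagation model and follows a greedy strategy. Unlike existing heuristic fault localization algorithms built on minimum set cover, BSD uses valid incremental coverage, which reduces the chance of selecting the wrong faults. Analysis and simulation confirm that BSD is efficient and localizes faults well.
     (3) Most existing fault localization algorithms observe alarms within a time window. In practice, however, the window size is hard to set accurately, and an ill-chosen observation window often degrades the performance of a fault localization algorithm noticeably. To address this, an event-driven, non-deterministic incremental fault localization algorithm, IBSD (Incremental Bayesian Suspect Degree), is proposed. IBSD removes the drawbacks of fault localization based on alarm observation windows. Simulation experiments show that it outperforms the existing IHU (Incremental Hypothesis Update) algorithm.
     (4) Although BSD and IBSD are robust to some degree, they include no specific measures for noisy environments with lost or spurious symptoms, so their performance drops considerably under heavy noise. The MICAS (Minimum Interactive Checking with Adaptive Strategy) algorithm is therefore proposed for probabilistic fault localization in noisy environments. By introducing three mechanisms, namely an enhanced evaluation function, minimum interactive checking, and an adaptive threshold-setting strategy, MICAS still achieves very good localization results when the symptom loss rate and the spurious symptom rate are high.
     (5) Event-driven fault localization removes the influence of the alarm observation window on localization accuracy, but it is inefficient and has difficulty handling large numbers of concurrent symptoms. Moreover, the localization results produced before enough symptoms have accumulated have no practical value. The alarm observation window and the efficiency of the algorithm must both be taken into account. An incremental fault localization algorithm based on a sliding window with a preprocessing mechanism, SWPM (Sliding Window with Preprocessing Mechanism), is therefore proposed. Simulation results confirm the effectiveness of SWPM.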
With the advances of IN (Intelligent Network), 3G/IMS (Third Generation/IP Multimedia Subsystem), and NGN (Next Generation Network), the service provisioning capability of communication networks has greatly improved, and more and more value-added services have emerged, posing new challenges for network management and OAM (Operation, Administration and Maintenance). Unavailable services and poor QoS (Quality of Service) cause not only loss of revenue but also reduced customer loyalty and even loss of customers. Fault diagnosis is a key technology for ensuring high availability, high reliability, and quality of service. Fault localization, as the central element of fault diagnosis, determines its efficiency and effectiveness to a large extent. The study of fault diagnosis techniques for value-added services, especially fault localization techniques, is therefore important for both industrial application and academic research.
     Traditional fault diagnosis focuses on detecting and localizing faults in devices and networks. It attends to the operating status of devices and the state of network connections, and it does not consider the relationships between resources and services, such as causality, the way of impact, and the strength of dependency. Service faults differ considerably from traditional faults: (1) Modeling service faults is more difficult. Compared with resource fault modeling in traditional fault management, service fault modeling is more challenging because of the diversity, dynamics, abstractness, dependencies, and multi-domain character of services. (2) The root causes of service failures are more complicated. Besides network, platform, and software causes, faults are often caused by human factors. (3) The scope of service faults is wider. The requirements of high availability and operability mean that service failures include not only function faults and performance faults but also support (assistant function) faults and inter-domain faults. (4) The non-deterministic status of a service fault is usually difficult to recognize. The operating status of a service often has to be judged from its context and environment, especially when service quality degrades gradually. (5) The impact of service faults on users is greater than that of resource faults. The sensitivity of users imposes further challenges on the efficiency and effectiveness of service fault localization.
     This dissertation takes emerging value-added services as its research object. It aims to reduce the computational complexity of fault localization, improve the accuracy of fault detection, shorten the fault localization time, and improve the efficiency and effectiveness of fault localization, and it focuses on the key technologies for fault localization of value-added services at runtime. The innovations of the research are described in detail in the dissertation and are listed as follows:
     (1) Traditional fault models often omit the relationships between resources and services and do not consider the dependencies, the way of impact, or the strength of dependency. Therefore, we propose two modeling approaches: fault modeling based on Statistics and Data Mining (SDM) and fault modeling inspired by overlay networks and end-to-end service provisioning (ONEE). ONEE consists of two sub-methods: horizontal fault modeling among service components and vertical fault modeling between service components and resource components. Together, SDM and ONEE bridge the gap between value-added services and the fault diagnosis system and generate models for value-added services accurately and quickly.
     (2) Optimal probabilistic fault localization has been proven to be NP-hard and can hardly be applied to large-scale, real-time value-added services. Considering the requirements of probabilistic fault localization for value-added services, we present a heuristic fault localization algorithm called BSD (Bayesian Suspect Degree), based on a probabilistic bipartite graph and a greedy strategy. Unlike existing algorithms based on the minimum set cover problem, BSD uses valid incremental coverage, which reduces the likelihood of selecting the wrong faults. Analysis and simulations demonstrate the efficiency and effectiveness of BSD.
     (3) Most existing algorithms depend on the symptoms observed within a certain time window. In practice, however, the appropriate window size is hard to determine, and an improper time window can noticeably degrade the performance of a fault localization algorithm. To address this limitation of time windows in OAM practice, we develop an event-driven incremental probabilistic fault diagnosis algorithm called IBSD (Incremental Bayesian Suspect Degree). IBSD overcomes the drawback of inaccurate time windows in fault localization. Simulations show that IBSD outperforms the existing IHU (Incremental Hypothesis Update) algorithm.
     (4) Although BSD and IBSD remain effective in the presence of slight noise, their performance degrades markedly under heavy noise because they include no specific measures against lost or spurious symptoms. Thus, based on BSD, we present an algorithm called MICAS (Minimum Interactive Checking with Adaptive Strategy). Through an enhanced evaluation function, minimum interactive checking, and adaptive threshold setting, MICAS achieves very good fault localization performance even with a large number of lost and spurious symptoms.
     (5) Event-driven fault localization algorithms eliminate the effect of inaccurate symptom observation windows, but they are inefficient and have difficulty handling large numbers of concurrent symptoms. Moreover, localization results produced before enough symptoms have accumulated are often wrong and of no use to network operators. Both the observation window and the efficiency of the algorithm need to be considered. Therefore, we present a fault localization algorithm based on a sliding window with a preprocessing mechanism (SWPM). Simulation results demonstrate the validity of SWPM.
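     The general idea in contribution (2), greedy fault selection over a probability-weighted bipartite graph, can be illustrated with a small sketch. It is not the BSD algorithm: the abstract does not give BSD's suspect-degree formula, so the scoring rule below (prior probability times the summed symptom probabilities over symptoms that are still unexplained) is an assumption chosen only for illustration, and the names greedy_localize, priors, and cond are hypothetical.

```python
def greedy_localize(priors, cond, observed):
    """Greedy fault selection on a probability-weighted bipartite graph.

    priors:   {fault: P(fault)}
    cond:     {fault: {symptom: P(symptom | fault)}}  (edges of the graph)
    observed: set of observed symptoms
    Returns (selected faults in pick order, symptoms left unexplained).
    """
    uncovered = set(observed)
    hypothesis = []
    while uncovered:
        best, best_score = None, 0.0
        for fault, edges in cond.items():
            if fault in hypothesis:
                continue
            # Assumed score: only symptoms not yet explained contribute,
            # so a fault that adds no new coverage scores zero.
            gain = sum(p for s, p in edges.items() if s in uncovered)
            score = priors[fault] * gain
            if score > best_score:
                best, best_score = fault, score
        if best is None:
            break  # no remaining fault explains any uncovered symptom
        hypothesis.append(best)
        uncovered -= set(cond[best])
    return hypothesis, uncovered


priors = {"f1": 0.01, "f2": 0.02, "f3": 0.01}
cond = {
    "f1": {"s1": 0.9, "s2": 0.8},
    "f2": {"s2": 0.5},
    "f3": {"s3": 0.7},
}
faults, unexplained = greedy_localize(priors, cond, {"s1", "s2", "s3"})
print(faults, unexplained)
```

     In this toy input, f2 is never selected because its only symptom is already explained once f1 has been picked. Scoring only the newly explained symptoms is one simple way to avoid adding redundant faults to the hypothesis.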