动态复杂网络中的异常检测问题的研究

英文题名：Research on Anomaly Detection in Dynamic Complex Networks
作者：孟啸
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：异常检测 ; 动态复杂网络 ; 统计特征 ; NP-hard ; 分支限界 ; 线性回归
英文关键词：anomaly detection ; dynamic complex network ; statistical characteristics ; NPHard ; Branch and Bound ; Linear Regression
学位年度：2010
导师：高宏
学科代码：081202
学位授予单位：哈尔滨工业大学
论文提交日期：2010-06-01

摘要

复杂网络是具有某些共同结构特征的网络的统称。社会关系网络、文献引用网络、学术合作网络、Internet自治系统网络、Web Graph等这些来自不同领域的网络都是典型的复杂网络。复杂网络具有两个重要特性：大规模和动态性。这使得动态复杂网络的统计特征异常检测成为一个非常重要的问题。在网络管理、Web搜索引擎系统监控、复杂网络理论研究等应用中，动态复杂网络的异常检测都有着极其重要的现实意义。本文对动态复杂网络中两个问题进行了研究：1)给定复杂网络的一个时间序列，如何确认在什么时刻，网络的统计特征发生了异常; (2)给定发生异常的时刻，如何选出最能代表其网络统计特征的变化的区域。
     对于第一个问题，复杂网络时间序列的统计特征异常检测问题，我们给出了一个复杂网络统计特征的检测框架。利用在真实数据上观察到的行为（顶点数，边数的线性增长）以及复杂网络的研究成果（稠化幂律以及度数的幂律分布），我们给出了在这些统计特征之下的异常定义。利用线性回归模型，我们给出了全局特征异常检测的基本算法。在此基础上，我们提出了去除检测到的异常点，然后迭代计算线性回归参数，直到异常点的集合达到稳定的迭代式算法：iLRAD算法。真实数据集上的实验结果表明，通过较少的迭代次数，iLRAD算法就能够达到收敛。同时，iLRAD算法就能够显著的改进异常检测的效果。
     对于第二个问题，我们考虑了更一般化的版本，给定复杂网络时间上相继的两个快照，如何选出最能代表其网络统计特征的变化的区域。我们将该问题形式化的定义为一个组合优化问题：k-CR问题。以顶点统计特征为例，我们证明了该问题是一个NP-hard问题，不存在多项式时间的精确算法。我们给出了解决k-CR问题的两种近似算法：基于分数的Top-k Score算法与基于贪心策略的Greedy Remove算法。此外，我们给出求解k-CR问题最优解的分支限界算法，我们给出下界估计的策略并给出了证明。在真实与人工的数据集实验下，基于贪心的Greedy Remove算法表现出较好的实验近似比。
Complex Network is a class of networks with certain common structural features,such as social network, citation network, coauthor network, AS Graph and Web Graph.Two important features of them are dynamics and large-scale, which make statisticalanomaly detection in dynamic complex network an extremely important problem. It canbe widely used in network management, web search monitoring, complex network theoryresearch. We solve two problems in thesis: (1) Given a time series of network, identifywhen occurs anomaly in the time series; (2) Given a time of anomaly, how to selectregions which best represents the change of network;
     For the first problem, we show a framework of anomaly detection in complex network.We definite anomalies under observation on real datasets(linear growth of vertex,edges), and research results (DPL and power law degree distribution), We give ananomaly detection algorithm with linear regression model. We propose an iterative algorithm,iLRAD algorithm further. The basic idea is to remove the detected outliers, fitthe line, get new outliers sets and do it again. The process stopped until the anomaliesset is stable. Experimental results on real datasets show that em iLRAD can achieveconvergence quickly and. significantly improve efficiency of anomaly detection.
     For the second problem, we consider the general version: Given two successivesnapshot of complex network, find regions which can best represent the changes of statisticalcharacteristics. We formalize the problem into an optimization problem, the k-CRproblem. Take the vertex statistical feature for an instance, we show the problem is NPhardwhich implies no algorithm can find the optimal solution in polynomial time. Wegive two approximation algorithms to solve the k-CR problem, the Top-k Score Algorithmand the Greedy Remove algorithm. The two algorithms are both linear time. Experimentalresults on real and synthetic datasets shows the greedy remove algorithm has agood approximation ratio. We also give a branch and bound algorithm to find the optimalsolutions. We show the strategy of estimate lower bound and make a sketch of proof.

引文

1 P. Papadimitriou, A. Dasdan, H. Garcia-Molina. Web Graph Similarity for AnomalyDetection[J]. Journal of Internet Services and Applications, 2010.
    2 V. Chandola, A. Banerjee, V. Kumar. Anomaly Detection: A Survey[J]. ACM Com-puting Surveys, 2009, (09):1–72.
    3 J. M. Hellerstein. Quantitative Data Cleaning for Large Databases[J]. United NationsEconomic Commission for Europe, 2008:1–42.
    4 M. Breunig, H. Kriegel, R. Ng, et al. LOF: Identifying Density-based Local Out-liers[J]. ACM SIGMOD Record, 2000, 29(2):104.
    5 Y. Tao, X. Xiao, S. Zhou. Mining Distance-based Outliers from Large Databases inany Metric Space[C]//Proceedings of the 12th ACM SIGKDD international confer-ence on Knowledge discovery and data mining. 2006:403.
    6 X. Li, J. Han. Mining Approximate Top-k Subspace Anomalies in Multi-dimensionalTime-series Data[C]//VLDB. 2007:447–458.
    7 A. Moore, G. Cooper, R. Tsui, et al. Summary of Biosurveillance-relevant Statisticaland Data Mining Technologies, 2002.
    8 K. Imai, G. King, O. Lau. Arima : Arima Models for Time Series Data, 2007.
    9 X. Li, Z. Li, J. Han, et al. Temporal Outlier Detection in Vehicle TrafficData[C]//Proceedings of the 25th International Conference on Data Engineering(ICDE 2009). IEEE Computer Society, 2009:1319–1322.
    10 D. Chakrabarti, C. Faloutsos. Graph Mining: Laws, Generators, and Algorithms[J].ACM Computing Surveys, 2006, 38(1).
    11 P. Erdo¨s, A. Re′nyi. On the Evolution of Random Graphs[J]. Publ. Math. Inst. Hung.Acad. Sci, 1960, 5:17–61.
    12 A. Baraba′si, R. Albert. Emergence of Scaling in Random Networks[J]. Science,1999, 286(5439):509.
    13 J. Leskovec, J. Kleinberg, C. Faloutsos. Graph Over Time: DensificationLaws, Shrinking Diameters and Possible Explanations[C]//Proceedings of the 11thACM SIGKDD international conference on Knowledge discovery in data mining(KDD’05). Chicago, Illinois, USA.: New York, USA, 2005:177– 187.
    14 J. Leskovec, D. Chakrabarti, J. Kleinberg, et al. Kronecker Graphs: An Approach toModeling Networks, 2008.
    15 C. Bilgin, B. Yener. Dynamic Network Evolution: Models, Clustering, AnomalyDetection, 2008.
    16 J. Leskovec, L. Backstrom, R. Kumar, et al. Microscopic Evolution of SocialNetworks[C]//Proceeding of the 14th ACM SIGKDD international conference onKnowledge discovery and data mining. 2008:462–470.
    17 C. Cortes, D. Pregibon, C. Volinsky. Computational Methods for DynamicGraphs[J]. Journal of Computational and Graphical Statistics, 2003, 12(4):950–970.
    18 M. Berlingerio, F. Bonchi, B. Bringmann, et al. Mining Graph EvolutionRules[C]//ECML/PKDD 2009. Springer Berlin Heidelberg, 2009:115–130.
    19 J. Chan, J. Bailey, C. Leckie. Discovering Correlated Spatio-temporal Changes inEvolving Graphs[J]. Knowledge and Information Systems, 2008, 16(1):53–96.
    20 J. Chan, J. Bailey, C. Leckie. Discovering and Summarising Regions of CorrelatedSpatio-temporal Change in Evolving Graphs[C]//Proceedings of the 6th IEEE Inter-national Conference on Data Mining - Workshops (ICDMW’06). New York, USA:IEEE Computer Society, 2006:361–365.
    21 Z. Liu, J. X. Yu. Discovering Burst Areas in Fast Evolving Graphs[C]//Proceedingsof the 15th International Conference on Database Systems for Advanced Applica-tions (DASFAA’10). Tsukuba, Japan: Springer Berlin Heidelberg, 2010, 5981:171–185.
    22 H. Tong, S. Papadimitriou, P. S. Yu, et al. Proximity Tracking on Time-evolvingBipartite Graphs[C]//SDM 2009. 2008.
    23 H. Tong, Y. Sakurai, T. Eliassi-Rad, et al. Fast Mining of Complex Time-stampedEvents[C]//Proceeding of the 17th ACM conference on Information and knowledgemanagement. 2008:759–768.
    24 F. Chierichetti, R. Kumar, S. Lattanzi, et al. On Compressing Social Net-works[C]//Proceedings of the 15th ACM SIGKDD international conference onKnowledge discovery and data mining - KDD’09. New York,USA: ACM Press,2009:219.
    25 P. Boldi, S. Vigna. The Webgraph Framework I: Compression Tech-niques[C]//Proceedings of the 13th international conference on World Wide Web.2004:595–602.
    26 S. Navlakha, R. Rastogi, N. Shrivastava. Graph Summarization with Bounded Er-ror[C]//Proceedings of the 2008 ACM SIGMOD international conference on Man-agement of data - SIGMOD’08. New York, USA: ACM Press, 2008:419.
    27 Y. Tian, R. A. Hankins, J. M. Patel. Efficient Aggregation for Graph Summariza-tion[C]//Proceedings of the 2008 ACM SIGMOD international conference on Man-agement of data - SIGMOD’08. New York, USA: ACM Press, 2008:567.
    28 J. Leskovec, C. Faloutsos. Sampling from Large Graphs[C]//Proceedings of the 12thACM SIGKDD international conference on Knowledge discovery and data mining -KDD’06. New York, USA: ACM Press, 2006:631.
    29 H. Bunke, P. J. Dickinson, M. Kraetzl, et al. A Graph-theoretic Approach to Enter-prise Network Dynamics[M]. Birkhauser Boston, 2007.
    30 S. Hirose, K. Yamanishi, T. Nakata, et al. Network Anomaly Detection Based onEigen Equation Compression[C]//KDD’09: Proceedings of the 15th ACM SIGKDDinternational conference on Knowledge discovery and data mining. New York, USA:ACM Press, 2009:1185–1194.
    31 S. Shekhar, C.-T. Lu, P. Zhang. Graph-based Outlier Detection: Algorithms and Ap-plications (a Summary of Results)[C]//Proceedings of the 7th ACM SIGKDD Int’lConference on Knowledge Discovery and Data Mining (KDD 2001). San Francisco,CA, USA: ACM Press, 2001:371–376.
    32 J. Sun, H. Qu, D. Chakrabarti, et al. Neighborhood Formation and Anomaly De-tection in Bipartite Graphs[C]//Fifth IEEE International Conference on Data Mining(ICDM’05). IEEE, 2005:418–425.
    33 Z. Liu, J. X. Yu, Y. Ke, et al. Spotting Significant Changing Subgraphs in EvolvingGraphs[C]//Proceedings of the 8th IEEE International Conference on Data Mining.IEEE Computer Society, 2008:917–922.
    34 M. Mcglohon, C. Faloutsos. Anomaly Detection in Large Graphs[C]//PAKDD 2010.Hyderabad, India, 2010.
    35 P. Papadimitriou, A. Dasdan, H. Garcia-Molina. Web Graph Similarity for AnomalyDetection (poster)[C]//Proceeding of the 17th international conference on WorldWide Web - WWW’08. New York, USA: ACM Press, 2008:1167.
    36 L. Wasserman. All of Statistics: A Concise Course in Statistical Inference[M].Springer, 2003.
    37 A. Clauset, C. R. Shalizi, M. E. J. Newman. Power-law Distributions in EmpiricalData[J]. Physics, 2007:43.
    38 M. L. Goldstein, S. a. Morris, G. G. Yen. Problems with Fitting to the Power-lawDistribution[J]. The European Physical Journal B, 2004, 41(2):255–258.
    39 S.-h. Cha, S. N. Srihari. On Measuring the Distance between Histograms[J]. PatternRecognition, 2002, vol(35):1355–1370.
    40 C. H. Papadimitriou, K. Steiglitz. Combinatorial Optimization: Algorithms andComplexity[M]. Courier Dover, 1998:496.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700