Abstract
Much of the research on outlier detection focuses on so-called "density-based" methods, which overcome many drawbacks of traditional outlier detection methods. However, most existing density-based methods still estimate a data point's local density from geometric distance, which leads to counterintuitive results in certain cases. To address this problem, the traditional local density estimation is replaced by a neighborhood-chain-based method, and a new outlier detection method is designed on that basis. Experimental results show that, compared with the classical density-based method LOF (local outlier factor) and several LOF-based modifications, the proposed method distinguishes normal data points from outliers more accurately and avoids counterintuitive results.