Outlier detection based on neighborhood chain

Original title: 基于邻域链的数据异常点检测
Authors: 梁绍一; 韩德强
Authors (English): LIANG Shao-yi; HAN De-qiang; College of Electronic and Information Engineering, Xi'an Jiaotong University; CETC Key Laboratory of Aerospace Information Applications, China Electronics Technology Group Corporation
Keywords: data mining; outlier detection; local density; local outlier factor; Euclidean distance; neighborhood chain
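Chinese journal name: KZYC
English journal name: Control and Decision
Affiliations: College of Electronic and Information Engineering, Xi'an Jiaotong University; CETC Key Laboratory of Aerospace Information Applications, China Electronics Technology Group Corporation
Publication date: 2018-04-16 09:32
Publisher: Control and Decision (控制与决策)
Year: 2019
Issue: v.34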
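Funding: National Natural Science Foundation of China (61573275, 61671370); National 973 Program of China (2013CB329405); Science and Technology Program of Shaanxi Province (2013KJXX-46); Fundamental Research Funds for the Central Universities (xjj2016066); China Postdoctoral Science Foundation (2016M592790); University cooperation project of the CETC Key Laboratory of Aerospace Information Applications, China Electronics Technology Group Corporation (KX172600034)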
Language: Chinese
Page: KZYC201907012
Number of pages: 8
CN: 07
ISSN: 21-1124/TP
Classification number: 92-99

Abstract

Much of the research on outlier detection focuses on so-called "density-based" methods, which overcome many drawbacks of traditional outlier detection methods. However, most existing density-based methods still estimate a data point's local density from geometric distance, which leads to counter-intuitive results in certain cases. To address this problem, the traditional local density estimate is replaced by a neighborhood-chain-based one, and a new outlier detection method is designed on top of it. Experimental results show that, compared with the classic density-based method LOF (local outlier factor) and several LOF-based modifications, the proposed method separates normal data points from outliers more accurately and avoids the counter-intuitive results.
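
For orientation, the LOF baseline named in the abstract can be sketched as below. This is a minimal brute-force illustration of the classic LOF definition with Euclidean distance (reference [10]), not the neighborhood-chain method proposed by the authors, whose details this record does not give. The function name `lof_scores`, the parameter `k`, and the toy data are illustrative assumptions.

```python
import numpy as np

def lof_scores(X, k=3):
    """Classic LOF scores, brute force, Euclidean distance (illustrative sketch).

    Simplifications (assumptions of this sketch): exactly k neighbors are
    used even when distances tie, and duplicate points (zero distances)
    are not handled.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    # Indices of the k nearest neighbors of every point.
    nn = np.argsort(D, axis=1)[:, :k]
    rows = np.arange(n)
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = D[rows, nn[:, -1]]
    # reach_dist(p, o) = max(k_dist(o), d(p, o)) for each neighbor o of p.
    reach = np.maximum(k_dist[nn], D[rows[:, None], nn])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: mean density of the neighbors relative to the point's own density.
    return lrd[nn].mean(axis=1) / lrd

# Toy data: a 3x3 grid of inliers plus one distant point (index 9).
X = np.array([[i, j] for i in range(3) for j in range(3)] + [[10, 10]], dtype=float)
scores = lof_scores(X, k=3)
print(int(np.argmax(scores)))  # index of the point with the highest LOF score
```

Scores near 1 indicate a point whose local density matches that of its neighbors; scores well above 1 indicate a point in a sparser region than its neighbors. Because the density estimate here depends only on geometric distance, it is the kind of estimate the abstract describes as prone to counter-intuitive results.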

References

[1] Bolton R J, Hand D J. Statistical fraud detection: A review (with discussion)[J]. Statistical Science, 2002, 17(3): 235-255.
[2] Tang B, He H B. A local density based approach for outlier detection[J]. Neurocomputing, 2017, 241(2): 171-180.
[3] Jin W, Tung A K H, Han J. Mining top-n local outliers in large databases[C]. Proc of the 7th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2001: 293-298.
[4] Barnett V, Lewis T. Outliers in statistical data[M]. John Wiley & Sons, 1994: 335-338.
[5] Hawkins D M. Identification of outliers[M]. New York: Springer, 1980: 613-615.
[6] Zhang T, Ramakrishnan R, Livny M. BIRCH: A new data clustering algorithm and its applications[J]. Data Mining and Knowledge Discovery, 1997, 1(2): 141-182.
[7] Brito M, Chavez E, Quiroz A, et al. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection[J]. Statistics & Probability Letters, 1997, 35(1): 33-42.
[8] Knorr E M, Ng R T. A unified notion of outliers: Properties and computation[C]. Proc of the 3rd ACM Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 1997: 219-222.
[9] Knorr E M, Ng R T. Algorithms for mining distance based outliers in large datasets[C]. Proc of the 24th Int Conf on Very Large Data Bases. New York: Morgan Kaufmann Publishers Inc, 1998: 392-403.
[10] Breunig M M, Kriegel H P, Ng R T, et al. LOF: Identifying density-based local outliers[C]. Proc of ACM SIGMOD Record. Madison: ACM, 2000: 93-104.
[11] Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets[C]. Proc of the ACM Int Conf on Management of Data (SIGMOD). Dallas: ACM, 2000: 427-438.
[12] Angiulli F, Pizzuti C. Outlier mining in large high dimensional data sets[J]. IEEE Trans on Knowledge and Data Engineering, 2005, 17(2): 203-215.
[13] Liu Y M, Wen J J, Wang L J. Outlier detection based on spatio-temporal nearest neighbors and a likelihood ratio test for sensor networks[J]. J of Tsinghua University: Science and Technology, 2017, 57(11): 1196-1201. (in Chinese)
[14] Yang J W, Wang L Z, Chen H M, et al. Distance-based outlier detection over uncertain data[J]. J of Shandong University: Engineering Science, 2011, 41(4): 34-37. (in Chinese)
[15] Schubert E, Zimek A, Kriegel H P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection[J]. Data Min Knowl Discov, 2014, 28(1): 190-237.
[16] Jin W, Tung A K H, Han J, et al. Ranking outliers using symmetric neighborhood relationship[C]. Proc of the 10th Pacific-Asia Conf on Knowledge Discovery and Data Mining (PAKDD). Singapore: Springer, 2006: 577-593.
[17] Kriegel H P, Kroger P, Schubert E, et al. LoOP: Local outlier probabilities[C]. Proc of the 18th ACM Conf on Information and Knowledge Management (CIKM). Hong Kong: ACM, 2009: 1649-1652.
[18] Zhang K, Hutter M, Jin H. A new local distance based outlier detection approach for scattered real-world data[C]. Proc of the 13th Pacific-Asia Conf on Knowledge Discovery and Data Mining. Berlin: Springer, 2009: 813-822.
[19] Yang M L, Lu Y S. Outlier mining in mass data based on pruning algorithm[J]. Computer Science, 2012, 39(10): 152-156. (in Chinese)
[20] Campos G O, Zimek A, Sander J, et al. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study[J]. Data Min Knowl Discov, 2016, 30(4): 891-927.
[21] Schubert E, Wojdanowski R, Zimek A. On evaluation of outlier rankings and outlier scores[C]. Proc of the 2012 SIAM Int Conf on Data Mining. Anaheim: SIAM, 2012: 1047-1058.
[22] Liang S Y, Han D Q, Zhang L, et al. A novel clustering oriented closeness measure based on neighborhood chain[C]. Proc of the 2017 Int Joint Conf on Neural Networks (IJCNN). Anchorage: IEEE, 2017: 997-1004.
[23] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798-1828.