基于蚁群算法的离群点挖掘算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
离群点挖掘随着数据挖掘的发展引起了广泛关注。通过对国内外离群点挖掘算法的研究情况分析可知,以往的离群点挖掘算法还存在诸多问题,例如用户定义的阈值往往直接影响着挖掘的结果;考查多变量之间的相似性来挖掘时序离群点的算法仍较少,或精确度较低。针对这些问题,本文主要研究了基于蚁群算法的离群点挖掘方法。
     首先,提出了一种在对蚁群构图进行切割的基础上挖掘离群点的算法。该算法在第一阶段对传统的蚁群算法进行改进,将不同属性数据之间的距离和分布情况纳入转移概率的计算之中,从而构建最优的图像。然后在一定的图像切割准则下对图像进行切割,最后通过计算各个簇,即切割图像后形成的各子图之间的差异以及同一簇中数据点之间的差异来找到top n离群点。
     其次,提出了一种基于改进的蚁群k-means聚类算法的多变量时序离群点挖掘算法。该算法把蚁群算法特有的信息素和转移概率引入对数据聚类的过程中,通过计算类内距离和类间距离找到符合聚类标准的最好聚类结果,然后通过查看各数据点在不同簇中的时刻点分布情况,以邻居相似性为标准计算各点的离群系数,从而实现时序离群点的挖掘。
     最后,在真实和合成数据集上对提出的两种算法进行了验证。实验结果表明,提出的算法在对离群点的检测精度上要明显优于其他同类算法,实现了预期的研究目标。
Outlier detection has obtained the widespread concern with the increasing application of data mining. We can know that there are still some problems in the area of outlier detection by analyzing the existed algorithms of domestic and international. The user-defined thresholds often affect the result of outlier detection to a large extent and it is such a difficult task to define them for users who don’t have previous knowledge. Few algorithms mine outliers on the basis of similarity of neighbors when mining outliers on multivariate time series data, but obviously this is a very promising area. There is still another problem: the precision is still low in outlier detection. To solve these problems we mainly study the methods of outlier detection based on ant colony algorithm.
     Firstly, a method of graph-cut based outlier detection using ant colony algorithm is proposed. In the first phase we make some improvements to the traditional ant colony algorithm, for example, to calculate the more accurate transition probability, we take the distance on categorical and numerical attributes as well as the local distribution situation into consideration; in this way a better graph can be constructed in limited time. Then we cut on the graph under a criterion and every subgraph can be treated as a cluster consequently. At the last step we can get top n outliers by comparing the difference of clusters and the difference of points that lie in the same cluster.
     Secondly, a method of outlier detection on multivariate time series data based on improved ant colony algorithm and k-means clustering algorithm is proposed. In the process of classifying the time series data we introduce the concepts of pheromone and transition probability of the ant colony algorithm into the k-means clustering algorithm, and then try to find the best clustering result which is determined by the distance of inner-clusters and inter-clusters. At the last step we calculate the outlier score of every data point by checking the distribution of the time points that lie in clusters under the criterion of neighbor-similarity.
     In the end we test the two algorithms proposed in this paper on both of real-world datasets and synthetic datasets. The experimental results of the proposed algorithms show the superiority on the precision of outlier detection than other existed algorithms. And the expected goal is achieved.
引文
1 J. Han, M. Kamber著,范明,孟小峰等译.数据挖掘概念与技术.机械工业出版社.北京, 2001:1-268
    2赵泽茂,何坤金,陈鹏. Web日志文件的异常数据挖掘算法及其应用.计算机工程, 2003, 29(17):l95-197
    3 R. Gwadera, M. J. Atallah, W. Szpankowski. Reliable Detection of Episodes in Event Sequences. Knowledge and Information Systems, 2005, 7(4):415-437
    4 B. Boucheham. Matching of Quasi-periodic Time Series Patterns by Exchange of Block Sorting Signatures. Pattern Recognition Letters, 2008, 29(4):501-514
    5 S. Shekhar, C. T. Lu, P. Zhang. A Unified Approach to Spatial Outliers Detection. GeoInformatica, 2003, 7(2):139-166
    6 H. P. Kriegel, M. Schubert, A. Zimek. Angle-based Outlier Detection in High-dimensional Data. Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 2008:444-452
    7 Z. He, X. Xu, J. Huang, et al. FP-Outlier: Frequent Pattern-based Outlier Detection. Computer Science and Information Systems ComSIS, 2005:103-118
    8周晓云,孙志挥,张柏礼等.高维类别属性数据流离群点快速检测算法.软件学报, 2007, 18(4):933-942
    9魏藜,宫学庆,钱卫宁等.高维空间中的离群点发现.软件学报, 2002, 13(2):280-291
    10鞠可一,周德群,张玉强.高维离群检测算法及其应用.系统工程, 2008, 26(11):116-122
    11 W. Hung, M. S. Yang. An Omission Approach for Detecting Outliers in Fuzzy Regression Models. Fuzzy Sets and Systems, 2006, 157(23):3109-3122
    12 S.J. Han, S. B. Cho. Evolutionary Neural Networks for Anomaly Detection based on the Behavior of a Program. IEEE Transactions on Systems, Man and Cybernetics,2006, 36(3):559-570
    13 C. C. Aggarwal, P. S. Yu. An Effective and Efficient Algorithm for High-dimensional Outlier Detection. The International Journal on Very Large Data Bases, 2005, 14(2):211-221
    14 C. C. Aggarwal, P. S. Yu. Outlier Detection for High Dimensional Data. Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, 2001, 30(2):37-46
    15 Y. Mao, L. Xue, M. E. Orlowska. Projected Outlier Detection in High-dimensional Mixed-attributes Data Set. Expert Systems with Applications, 2009, 36(3):7104-7113
    16魏藜,钱卫宁,周傲英. SLOT:基于估计的高效子空间局部离群点发现.第十九届全国数据库学术会议计算机科学, 2002, 29(增刊A)(08):122-125
    17 J. Haslett, R. Brandley, P. Craig, A. Unwin. Dynamic Graphics for Exploring Spatial Data with Application to Locating Global and Local Anomalies. The American Statistician, 1991, 45(3):234-242
    18 S. Shekhar, S. Chawla. Spatial databases: A Tour. Prentice Hall, 2003:1-262
    19 S. Shekhar, C. T. Lu, P. Zhang. Detecting Graph-based Spatial Outlier: Algorithms and Applications(A Summary of Results). Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2001:371-376
    20 S. Shekhar, C. T. Lu, P. Zhang. Detecting Graph-based Spatial Outliers. International Journal of Intelligent Data Analysis(IDA), 2002,6(5):451-468
    21 C. Sanjay, P. Sun. SLOM: A New Measure for Local Spatial Outliers. Knowledge and Information Systems, 2006, 9(4):412-429
    22 J. Takeuchi, K. Yamanishi. A Unifying Framework for Detecting Outliers and Change Points from Time Series. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(4):482-492
    23 A. Zaharim, R. Rajali, R. M. Atok, et al. A Study on the Nature of an Additive Outlier in ARMA(1,1) Models. European Journal of Scientific Research, 2009, 32(3):362-368
    24 A. Grané, H. Veiga. Wavelet-based Detection of Outliers in Financial Time Series.Computational Statistics and Data Analysis, 2010, 54(11):2580-2593
    25 Kokyo Choy. Outlier detection for stationary time series. Journal of Statistical Planning and Inference, 2001, 99(2):111-127
    26翁小清,沈钧毅.基于滑动窗口的多变量时间序列异常数据的挖掘.计算机工程, 2007, 33(12):102-104
    27黄洪宇,林甲祥,陈崇成等.离群数据挖掘综述.计算机应用研究, 2006, 8(13):8-13
    28 P. N. Tan, S. Michael, K. Vipin. Introduction to Data Mining. Addison-Wesley, New York, 2005:1-769
    29 D. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980:1-188
    30 E. M. Knorr, R. T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. Proc. of the 24th International Conference on Very Large Data Bases, New York, USA, 1998:392-403
    31 M. M. Breunig, H. P. Kriegel, R. T. Ng, et al. LOF: Identifying Density-based Local Outliers. Proc. of ACM SIG MOD International Conference on Management of Data (SIG MOD’00), Dallas, Texas, USA, 2000:93-104
    32徐翔,刘建伟,罗雄麟.离群点挖掘研究.计算机应用研究, 2009, 26(1):34-40
    33 V. Barnett, T. Lewis. Outliers in Statistical Data(3rd edition). John Wiley and Sons, New York, 1994:1-584
    34 Y. Zhang, S. Yang, Y. Wang. LDBOD: A Novel Local Distribution based Outlier Detector. Pattern Recognition Letters, 2008, 29(7):967-976
    35 T. Johnson, I. Kwok, R. Ng. Fast Computation of 2-dimensional Depth Contours. Proc. of KDD’98, New York, USA, 1998:224-228
    36 I. Ruts, P. Rousseeuw. Computing Depth Contours of Bivariate Point Clouds. Journal of Computational Statistics and Data Analysis, 1996, 40(23):153-168
    37 E. M. Knorr, R. T. Ng. A Unified Notion of Outliers: Properties and Computation. Proc. of 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, 1997:219-222
    38 S. Ramaswamy, R. Rastogi, K. Shim. Efficient Algorithms for Mining Outliners fromLarge Datasets. Proc. of the ACM SIGMOD Conference, Dallas, Texas, USA, 2000, 29(2):427-438
    39 D. Ren, I. Rahal, W. Perrizo, et al. A Vertical Distance-based Outlier Detection Method with Local Pruning. Proc. of the 13th ACM International Conference on Information and Knowledge Management(CIKM), Washington DC, USA, 2004:279-284
    40 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, et al. LOCI: Fast Outlier Detection Using the Local Correlation Integral. Proc. of the 19th International Conference on Data Engineering(ICDE’03), Bangalore, India, 2003:315-326
    41 J. Wen, A. K. H. Tung, J. Han, et al. Ranking Outliers Using Symmetric Neighborhood Relationship. PAKDD, Singapore, 2006:577-593
    42赵科平,周水庚,关佶红等.一种新的离群数据对象发现方法.中国人工智能学会第10届全国学术年会论文集,广州,中国, 2003:470-475
    43 B. Yu, M. Song, L. Wang. Local Isolation Coefficient-based Outlier Mining Algorithm. 2009 International Conference on Information Technology and Computer Science, Kiev, Ukraine, 2009, 2:448-451
    44 M. Ester, H. P. Kriegel, J. Sander, et al. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, USA, 1996:226-231
    45 Z. He, X. Xu, S. Deng. Discovering Cluster-based Local 0utliers. Pattern Recognition Letters, 2003, 24(9210):1641-1650
    46 M. F. Jiang, S. S. Tseng, C. M. Su. Two-phase Clustering Process for Outlier Detection. Pattern Recognition Letters, 2001, 22(627):691-700
    47 A. Arning, R. Agrawal, P. Raghavan. A Linear Method for Deviation Detection in Large Databases. Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, USA, 1996:164-169
    48 S. Sarawagi, R. Agrawal, N. Megiddo. Discovery-driven Exploration of OLAP Data Cubes. Proc. of the 6th International Conference on Extending DatabaseTechnology(EDBT’98), Valencia, Spain, 1998, 1377/1998:168-182
    49 Y. Shi. SubCOID: An Attempt to Explore Cluster-Outlier Iterative Detection Approach to Multi-Dimensional Data Analysis in Subspace. Proc. of the 46th Annual Southeast Regional Conference, Auburn, AL, USA, 2008:132-135
    50 C. Zhong, X. Lin, M. Zhang. A Local Outlier Detection Approach Based on Graph-Cut. 2009 International Joint Conference on Computational Sciences and Optimization, Sanya, Hainan, China, 2009, 1:714-718
    51 Z. Hajihashemi, B. Minaei. Improving Noise Clustering Algorithm Using Ant Colony Optimization. Proc. of the 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 2008:1077-1080
    52 Y. Li, D. Wu, J. Ren, et al. An improvd Outlier Detection Method in High-dimension based on Weighted Hypergraph. 2009 Second International Symposium on Electronic Commerce and Security, Nanchang, Jiangxi, China, 2009:159-163
    53 M. Dorigo, L. M. Gambardella. Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem. IEEE transactions on Evolutionary Computation, 1997, 1(1):53-66
    54 R. A. Weekley, R. K. Goodrich, L. B. Cornman. An Algorithm for Classification and Outlier Detection of Time-Series Data. Journal of Atmospheric and Oceanic Technology, 2010, 27(1):94-107
    55 X. Weng, J. Shen. Detecting Outlier Samples in Multivariate Time Series Dataset. Knowledge-Based Systems, 2008, 21(8):807-812
    56 X. Li, Z. Li, J. Han, et al. Temporal Outlier Detection in Vehicle Traffic Data. Proc. of the 2009 International Conference on Data Engineering (ICDE'09), Shanghai ,China, 2009:1319-1322
    57 K. Yang, C. Shahabi. A PCA-based Similarity Measure for Multivariate Time Series. Proc. of the 2nd ACM International Workshop on Multimedia Databases, Washington DC, USA, 2004:65-74
    58 T. Kanungo, D. M. Mount, N. S. Netanyahu, et al. An Efficient k-means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis andMachine Intelligence, 2002, 24(7):881-892
    59 V. Hautamaki, I. Karkkainen, P. Franti. Outlier Detection using k-nearest Neighbour Graph. Proc. of 17th International Conference on Pattern Recognition(ICPR’04), Cambridge, UK, 2004, 3:430-433

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700