Outlier Detection Based on NMF and a Similarity Function
Abstract
Outlier detection is an important research area in data mining, widely applied in credit card fraud detection, network intrusion detection, and similar tasks. Common outlier detection algorithms include statistics-based, distance-based, density-based, and deviation-based methods. This paper reviews these methods, together with outlier detection algorithms for high-dimensional data and for data streams. It also reviews the clustering algorithms commonly used in data mining, namely hierarchical, density-based, partition-based, model-based, and grid-based algorithms, and analyzes and compares swarm-based classification algorithms, particle-based clustering algorithms, and fuzzy clustering algorithms built on fuzzy partitions.
     Drawing on hierarchical clustering and the similarity principle, the paper gives a similarity measure function for high-dimensional data and the concept of class density, defines outliers in high-dimensional data in terms of class density, and on that basis proposes a new outlier detection algorithm. The algorithm is simple in principle and easy to implement. Experiments show that it is effective but has a drawback: its running time grows as the dimensionality of the data increases. To address this, the paper combines non-negative matrix factorization (NMF) with the similarity-measure-based algorithm to obtain an outlier detection algorithm based on NMF and the similarity function. The idea is to first reduce the dimensionality of the high-dimensional data with NMF and then apply the similarity-measure-based outlier detection algorithm to the reduced data. On high-dimensional data the combined algorithm runs faster than the similarity-measure-based algorithm alone, which also makes feature extraction from high-dimensional data easier, and the experimental results indicate that the algorithm is effective.
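The two-stage idea described in the abstract (reduce the data with NMF, then score points by similarity) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the similarity function or the class-density definition, so the sketch substitutes 1 / (1 + Euclidean distance) as the similarity and the mean similarity to the k most similar points as a density-like quantity. The function names `nmf`, `similarity_outlier_scores`, and `detect_outliers`, and the parameters `rank`, `k`, and `n_outliers`, are hypothetical. The factorization uses the standard Lee-Seung multiplicative updates for the squared-error objective.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n_samples x n_features) as W @ H
    using Lee-Seung multiplicative updates for the squared-error objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def similarity_outlier_scores(X, k=5):
    """Score each row of X as 1 minus its mean similarity to its k most
    similar rows. ASSUMPTION: similarity = 1 / (1 + Euclidean distance);
    this is a stand-in, not the similarity function defined in the thesis."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    sim = 1.0 / (1.0 + dist)
    np.fill_diagonal(sim, -np.inf)          # a point is not its own neighbour
    top_k = np.sort(sim, axis=1)[:, -k:]    # k largest similarities per row
    return 1.0 - top_k.mean(axis=1)

def detect_outliers(V, rank=3, k=5, n_outliers=1):
    """Reduce V to `rank` dimensions with NMF, then rank the rows of the
    reduced representation W by outlier score (highest first)."""
    W, _ = nmf(V, rank)
    scores = similarity_outlier_scores(W, k)
    return np.argsort(scores)[::-1][:n_outliers], scores
```

Here the rows of `W` serve as the low-dimensional representation, so the pairwise similarity step costs O(n² · rank) rather than O(n² · d) for d original features, which is the kind of saving the abstract attributes to reducing dimensionality first. The input must be non-negative, and `k` must be smaller than the number of rows.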
