基于核密度估计的空间聚类算法研究以及改进

英文题名：The Research of Spatial Clustering Algorithm Based on Kernel Density Estimation and Improvement
作者：熊祖光
论文级别：硕士
学科专业名称：计算数学
中文关键词：基于密度的聚类分析 ; 核密度估计 ; 簇边缘提取 ; 自适应生成簇
学位年度：2008
导师：马驷良
学科代码：070102
学位授予单位：吉林大学
论文提交日期：2008-04-01

摘要

聚类分析是数据挖掘中的一项重要的研究内容。本文在介绍基于密度聚类的经典方法理论的基础上,分析了基于和函数的密度聚类经典方法(DBSCAN,OPTICS,DENCLUE)的优点和缺点。通过分析已有的改进方法的基本思想。提出了一种通过偏导阈值对簇的边缘部分进行提取的方法,对经典的基于密度聚类的方法对于处理具有变化密度的数据集的不足进行了改进。结果表明,该方法能够有效地处理密度变化较大的数据集。
Data mining is a process of acknowledging useful information and knowl-edge,that come from incomplete, noisy, fuzzy, and the practical application ofstochastic, which is implicit in the extraction, people do not know in advance,but probably is potentially.Data mining is a multi-interdisciplinary research ar-eas,It combines database technology, and artificial intelligence technology,machine learning, statistics, knowledge engineering, object-oriented methods,information retrieval, high-performance computing and data visualization, suchas the results of research on the latest technology.
     Data Mining is an application-oriented subjects. It is not only a specifictype of simple retrieval database queries or call, but also on these data micro, ormacro-statistics, analysis, synthesis and reasoning to guide practical problemsto solve, to discover the links among things, and even make use of the existingdata in the future to predict trends. For enterprises, data mining can help toidentify trends in business development, reveals the fact that the known past, thelocation of forecast results, and help enterprises to accomplish a task analysis ofthe key factors needed to achieve increased revenues, lower costs purposes.
     Clutster can be used to mark. The objects regroup according to the maxi-mize similarity in the same category, and the smallest similarity between differ-ent categories. The cluster of objects can be formed as following: the objects inthe same cluster with high similarity, and the objects in different clusters are verydifferent. Each formed cluster can be seen as an object type, it can export therules. Clustering also facilitates the preparation of classification, similar thingsare organized together. The difference is that with the classification, clusteringoperation of the class is advanced unknown, the formation of entirely cluster isdata-driven, and clustering is a non-guided learning methods.
     Clustering technology can be roughly divided into methods based on the division, based on the hierarchy, based on the grid, based on the models andbased on the density.
     The method based on the density is to check out that whether the dataare linked to the density of domains, and the domain is connected density dataobjects classified as a category. Such method has the advantage of being able tofind clusters of arbitrary shape, and is not sensitive to noise.
     In 1996, Martin.Ester and Hans-Peter.Kriegel proposed the density-basedspatial clustering algorithm with noise,DBSCAN. The core idea of DBSCAN iseach cluster to the point of the radius neighborhood contained points no less thana minimum value,as that the density of the neighborhood to be more than a setthreshold. The algorithm can be effectively found a cluster of arbitrary shape,not sensitive to noise, and has a relatively simple programming, but is inade-quation, because of the parameters for handling sensitive and high-dimensionaldata space in the data rather weak.
     OPTICS algorithm up through the introduction of the concept of reacha-bility distance, re-ordered the point in accordance with the priority order of theadjacent point in DBSCAN algorithm. OPTICS algorithm ordered each pointwith a distance of up to the point ascending sequence memory to be expanded tothe point, in order to better space on the densely populated region in the direc-tion of positioning, and after the treatment process to this direction expansion.OPTICS algorithm reduces DBSCAN algorithm parameters for the selected sen-sitivity, and maintaining the computational complexity of the DBSCAN.
     DENCLUE is a generalization based on the estimate of the kernel densityclustering algorithm.By selecting different effects on the function, the algorithmcan be summarized as DBSCAN algorithm, and based on the combined distancemethods, the algorithm for clustering in the centre of the cluster, can generatesimilar to the K-means clustering. This is a re?ection of the ?exibility algo-rithm(DENCLUE).
     DENCLUE algorithm for the basic idea is: for each data point in the dataspace, there are certain other regions of the impact of this in?uence to use thatin?uence function. The algorithm strike by the overall density function, andthrough the threshold value to determine the density of the cluster core, therebyforming clusters of these points. The algorithm has the advantage lies in its solidmathematical foundation, and the selection of ?exible, capable of producingsimilar to other methods of clustering results for a large number of”noise”datasets can yield better results clustering, and the algorithm in a high-dimensionalspace can form the shape of an arbitrary cluster. However,as the algorithm be-cause of the need for the space density function, the computational complexityis higher than that of DBSCAN algorithm, and because of the density thresholdselection subjectivity, the algorithm changes in the density of data sets largercluster, and the results is of instability.
     Based on the DBSCAN algorithm and improved OPTICS algorithm anal-ysis, for DENCLUE algorithm the inadequacy of the proposed density func-tion,this paper proposes a new algorithm through setting through a partial deriva-tive threshold rather than density threshold, to discover the outline of clusters,thereby reducing the traditional DENCLUE algorithm, and reduce the resultsof the cluster density threshold selection sensitivity. The improved algorithmfor the main idea is to identify the dramatic changes in density function region,and to identify clusters of the border, that is, as long as somewhere in the spacedensity function changes to a certain extent, it can be considered for a regionalcluster of border , regardless of whether it has been in existence in a cluster in-ternal. Therefore the cluster algorithm for handling larger changes in the densityof data sets and data space for the regional density in the uneven distribution ofdata sets give a good results. The method especially for space density in the re-gion span larger clusters and the discovery of the cluster is particularly effective.

引文

[1] Han J,Kamber M.Data Mining:Concept and Techniques [M].MorganKaufmann Publishers,2000.
    [2] Ester M,Kriegel H P,Sander J,A density-based algorithm for discoveringclusters in large spatial databases with noise [A].In Proc. of the 2nd Inter-national Conference on Knowledge Discovery and Data mining [C],AAAIPress,1996,226-231.
    [3] Hinneburg A,Keim D A.An efficient approach to clustering in large mul-timedia databases with noise [A].In Proc. of the 4th International Confer-ence on Knowledge Discovery and Data mining [C],AAAI Press,1998,58-65.
    [4] Ankerst M,Breuing M M,Kriegel H P.OPTICS:ordering points to identifythe clustering structure [A].In Proc. Of the 1999 ACM SIGMOD Interna-tional Conference on Management of Data [C],ACM Press,1999,49-60.
    [5] George K,Han E H,Kumar V,CHAMELEON:a hierarchical clustering al-gorithm using dynamic modeling [J],IEEE computer,1999,27(3):329-341.
    [6] Guha S,Rastogi R,Shim K,CURE:an efficient clustering algorithm forlarge databases [A],In Proc. of the 1998 ACM SIGMOD International Con-ference on Management of Data [C],ACM Press,1998,73-84.
    [7] Ryszard S. Michalski, Ivan Bratko, Miroslav Kubat,Machine learningand data mining methods and applications [A], 北京电子工业出版社2004.
    [8] Ian H.Witten, Eibe Frank,Data mining practical machine learning tools andtechniques, 北京机械工业出版社2006.
    [9] 数据挖掘原理与算法(第二版),毛国君,段丽娟,王实,石云,清华大学出版社,2007,12.
    [10] 模式识别 ( 第二版 ) , 边肇祺,张学工等 , 清华大学出版社,2000,1
    [11] 数据挖掘导论,Tan P N,Steinbach M,Kumar V,Post and TelecomPress,2006.
    [12] 屏蔽了输入参数敏感性的DBSCAN改进算法,蔡颖琨,谢昆青,马修军,北京大学学报(自然科学版),第40卷,第3期,2004.5.
    [13] 周水庚,范晔,周傲英,基于数据区分的DBSCAN算法,计算机研究与发展,2000,37(10):1153-1159.
    [14] 周傲英,周水庚,范晔等,Approaches for Scaling DBSCAN Al-gorithm to Large Spatial Databases.Journal of Computer Science andTechnology,2000,15(06):509-526.
    [15] 淦文燕,李德毅,基于核密度估计的层次聚类算法,系统仿真学报,2004.2,Vol.16 No.2,302-305,309.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700