Dimensional Analysis of Clustering Algorithms
Abstract
This thesis studies the characteristics and requirements of cluster analysis under different dimensionality conditions, together with the corresponding strategies and solutions. New algorithms are proposed for three problems: low-dimensional clustering, high-dimensional full-space clustering, and high-dimensional subspace clustering.
     For low-dimensional clustering, the GDMS algorithm is proposed. Its main contributions are: (1) a probe-window filtering method that detects the distribution characteristics of the data; different filter functions yield clusters of different densities and attributes, and different probe movement patterns reconcile accuracy with efficiency. (2) A new mathematical morphology operator for extracting clusters, more accurate than the opening and closing operators used previously. (3) Scale-space theory is combined with morphology, so the clustering result is a multi-scale, hierarchical structure. (4) The algorithm supports clustering in the presence of obstacles. Its characteristics are: computational complexity linear in the data size; discovery of clusters of arbitrary shape; insensitivity to noise; a degree of adaptability to the grid size; the ability to distinguish clusters of different densities and of specific attributes; and a hierarchical result that aids user understanding and interpretation.
     For high-dimensional full-space clustering, the MDCLUS, IMDCLUS, and PMDCLUS algorithms are proposed to increase clustering speed. (1) A Monte Carlo method acquires core objects, reducing the computational cost; a minimum estimate of the sampling rate is derived quantitatively to avoid losing small clusters or breaking up large ones; and a label-hashing method merges clusters at a cost linear in the data size. (2) Incremental clustering is implemented. (3) Distributed parallel processing is implemented. Characteristics: discovery of clusters of arbitrary shape; insensitivity to noise; a marked speedup over DBSCAN; computational cost linear in the dimensionality; simultaneous distributed clustering on multiple machines in a local network; and incremental clustering that is far faster than re-clustering from scratch.
     For high-dimensional subspace clustering, algorithms based on active spaces and active grids are proposed. The main contributions are: (1) proofs that the density, connectivity, and coverage of cluster regions are all downward closed; (2) a top-down search method; (3) a noise filter based on the number of active axes; (4) an extension from fixed grid size to adaptive grid size; (5) distributed parallel clustering; (6) a hierarchical tree structure for organizing clustered subspaces and clusters. The main characteristics are: discovery of both full-space and subspace clusters; computational cost approximately linear in the number of objects, the dimensionality of the data space, and the dimensionality of the clusters; strong noise resistance; distributed clustering across multiple machines; results that are easy to understand and interpret; and discovery of both disjoint and overlapping clusters.
In this paper, the characteristics and requirements of clustering algorithms at different data dimensionalities are analyzed, and we present our solutions and algorithms for low-dimensional clustering, high-dimensional clustering, and subspace clustering.
     We design an algorithm called GDMS for low-dimensional clustering. A detecting window is introduced to discover the distribution of the data set, and a filter-function framework distinguishes clusters of different densities or attributes. Different movement styles of the window, and combinations of them, speed up the algorithm while maintaining high accuracy. A new mathematical morphology operator is introduced to extract clusters, and it is more accurate than the ordinary opening and closing operators. Scale-space theory is integrated with mathematical morphology to obtain multi-scale, hierarchical clusters. In addition, GDMS is extended to support clustering in the presence of obstacles. The advantages of GDMS are: it is very efficient, with O(N) complexity; it discovers clusters of arbitrary shape; it distinguishes clusters of different densities or attributes; it is insensitive to noise and to the grid size; and its hierarchical result is easy to understand and interpret.
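The morphological step can be illustrated with the ordinary closing operator that GDMS improves on: rasterize the points onto an occupancy grid, close small gaps by dilation followed by erosion, and read off each connected region as a cluster. The following is a minimal sketch; the function names and the square structuring element are illustrative assumptions, not the GDMS operator itself:

```python
import numpy as np

def closing(grid, k=1):
    """Binary morphological closing: dilation followed by erosion
    with a (2k+1) x (2k+1) square structuring element."""
    def dilate(g):
        out = np.zeros_like(g)
        for dx in range(-k, k + 1):
            for dy in range(-k, k + 1):
                out |= np.roll(np.roll(g, dx, 0), dy, 1)
        return out
    def erode(g):
        out = np.ones_like(g)
        for dx in range(-k, k + 1):
            for dy in range(-k, k + 1):
                out &= np.roll(np.roll(g, dx, 0), dy, 1)
        return out
    return erode(dilate(grid))

def label_clusters(grid):
    """4-connected component labelling: each connected region of
    occupied cells receives a distinct positive integer label."""
    labels = np.zeros(grid.shape, dtype=int)
    current = 0
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            if grid[i, j] and labels[i, j] == 0:
                current += 1
                stack = [(i, j)]
                while stack:
                    x, y = stack.pop()
                    if (0 <= x < grid.shape[0] and 0 <= y < grid.shape[1]
                            and grid[x, y] and labels[x, y] == 0):
                        labels[x, y] = current
                        stack += [(x + 1, y), (x - 1, y),
                                  (x, y + 1), (x, y - 1)]
    return labels, current
```

Closing fills a one-cell hole inside a dense region without bridging the gap between two well-separated regions, so the subsequent labelling still reports two clusters.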
     We design three algorithms, MDCLUS, IMDCLUS, and PMDCLUS, to speed up high-dimensional clustering. A Monte Carlo method is adopted to acquire core objects, reducing the computational cost, and the minimum sampling ratio is estimated quantitatively to avoid losing small clusters or breaking up large ones. A label-hash method merges clusters in time linear in the number of objects. We also implement incremental clustering and distributed parallel clustering. The advantages of our approach are: it discovers clusters of arbitrary shape; it is insensitive to noise; its complexity is linear in the data dimensionality; and it is faster than the DBSCAN algorithm.
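The sampling-and-merging idea can be sketched as follows: draw a random sample, verify each sampled point's density against the full data set to find core objects, and merge core objects lying within `eps` of each other by resolving label equivalences held in a hash map. This is a simplified stand-in for MDCLUS, not the thesis algorithm; `mc_cluster` and its parameters are illustrative assumptions:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mc_cluster(points, eps, min_pts, sample_ratio, seed=0):
    """Sketch of Monte Carlo density clustering: sample candidate
    core objects, verify density against the full data set, then
    merge core objects via a hash map of label equivalences."""
    rng = random.Random(seed)
    n = len(points)
    sample = rng.sample(range(n), max(1, int(sample_ratio * n)))
    # Core test runs against the *full* data set so that density
    # is not under-counted on the sample.
    cores = [i for i in sample
             if sum(dist(points[i], p) <= eps for p in points) >= min_pts]
    # Provisional labels; equivalences discovered when two core
    # objects fall within eps are resolved through the hash map.
    label = {c: k for k, c in enumerate(cores)}
    parent = {k: k for k in range(len(cores))}
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k
    for i, a in enumerate(cores):
        for b in cores[i + 1:]:
            if dist(points[a], points[b]) <= eps:
                parent[find(label[a])] = find(label[b])
    # Assign each point to its nearest core object's cluster,
    # or to -1 (noise) if no core object is within eps.
    out = []
    for p in points:
        c = min(cores, key=lambda i: dist(p, points[i]))
        out.append(find(label[c]) if dist(p, points[c]) <= eps else -1)
    return out
```

Only the sampled points undergo the expensive core test, which is where the speedup over DBSCAN-style full scans comes from; a sampling ratio that is too small risks exactly the lost small clusters and broken large clusters that the thesis bounds against.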
     We design active-space and active-grid algorithms for subspace clustering. We prove the downward closure of density, connectivity, and coverage, and introduce a top-down search method that prunes the search space. A new filter based on the number of active axes removes noise objects, the fixed-grid algorithm is extended to an adaptive grid for efficiency, and a distributed parallel version handles large, high-dimensional data sets. A hierarchical method organizes the clustering result. The advantages of our approach are: it discovers clusters both in the full space and in subspaces; its computational complexity is approximately linear in the number of objects, the space dimensionality, and the cluster dimensionality; it is insensitive to noise; its result is easy to understand and interpret; and it finds both disjoint and overlapping clusters.
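The downward closure property behind the pruning can be demonstrated on a grid: if a cell is dense in some subspace, its projection onto every lower-dimensional subspace must also be dense, so a search may discard any subspace whose projections fail the density test. A minimal sketch under assumed names (`dense_cells` and `check_downward_closure` are illustrative, not the thesis implementation):

```python
from collections import Counter
from itertools import combinations

def dense_cells(points, dims, width, threshold):
    """Dense grid cells of `points` projected onto the subspace
    `dims`: a cell is dense if it holds >= `threshold` points."""
    counts = Counter(tuple(int(p[d] // width) for d in dims)
                     for p in points)
    return {cell for cell, c in counts.items() if c >= threshold}

def check_downward_closure(points, dims, width, threshold):
    """Verify the pruning property: every cell dense in subspace
    `dims` projects to a dense cell in each proper sub-subspace."""
    dense = dense_cells(points, dims, width, threshold)
    for k in range(1, len(dims)):
        for sub in combinations(range(len(dims)), k):
            sub_dims = [dims[i] for i in sub]
            sub_dense = dense_cells(points, sub_dims, width, threshold)
            for cell in dense:
                if tuple(cell[i] for i in sub) not in sub_dense:
                    return False
    return True
```

The property holds because projecting a cell can only gather more points into it, never fewer; this is the same monotonicity that Apriori-style bottom-up methods exploit, and the top-down search uses its contrapositive to discard subspaces early.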
