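Research on New Methods for Cluster Analysis in Distributed Environments

English title: New Methods for Cluster Analysis in Distributed Environments
Author: Li Cheng'an (李成安)
Thesis level: Doctoral
Discipline: Control Science and Engineering
Keywords (translated from the Chinese): data mining ; cluster analysis ; distributed computing ; distributed clustering ; ensemble learning ; mobile agent ; hierarchical optimization ; collaboration ; time series
English keywords: Data mining ; distributed computing ; distributed clustering ; ensemble learning ; mobile agent ; hierarchical optimization ; collaboration ; time-series
Degree year: 2006
Advisor: Wu Tiejun (吴铁军)
Discipline code: 081101
Degree-granting institution: Zhejiang University (浙江大学)
Thesis submission date: 2006-07-01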

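Abstract (translated from the Chinese)

With the rapid development of computing and storage technology, people have accumulated large amounts of historical data and urgently need to turn it into knowledge. Cluster analysis, built on the simple idea that "like attracts like", partitions a set of physical or abstract objects into classes of similar objects. It has been studied extensively in data mining and applied successfully in many fields.
     In recent years databases have kept growing in size and have become ever more widely distributed. Most existing clustering methods must load all the data into memory at once and consume a great deal of computation time, so they cannot meet the needs of knowledge extraction from massive, distributed data. Clustering in distributed environments is therefore a challenging frontier topic in cluster analysis. This dissertation addresses that topic. Taking large-scale data sets held in distributed storage as its object of study, it combines machine learning, artificial intelligence and hierarchical optimization with distributed computing to explore new clustering techniques for distributed environments, providing a theoretical and technical basis for the efficient and sensible use of large-scale distributed data.
     The main research content and contributions are as follows:
     1. Cluster analysis in distributed environments is reviewed and summarized fairly comprehensively and systematically in terms of its background, algorithm research and application research.
     2. To make distributed clustering easy to implement, and exploiting the ease of implementing weak clustering algorithms, a Boosting-based distributed clustering algorithm, DBCA, is proposed. In each iteration, DBCA assembles the local models that the individual sub-databases build with a weak clustering algorithm into a global model. Each sub-database partitions its data with the global model and then sets the sampling probabilities for the next iteration according to the quality of the partition. The partitions from the iterations so far are integrated by weighted voting, and the partition produced by the final integration is taken as the clustering result. Analysis shows that DBCA can be computed in parallel, scales well and has low communication cost. It helps scientists study cluster analysis in depth and also helps ordinary engineers apply distributed clustering to real-world problems. Experiments show that DBCA obtains results similar to those obtained on a centralized database.
     3. To address the integration scalability of distributed clustering, the OIKI DDM model is extended using hierarchical design, taking into account the network distribution of the databases and the network bandwidth. The result is a mobile-agent-based hierarchical optimized integration mining model, the HOIKI DDM model, together with a corresponding distributed clustering algorithm, HOIKIDC. Experiments and analysis show that HOIKIDC scales better in distributed environments, is more flexible to implement and more efficient, and effectively reduces communication cost, which makes it particularly suitable for clustering large-scale, heterogeneous distributed data.
     4. The integration validity of distributed clustering is studied. The concepts of integration validity and inconsistency of local results are introduced, the causes of the inconsistency are analyzed, and a coordination algorithm is proposed to reduce it. On this basis a distributed clustering algorithm, CDCA, is proposed, which improves global clustering quality through information exchange and coordination among local sites. Experimental results show that CDCA makes the integration of results more effective.
     5. Time series in application domains are often large and stored in a distributed fashion. For such data a distributed fuzzy short time-series clustering algorithm, DFSTS, is proposed. It analyzes the shape similarity of the series so as to better reveal their structure, and its convergence is analyzed. Simulation results show that DFSTS scales well, achieves the same clustering quality as clustering the centralized data set, and is computationally more efficient.
     6. Against the background of a National 863 Program project, and with quality prediction and operation optimization of metallurgical production processes as the object of study, the application of distributed clustering in the metallurgical industry is investigated. A prototype distributed data mining system is designed first. For large-scale, distributed data from a continuous annealing production process, the distributed clustering algorithms proposed in this dissertation are applied to two mining tasks: 1) modeling and prediction of strip breakage; 2) outlier detection. Experimental results show that the approach is effective for analyzing continuous annealing process data and has broad application prospects for analyzing large-scale data from metallurgical production processes.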
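
The DBCA iteration described in item 2 can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say how local models are combined or how partition quality is measured, so the sketch assumes two rules of its own: the global model is obtained by clustering the pooled local centroids, and a point's next sampling probability is proportional to its squared distance from its assigned centroid. The weighted-voting integration across iterations is left out. All function names (`kmeans`, `dbca_round`, and so on) are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: dist2(point, centroids[j]))

def kmeans(points, k, iters=10):
    """K-means with farthest-first initialization, standing in for the weak clusterer."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids

def dbca_round(sites, probs, k, rng, sample_size=30):
    """One round: local weak models, global model, partition, new probabilities."""
    local_centroids = []
    for data, p in zip(sites, probs):
        # Each site clusters a sample drawn with its current probabilities.
        sample = rng.choices(data, weights=p, k=sample_size)
        local_centroids.extend(kmeans(sample, k))
    # ASSUMED combination rule: cluster the pooled local centroids.
    global_model = kmeans(local_centroids, k)
    labels, new_probs = [], []
    for data in sites:
        site_labels = [nearest(x, global_model) for x in data]
        # ASSUMED quality measure: badly fitted points are sampled more often.
        badness = [dist2(x, global_model[l]) + 1e-9 for x, l in zip(data, site_labels)]
        total = sum(badness)
        labels.append(site_labels)
        new_probs.append([b / total for b in badness])
    return global_model, labels, new_probs

rng = random.Random(0)

def blob(cx, cy, n):
    """n synthetic 2-D points scattered around (cx, cy)."""
    return [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for _ in range(n)]

# Two sites, each holding points from the same two well-separated groups.
sites = [blob(0, 0, 40) + blob(5, 5, 40), blob(0, 0, 40) + blob(5, 5, 40)]
probs = [[1 / len(s)] * len(s) for s in sites]
for _ in range(3):
    model, labels, probs = dbca_round(sites, probs, k=2, rng=rng)
```

In this sketch only `k` centroids per site and the `k` global centroids would need to cross the network in a real deployment, which is in line with the low communication cost claimed for DBCA. The per-site loop bodies are written sequentially here for readability but are independent of one another.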

Abstract (English original)

With the rapid development of computer and storage technologies, huge amounts of data have become widely available, and there is a pressing need to turn such data into useful information and knowledge. This has driven growing interest in clustering theory and applications within data mining. Cluster analysis, based on the simple idea that things of one kind come together, divides data into groups of similar objects and is widely applied in many fields.
    In recent years, databases have kept growing and are physically or geographically distributed across more and more locations connected by computer networks. Most existing clustering algorithms, however, struggle to extract knowledge from huge amounts of distributed data, because they need to load all the data into main memory and incur a large computational overhead. New knowledge-discovery methods are therefore needed for large-scale, distributed environments, and distributed clustering is one of them. Distributed clustering is the application of cluster analysis in distributed computing environments and a challenging topic in data mining. This dissertation explores new clustering techniques for distributed environments in order to provide theoretical and technical foundations for using large-scale, distributed data efficiently and appropriately. Several novel distributed clustering methods are proposed for large-scale, distributed data sets, drawing on techniques such as machine learning, artificial intelligence and distributed computing. The main work and results of the dissertation are as follows:
    1. Clustering methods in centralized and distributed environments are surveyed from three aspects: background, algorithms and applications.
    2. To make distributed clustering easy to implement, a novel distributed clustering algorithm (DBCA) is proposed that combines simple, easily implemented algorithms such as K-means with boosting techniques. In each iteration of DBCA, a set of clustering models is first generated from the sub-databases at the sites using a weak clustering algorithm. These models are combined into a global model, which is transmitted to the sites and used to partition the sub-database at each site. Then, according to the quality of the partitions, the sampling probabilities for the next iteration are updated at the sites. Finally, the partitions are integrated into an aggregated partition by weighted voting. The final clustering result is the aggregated partition at the last iteration. DBCA can be computed in parallel, is scalable and has a low communication overhead. It helps scientists investigate cluster analysis and helps ordinary engineers solve real-world problems using distributed clustering techniques. Experimental results show that DBCA is effective and achieves results comparable to those of algorithms that apply boosting techniques to centralized databases.
    3. Integration scalability across a large number of sites holding large-scale, distributed data sets is studied. First, a new hierarchical optimization mining model based on mobile agents (the HOIKI DDM model) is proposed. Based on a hierarchical design and a divide-and-conquer strategy, the model extends the OIKI DDM model according to network topology and bandwidth, and integrates multiple local results among the sites using mobile agents and incremental optimization. Then, a novel distributed clustering algorithm (HOIKIDC) built on the proposed model is presented for clustering large-scale, distributed heterogeneous data sets. The experimental results demonstrate that HOIKIDC is scalable, flexible and efficient, and is particularly suited to large-scale distributed environments. In addition, by exploiting network characteristics, HOIKIDC can dramatically reduce communication cost.
    4. The validity of knowledge integration in distributed clustering is studied. First, integration validity and inconsistency among local results from different sites are defined. Then the inconsistency among local results is analyzed and a coordination algorithm is proposed to reduce it. Furthermore, based on the coordination algorithm, a novel distributed clustering algorithm (CDCA), in which information is exchanged among the sites, is presented to improve clustering quality and integration validity. Experimental results show that CDCA outperforms algorithms without coordination in terms of integration validity.
    5. For large-scale, distributed short time-series data sets, which arise in many fields such as industry and DNA databases, a distributed clustering algorithm (DFSTS) is proposed. It clusters short time series in distributed environments by analyzing the shape similarity hidden in the data, so as to reveal the structure of the data. Based on fuzzy clustering, the algorithm runs at multiple sites without transferring all the data to a single data set. Simulation results demonstrate that the algorithm is effective, efficient and scalable, and provides the same clustering quality as clustering a single centralized data set.
    6. The distributed algorithms proposed in the dissertation are applied at a steel plant in a real-world project (a National "863" Project) to solve real industrial problems. First, a prototype distributed data mining system is designed for applying distributed algorithms in the metallurgical industry. Then, for large-scale, distributed data from continuous annealing processes, two distributed mining tasks that employ the distributed clustering algorithms are performed: 1) modeling and prediction of strip rupture after data preprocessing; 2) detection of outliers. The results indicate that the distributed approaches are effective and need to transfer only models and knowledge rather than the original data. Based on these results, the distributed clustering approaches proposed in this dissertation show considerable promise for analyzing large-scale, distributed data from the metallurgical process industries.
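
Item 5 states that DFSTS is based on fuzzy clustering, runs at multiple sites without moving all the data, and matches the quality of centralized clustering. The abstract gives no formulas, so the following Python sketch only illustrates why such a claim is plausible for standard fuzzy c-means: the center update is a ratio of sums, so each site can send per-cluster partial sums and the coordinator obtains exactly the centralized centers. Comparing shapes through first differences of evenly sampled series is an assumption made here, not necessarily the distance used in the thesis, and all names (`slopes`, `local_stats`, `merge_and_update`, `run`) are hypothetical.

```python
def slopes(series):
    """First differences of an evenly sampled series (ASSUMED shape descriptor)."""
    return [b - a for a, b in zip(series, series[1:])]

def memberships(x, centers, m=2.0):
    """Standard fuzzy c-means memberships of x, computed from squared distances."""
    d = [max(sum((a - b) ** 2 for a, b in zip(x, c)), 1e-12) for c in centers]
    inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
    s = sum(inv)
    return [v / s for v in inv]

def local_stats(data, centers, m=2.0):
    """Runs at one site: per-cluster weighted sum vectors and weight totals."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for x in data:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for t in range(dim):
                num[j][t] += w * x[t]
    return num, den

def merge_and_update(stats):
    """Runs at the coordinator: add the site statistics and form the new centers."""
    k = len(stats[0][1])
    dim = len(stats[0][0][0])
    centers = []
    for j in range(k):
        den = sum(s[1][j] for s in stats)
        centers.append([sum(s[0][j][t] for s in stats) / den for t in range(dim)])
    return centers

def run(sites, centers, rounds=20):
    """Alternate site-side statistics and coordinator-side center updates."""
    for _ in range(rounds):
        centers = merge_and_update([local_stats(s, centers) for s in sites])
    return centers

# Two sites holding short series; rising and falling shapes at different offsets.
site_a = [slopes(s) for s in ([0, 1, 2, 3], [0, 1.1, 2.1, 3.2], [3, 2, 1, 0])]
site_b = [slopes(s) for s in ([1, 2, 3, 4], [5, 4, 3, 2], [4, 3.1, 2, 1])]
start = [[0.5, 0.5, 0.5], [-0.5, -0.5, -0.5]]

distributed = run([site_a, site_b], start)
centralized = run([site_a + site_b], start)  # same data pooled at one site
```

Per round, each site transmits one vector and one scalar per cluster regardless of how many series it holds, and `distributed` agrees with `centralized` up to floating-point rounding.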

