聚类CLIQUE算法及其并行化研究

英文题名：CLIQUE Algorithm and Parallelize
作者：祖峰
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：数据挖掘 ; 聚类 ; 并行算法 ; 工作站机群
英文关键词：Data mining ; clustering ; parallel algorithm ; NOW
学位年度：2003
导师：熊忠阳
学科代码：081201
学位授予单位：重庆大学
论文提交日期：2003-05-08
答辩委员会主席：廖晓峰

摘要

数据挖掘是帮助人们在海量数据中发现信息和知识的工具。近年来数据挖掘技术成了商业智能的核心技术，被广泛应用到了诸多领域，引起了学术界极大的关注。聚类分析是数据挖掘中的一个重要研究领域，它从数据库中寻找数据间的相似性，从而优化大规模数据库的查询和发现数据中隐含的有用信息或知识。如何进行快速聚类以及如何取得更好的聚类结果成了聚类数据挖掘算法研究的重点和难点。
    CLIQUE算法综合了基于密度和基于网格的聚类方法，它有着速度快的优点。但是由于方法太简化，可能会降低聚类结果的精确性。通过深入的研究和分析，发现由于CLIQUE算法没有考虑到如何利用当前挖掘数据的特性，而是进行一种硬性的网格划分，因此增加了计算复杂程度，而为了降低计算的复杂程度就只能降低聚类结果的精确性。针对上述问题论文引入了自适应的网格划分方法，通过在一维的情况下预先分割区间，然后找出密集分割区间并对分界进行调整来得到密集区间，最后把这些密集区间作为划分网格的依据。这种划分网格的方法很好地利用了当前要挖掘的数据的特性，同时减少了网格的数量以及密集单元候选集的数目，大幅度减少了计算的复杂程度，从而使得在每个子空间进行计算成为了现实，也大大提高了聚类结果的精确性，但算法的时间复杂度仍是指数级的。只是这个指数是维数，使得算法的时间复杂度比起很多聚类算法的仍然简单很多。
    为了进一步提高算法的执行效率，论文还对并行CLIQUE算法进行了研究。选用通过商用网络连接起来的PC机，以及并行虚拟机PVM和分布式操作系统LINUX，共同构成了一个机群系统作为并行计算平台。在并行程序的模型上选用了Master/Slave模型。该并行算法将数据集分配到各个节点机上实现了数据并行，在数据并行的基础上，当生成密集单元候选集以及验证密集单元的时候又采取了任务并行的方法。由于主体是数据并行，因此达到了接近线性的加速比。每个节点计算任务的时间复杂度由两部分构成，一部分是指数级的验证密集单元的时间复杂度，另一部分是线性的通信时间复杂度。
    最后，通过实验验证了并行CLIQUE算法的可行性，从实验中得到的并行算法的加速比与理论分析结果一致。实验表明，并行CLIQUE算法在提高了聚类挖掘结果精确度的同时达到了较高的效率，同时由于算法是基于PVM的机群系统开发的，因此算法的通用性较强。
Data mining technology is used to help people finding the information and knowledge in the data. It has become the core technology of the intelligence commerce. It has been widely used in many areas and drawn the attention of the whole academe. Clustering is one of the most important areas in data mining Clustering finds the similarity among the data and use it to optimal the query of the large scale databases and find the hidden useful information and knowledge. How to make the clustering faster and the result of the clustering more accurate is of the most importance and hardness.
    CLIQUE is integrated density-based and grid-based method. It has the advantage of faster speed. But due to simplify the procedure, the accuracy of the clustering may be degraded. After deeply investigate and analysis, we found the drawback of CLIQUE lies in its inconsideration of the characteristic of the data being processed. It grid the data into a predefined grid and this adds up to the complexity of the computation. Then it has to degrade the accuracy of the result to degrade the complexity of computation,. We introduce adaptive-grid method to settle this problem. We divide each dimension into a fix interval and join the dense interval to dense part. At the boundary of the each dense part, boundary is adjusted by dividing a smaller interval. Finally the adaptive-grid is produced according to the dense part. This method makes full use of the characteristic of the data being processed. The number of dense unit and candidate dense unit is great reduced. At the same time the complexity of the computation is greatly decreased. So, computation in each dimension is feasible. This make the accuracy of cluster upgraded. But the computation complexity of the algorithm is still exponential. Due to the fact the exponent is dimension, the complexity of algorithm is still less than other clustering algorithms.
    To make the algorithms more efficient, it was parallelized. The hardware platform is PC connected with LAN. The software platform is PVM and LINUX. They construct the whole PC-cluster system. The parallel program model is master/slave model. The algorithm assign data set to each node realizes the data-parallel. When produce dense unit, task-parallel is used. Due to the fact the algorithm is complete data-parallel; the speedup of the algorithm is nearly liner. The time complexity of the each node is composes of exponential computation time and liner communication time.
    At last, the experiment proves the feasibility of the algorithm and the speedup gets from the experiment is in accord with theoretical one. The experiment also proves the parallel algorithm upgrade the accuracy of the clustering result combined with more efficient. Because the algorithm is based on PVM cluster, it is more popular.

引文

[1] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan ·Automatic subspace clustering of high dimensional data for data mining application·In Proc. 1998 ACM-SIGMOD Int.Conf.Management of Data(SIGMOD'98)
    [2] Martin Ester, Hans-Peter Kriegel el ·A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise·Published in Proceedings of 2nd international conference on knowledge discovery and data mining
    [3] Jiawei Han , Micheline Kamber 著· 范明，孟小峰译·数据挖掘概念与技术·机械工业出版社
    [4] 薛一波，王建中·并行处理中加速比的研究·计算机工程与设计·Vol 16·NO1
    [5] Cosimo Anglano, Macro Botta ·NOW G-Net:Learning Classification Programs on Networks of Workstations·IEEE Transaction on evolutionary computation·Vol 6·NO 13·Oct 2002
    [6] Harsha S. Nagesh , Sanjay Goil , Alok N. Choudhary·A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets·International Conference on Parallel Processing·2000
    [7] Pavel Berkhin·Survey Of Clustering Data Mining Techniques · http://citeseer.nj.nec.com/berkhin02survey.html
    [8] 王实，高文·数据挖掘中的聚类方法·计算机科学·2000Vol.27,No.4
    [9] Han J. ·Conference tutorial notes: data mining techniques. ·In: Proceedings of ACM SIGMOD International Conference'96 on Management of Data (SIGMOD'96). Montreal, Canada·June 1996
    [10] Ruspini HE·A new approach to clustering ·Inf Cont. 1969
    [11] Bezdek JC· Pattern recognition with fuzzy objective function algorithms. · New York:Plenum Press,1981
    [12] 李金宗编著·模式识别导论·高等教育出版社
    [13] Barry Wilkinson, Michael Allen 著·陆鑫达等译·并行程序设计·机械工业出版社·2002
    [14] Sanpawat Kantabutra, Alva L. Couch·Parallel K-means Clustering Algorithm on NOWs·Nectac Technical Journal·Vol 1,NO 6 Jan. 2000
    [15] Alexander Hinneburg, Daniel A. Keim·Optimal Grid-Clustering:Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering·Proceedings of 25th VLDB Conference,1999
    [16] Alexander Hinneburg, Daniel A. Keim·An Efficient Approach to Clustering in Large Multimedia Databases with Noise·Knowledge Discovery and Data Mining·1998
    [17] Cheng-Fa Tsai,Han-Chang Wu,Chun-Wei Tsai·A new Data Clustering Approach for Data Mining in Large Database·Proceeding of the International Symposium on Parallel Architecture, Algorithm and Networks·2002
    [18] R.Agrawal, H.Mannila, R.Srikant, H.Toivonen, A.L.Verkarno·Fast Discovery of Association Rules·Advance in Knowledge Discovery and Data Mining·AAAI/MIT Press,1996
    [19] C.Aggrarwal,C.Procoplue,J.Wolf,P.Yu, J.Park·A Framework for Finding Projected Clusters in High Dimensional Spaces·Proc. ACM SIGMOD int. Conf.on Management of Data·1999
    [20] I.Dhilln,D.Modha ·A data-Clustering algorithm on distributed memory multiprocessors· Large-Scale parallel KDD Systems· ACM SIGKDD International Conference on Knowledge Discovery and Data Mining·1999
    [21] Laks V.S. Lakshmanan, Raymond T.Ng, Christine Xing Wang, Xiaodong Zhou, Theodore J.Johnson ·The Generalized MDL Approach for Summarization·Proceeding of the 28th VLDB Conference,2002
    [22] D. Foti , D. Lipari,C. Pizzuti , D. Talia·Scalable Parallel Clustering for Data Mining on Multicomputers·Lecture Notes in Computer Science·2000
    [23] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park·Fast algorithms for projected clustering·In Proceedings of ACM SIGMOD Conference on Management of Data· pages 61-72·1999
    [24] C. Aggarwal, and P. S. Yu·Finding generalized projected clusters in high dimensional spaces·SIGMOD-00·2000
    [25] Felicity George Arno Knobbe·A Parallel Data Mining Architecture for Massive Data Sets · http://monetdb.cwi.nl/acoi/DMW/publications/JofKDDM99.pdf
    [26] Alex A.Freitas·A Survey of Parallel Data Mining·Dep.de Informatica（DAINF）AV.sete　de　setembro，3165
    [27] D.Foti,D.Lipari,C.Pizztui,D.Talia ·Scalable Parallel Clustering for Data Mining on Multicomputers
    [28] ZheXue Huang·A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
    [29] XiaoWei Xu, Martin Ester, Hans-Peter Kriegel, J. Sander·A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases· Proc. Of 14th International Conf. On Data Engineering·1998
    [30] Michael Steinbach, Levent Ertoz, Vipin Kumar·The Challenges of Clustering High Dimensional Data

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700