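Research on Two-tier Structure Clustering Mining Based on Data Stream

Original title (Chinese): 基于数据流双层结构聚类挖掘的研究
Author: Chu Hongtao (楚红涛)
Degree level: Master's
Discipline: Computer Applications Research
Keywords: data stream; cluster mining; outlier detection; two-tier structure
Degree year: 2008
Advisor: Han Feng (寒枫)
Discipline code: 081203
Degree-granting institution: North China Electric Power University (Hebei)
Thesis submission date: 2007-12-18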

Abstract

As computer technology develops, more and more applications generate stream data. Unlike conventional static data stored on disk, stream data is continuous, ordered, fast-changing, and massive. The main contribution of this thesis is the design and implementation of TWDSCluster, a two-tier clustering algorithm for data streams that consists of an online clustering layer and an offline clustering layer. To retain summary information about the points in the stream efficiently, the framework introduces micro-clusters and a pyramidal time frame: summary statistics of the data points are kept in the form of micro-clusters, which are stored according to the pyramidal time frame. The algorithm can also detect outliers in the data stream efficiently. Simulation experiments and comparisons with other algorithms indicate that TWDSCluster is efficient and achieves higher clustering accuracy within limited memory. Finally, the thesis summarizes its content and outlines directions for future work.
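The abstract does not give the internals of TWDSCluster, so the following is only a minimal sketch of the two ideas it names, written in the style of the CluStream framework rather than as the thesis's actual method. The class name `MicroCluster`, the function `snapshot_orders`, and the parameters `alpha` and `max_order` are all hypothetical. A micro-cluster is assumed here to be an additive summary (point count, per-dimension linear sum, per-dimension squared sum), and a pyramidal time frame is assumed to take a snapshot of order i whenever the clock tick is divisible by alpha**i.

```python
import math


class MicroCluster:
    """Additive summary of a group of points (assumed CluStream-style layout)."""

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # per-dimension linear sum
        self.ss = [0.0] * dim      # per-dimension sum of squares

    def add(self, point):
        """Absorb one point by updating the three summary statistics."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square deviation of the absorbed points from the centroid."""
        var = sum(
            self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
            for i in range(len(self.ls))
        )
        return math.sqrt(max(var, 0.0))

    def distance(self, point):
        """Euclidean distance from the centroid to a new point."""
        return math.dist(self.centroid(), point)


def snapshot_orders(t, alpha=2, max_order=10):
    """Orders i whose snapshot falls on clock tick t (t divisible by alpha**i)."""
    return [i for i in range(max_order + 1) if t % (alpha ** i) == 0]
```

Under these assumptions, an online layer could treat a point whose distance to every micro-cluster centroid is well above that cluster's radius as an outlier candidate, while an offline layer could re-cluster the stored snapshots on demand. Because the summaries are additive, two snapshots can be subtracted to recover the statistics of the time window between them.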

