多时间序列数据流聚类算法研究

英文题名：Research on Mulitple Time Series Streams Clustering Algorithms
作者：金燕
论文级别：硕士
学科专业名称：管理科学与工程
中文关键词：多时间序列数据流 ; 聚类 ; 离散傅立叶变换 ; 多层次相关度
英文关键词：multiple time series streams ; clustering ; DFT ; multilayer correlation
学位年度：2007
导师：刘青宝
学科代码：1201
学位授予单位：国防科学技术大学
论文提交日期：2007-11-01

摘要

随着数据流应用的需求不断增加,数据流上的挖掘算法越来越受到国内外研究学者的重视。本文针对多时间序列数据流环境,提出计算数据流对间多层次相关度的方法,并且在多层次相关度计算的基础上提出针对多数据流环境的挖掘算法。
     多层次相关度计算的优点:(1)多数据流的多层次相关度的计算方法对传统时间序列统计相关度的方法进行了改进,适应数据流多变性和不可重复性的特点;(2)运用离散傅立叶变换来对数据流数据进行压缩和处理,降低了系统内存存储需求,加快了计算机处理的时间;(3)运用多层次时间窗口模型对数据流进行多层次相关度的统计,不仅知道当前时刻的数据流对间相关性,而且可以通过多粒度聚集树(DSMGA-Tree)结构查询数据流对间历史时刻的相关性,并返回多层次的相关度信息,在一定的误差范围内满足了详细查询请求。
     在多层次相关度计算的基础上,本文提出针对多数据流环境的动态挖掘算法。(1)算法将传统数据库领域的数据挖掘技术运用到多数据流环境,动态地实现了多数据流聚类;(2)提出基于相关度的多数据流动态聚类算法(CBDMSClustering),运用DBSCAN基于密度的聚类思想,可以聚集出任意形状的类,同时,时间复杂度并没有增加;(3)在CBDMSClustering算法的基础上进一步改进,提出基于相对相关度的多数据流动态聚类算法(RCBDMSClustering),除了可以聚集出任意形状的结果,还可以区分出不同相关度的类,将受低相关类掩盖的高相关类识别出来。实验证明了多数据流动态聚类算法的改进可以得出更好的聚类结果。
Along with the requirements of data streams' applications increasing, clustering algorithms on data streams attract more and more researchers' interests. The paper aims at multiple time series streams, brings forward the method to compute multilayer correlation, and brings forward mining algorithms for multi-streams on the base of multilayer correlation.
     Multi-streams' multilayer correlation computing method have three advantages: (1)Multi-streams' multilayer correlation computing method improves the classic time series correlation methods , adapting to streams variety and the character of can not repeat;(2)We use DFT to compress and deal with data streams' data, it's reducing the requirements of system's memory storage, quickening computer's deal time;(3)We use multilayer time windows model to statistic multilayer correlations. We not only know correlations between current data streams, but also can query history correlations between multi-streams by DSMGA-Tree, returning multilayer correlations, meeting particular query request under a certain error range.
     On the base of multilayer correlation, we bring forward dynamic mining algorithms for multi-streams. (1)Algorithms use classic mining techniques in databases to multi-streams, achieve multi-streams clustering dynamical; (2) We propose CBDMSClustering on the base of correlation , use DBSCAN's idea on the base of density, can gain arbitrary clusters, meanwhile without add time complexity;(3)On the base of CBDMSClustering, we improve further, propose RCBDMS Clustering. It can distinguish different correlation density clusters, other than gain arbitrary clusters. It can identify high correlation clusters which covered by low correlation clusters. Experiments prove that the multi-streams dynamic clustering algorithms advances result obviously.

引文

[1]Nauman Chaudhry,K.Shaw,M.Abdelguerfi.Stream Data Management[J].June 1,2004.
    [2]NOAA.National Oceanic and Atmospheric Administration[J],U.S.National Weather Service.http://www.nws.noaa.gov/;
    [3]Unidata.Atmospheric data repository[J]http://www.unidata.ucar.edu/;
    [4]Babu,Shivnath;Subramanian,Lakshminarayan,Widom,Jennifer.A data stream management system for network traffic[C].In Proceedings of Workshop on Network-Related Data Management(NRDM 2001).31st May 2001.
    [5]A.C.Gilbert,Y.Kotidis,S.Muthukrishnan,M.J.Strauss.QuickSAND:Quick summary and analysis of network data[J].DIMACS Technical Report 2001-43,November 2001.
    [6]Samuel Madden,Michael J.Franklin,Berkeley madden.Fjording the stream:An architecture for queries over streaming sensor data[C].In:Proc of the 2002 Intl Conf on Data Engineering,2002-02.California PATH Smart-AHS.
    [7]More details at http://path.berkeley.edu/SMARTAHS/index.html
    [8]Lukasz Golab,M.Tamer Ozsu,Issues in Data stream Management[C].SIGMOD Record,32(2):5-14,2003.
    [9]Daniel J.Abadi,Don Carney,Ugur Cetintemel,Mitch Cherniack.Aurora:A new model and architecture for data stream management[C].The VLDB Journal-The International Journal on Very Large Data Bases.Volume 12,Issue 2,August 2003.120-139.
    [10]Rajeev Motwani,Jennifer Widom,Arvind Arasu.Query processing,resource management,and approximation in a data stream management system[C].Proceedings of the 2003 CIDR Conference.2003.190-202.
    [11]Alfredo cuzzocrea,Wei Wang.Approximate range-sum query answering on data cubes with probabilistic guarantees[J].Springer Netherlands,Volume 28,Number 2.2007.161-197.
    [12]S.Guha,N.Mishra,R.Motwani,and L.O'Callaghan.Clustering data streams[C].In IEEE Symposium on Foundations of Computer Science (FOCS'00),Redondo Beach,CA,2000.
    [13]C.Aggarwal,J.Han,J.Wang,and P.Yu.A framework for clustering evolving data streams[C].Proc.VLDB conference,2003.
    [14]Feng Cao,Martin Ester,Weining Qian,and Aoying Zhou.Density-based Clustering over an Evolving Data Stream with Noise[C].In Proceedings of the 2006 SIAM Conference on Data Mining(SDM'2006).
    [15]Agrawal R,Faloutsos C,Swami A.Efficient Similarity Search in Sequence Databases[C].Proc.of the FODO Conf.,Springer,1993
    [16]Min Wang,Xiaoyang Scan Wang:Efficient Evaluation of Composite Correlations for Streaming Time Series[J].WAIM 2003:369-380
    [17]Pedro Pereira Rodrigues,Joao Gama,Joao Pedro Pedroso.ODAC:Hierarchical clustering of time series data streams[C].In Proceedings of the Sixth SIAM International Conference on Data Mining,pages 499-503,Bethesda,Maryland,USA,April 2006.SIAM.
    [18]Beringer J,Hullermeier E.Online Clustering of Parallel Data Streams.Data &KnowledgeEngineering.2005.http://wwwiti.cs.uni-magdeburg.de/iti_dke/p ublications/DKE_draft.pdf
    [19]史金成.基于相关性的数据流聚类及其相关性研究[D].合肥:合肥工业大学硕士学位论文.2007.5.
    [20]陈安龙.多数据流处理的关键技术研究[D].成都:四川大学博士学位论文.2006.4.
    [21]Babcock B,Babu S,Datar M,Motwani R,Widom J.Models and issues in data stream systems[C].In:Popa L,ed.Proc.of the 21st ACM Symp.on Principles of Databases Systems(PODS).Madison,2002.1-16.
    [22]赵基.基于数据挖掘的银行客户分析管理关键技术研究[D].杭州:浙江大学博士学位论文,2005,5.
    [23]Ankur Jain.Statistical Mining in Data Streams[D].A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science.December 2006.6.
    [24]金澈清,钱卫宁,周傲英。流数据分析与管理综述[J].软件学报,2004,15(8):1172-1181。
    [25]STREAM:The stanford stream data manager[J].IEEE Data Engineering Bulletin,Stanford Stream Data Management(STREAM) Project.http://www-db.stanford.edu/stream.
    [26]Daniel J.Abadi,Don Camey,Ugur Cetintemel,Mitch Cherniack.Aurora:A new model and architecture for data stream management[J].The VLDB Journal—The International Journal on Very Large Data Bases.Volume 12,Issue 2,August 2003.120-139.
    [27]Jianjun Chen,David J.DeWitt,Feng Tian,Yuan Wang.NiagraCQ:A scalable continuous query system for internet databases[C].SIGMOD Conference 2000.
    [28]KRISHNARNURTHY,Cooper O,Cooper O.TelegraphCQ:Continuous dataflow processing,Continuous queries over data streams[J].2003.11-18.
    [29]李建中,郭龙江,张冬冬,王伟平.数据流上的预测聚集查询处理算法.软件学报,2005,16(7):1252-1261.http://www.jos.org.cn/1000-9825/16/1252.htm
    [30]张冬冬,李建中,王伟平,郭龙江.数据流历史数据的存储与聚集查询处理算法[J].软件学报 2005,23.
    [31]李建中.无线传感器网络专刊前言[]J.软件学报,2007,18(5):1077-1079.
    [32]Kleinberg J.Bursty and Hierarchical Structure in streams[C].In:Proc.2002ACM SIGKDD Intl.Conf.on Knowledge Discovery and Data Mining,Aug,2002.
    [33]Aggarwal CC,Han J,Wang J,Yu PS.A framework for projected clustering of high dimensional data streams[C].In:Nascimento MA,-zsu MT,Kossmann D,Miller RJ,Blakeley JA,Schiefer KB,eds.Proc.of the VLDB.Toronto:Morgan Kaufmann Publishers,2004.852-863.
    [34]Domingos P,Hulten C.Mining high-speed data streams[C].In:Proc.of the KDD.2000.http://citeseer.ist.psu.edu/domingos00mining.html.
    [35]Hulten G,Spencer L,Domingos P.Mining time-changing data streams[C].Proc of ACM SIGKDD Int Conf Knowledge Diseovery in Databases (KDD'01),2001.
    [36]Zhou A,Cai Z,Wei L,Qian W.M-Kernel merging:Towards density estimation over data streams[C].In:Cha SK,Yoshikawa M,eds.The 8th Int'l Conf.on Database Systems for Advanced Applications(DASFAA 2003).Kyoto:IEEE Computer Society,2003.285-292.
    [37]Pedro Pereira Rodrigues,Joao Gama.Online prediction of streaming sensor data[J],http://www.cs.cmu.edu/～jroure/iwkdds/IWKDDS06PS.pdf.
    [38]Garofalakis M,Gehrke J,Rastogi R.Querying and mining data stream:you only get one look A tutorial[J].In:Franklin MJ,Moon B,Ailamaki A,eds.Proc.of the 2002 ACM SIGMOD Int'l Conf.on Management of Data.Madison:ACM Press,2002.635.
    [39]Yunyue Zhu,Dennis Shasha Statstream:Statistical Monitoring of Thousands of Data Streams in Real Time[J].TR2002-827August 15,2002.
    [40]Muthukrishnan,S.:Data streams:algorithms and applications[C].In:Proc.of the 4th annual ACM-SIAM symposium on discrete algorithms.(2003).
    [41]S.Chandrasekaran,S.Krishnamurthy,S.Madden,A.Deshpande,M.J.Franklin,J.M.Hellerstein,M.Shah.Windows Explained,Windows Expre ssed[J].2003.www.cs.berkeley.edu/～sirish/research/streaquel.pdf.
    [42]Ukarsh Srivastava,Jennifer Widom.Flexible time management in data stream systems[C].Proceedings of the twenty-third ACM SIGMOD -SIGACT-SIGART symposium on Principles of database systems,2004, 263-274.
    [43]ARASU Arvind,BABUShivnath,WIDOM.Jennifer,Lausen.Georg,Suciu.Dan A Arasu,S Babu,J Widsom.CQL:A language for continuous queries over streams and relations[J].DBPL:database programming languages:Potsdam,6-8 September 2003.
    [44]武珊珊,宋宝燕,袁锋,于亚新,于戈.数据流模型研究[J]。计算机研究与发展增刊.2004,10.430-435.
    [45]Widmer G,Kubat M.Leaming in the Presence of Concept Drift and Hidden Contexts[J].Machine Learning,1996,23(1):69.
    [46]J.Han,M.Kambr著,范明,孟小峰译.数据挖掘概念与技术[M].北京:机械工业出版社,2001.
    [47]梁循。数据挖掘算法与应用[M].北京:北京大学出版社,2006.
    [48]T.Zhang,R.Ramakrishnan,and M.Livny.BRICH:An efficient data clustering method for very large databases[C].In proc.1996 ACM SIGMOD,Montreal,Canada,June 1996:103-114.
    [49]MITCHELL,T.Machine Learning[M].McGraw-Hill,New York,Y.1997.
    [50]S.Guha,R.Rastogi,and K.Shim.CURE:An efficient clustering algorithm for large databases[C].In Proc.SIGMOD,Seattle,WA,1998:73-84.
    [51]A.Hinneburg and D.Keim.An efficient pproach to clustering large multime dia databases with noise[C].In Proc.of the ACM SIGKDD,1998:58-65.
    [52]M.Ester,H.P.Kriegel,J.Sander and X.Xu.Adensity-based algorithm for discovering clusters in large spatial databased with noise[C].Proc.2nd Int.Conf.on Knowledge Discovery and Data Mining Portland OR 1996:226-231.
    [53]李雄飞,李军.数据挖掘与知识发现[M].第一版.高等教育出版社.2003.11:92-111.
    [54]M.Ankerst,M.Breunig,H.p.Kriegel,and J.Sander.OPTICS:Ordering points to identify the clustering structure[C].In:Proc,1999 ACM_SIGMOD Int.Conf.Management of Data(SIGMOD'99),Philadephia,PA,1999:49-60.
    [55]J.Wang,R.Yang,Muntz.STING:A Statistical Information Grid Approach to Spatial Data Mining[C].In:Proc.1997Int.Conf.Very Large Data Bases (VLDB'97),1997:186-195.
    [56]D.Fisher.Improving inference through conceptual clustering[C].In:Proc.of 1987AAAI Conf,Seattle,WA,1987:461-465.
    [57]J.Gennar,P.Langley,D.Fisher.Models of incremental concept formation[J].Artificial Intelligence,40(1):11-61.
    [58]P.Cheeseman,J.Stutz.Bayesian classification(AutoClass):Theory and Result[J].Advances in Knowledge Discovery and Data Ming.1996,18(2): 153-180.
    [59] James Kennedy and Rui Mendes.Neighborhood topologies in fully informed and best-of-neighborhood particle swarms[C].In:Proceedings of the 2003 IEEE International Workshop on Soft Computing in Industrial Applications,2003:45-50.
    [60] P.B.Gibbons and Y.Matias. Synopsis data structures[C]. In Proc. of SODA,pages 909-910,1999.
    [61] Y.Ioannidis and S. Christodoulakis. Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results[C]. ACM Transactions on Database Systems, Vol. 18, No.4, pages 709-748, December 1993.
    [62] Y.Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation[C]. Proceedings of ACM SIGMOD, 1995.
    [63] V. Poosala and Y.Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption[C]. Proceedings of VLDB,1997.
    [64] H. Jagadish,N. Koudas, S. Muthukrishnan,V. Poosala, K. Sevcik, and T.Suel. Optimal histograms with quality guarantees[C]. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 275-286,1998.
    [65] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries[C]. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 58-66,2001.
    [66] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman.Computing iceberg queries efficiently[C]. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 299-310,1998.
    [67] L. Qiao, D. Agrawal, A. El Abbadi, Rhist: adaptive summarization over continuous data streams[C].in: Proceedings of the Eleventh International Conference on Information and Knowledge Management, ACM Press, New York, 2002, pp. 469 - 476.
    [68] Francesco Buccafurri, Gianluca Lax, Fast range query estimation by N-level tree histograms[J]. Data & Knowledge Engineering, v.51 n.2, p.257-275,November 2004.
    [69] JS Viter. Random sampling with a reservoir[C].ACM Transactions on Mathemetical Software(TOMS), 1985.37-57.
    [70] PB Gibbons,Y Matias.New sampling-based summary statistics for improving approximate query answers[C].Proceedings of the 1988 ACM SIGMOD international conference on Management of data.Seattle,Washington,United States.1988.331-342.
    [71]Liu Q B,Deng Su,Lu C H,et al.Relative Density Based K-nearest Neighbors Clustering Algorithm[A].In:Prec.2003 Int.Conf.on Machine leaming and Cybemetics[C],Xi' flu,China,2003,133-137.
    [72]Matias Y,Vitter JS,Wang M.Wavelet-Based histograms for selectivity estimation[C].In:Haas LM,Tiwary A,eds.SIGMOD 1998,Proc.of the ACM SIGMOD Int'l Conf.on Management of Data.Seattle:ACM Press,1998.448-459
    [73]Matias Y,Vitter JS,Wang M.Dynamic maintenance of wavelet-based histograms[C].In:Abbadi AE,Brodie ML,Chakravarthy S,Dayal U,Kamel N,Schlageter G,Whang KY,eds.VLDB 2000,Proc.of the 26th Int'l Conf.on Very Large Data Bases.Cairo:Morgan Kaufmann,2000.101-110
    [74]Gilbert AC,Kotidis Y,Muthukrishnan S,Strauss MJ.Surfing wavelets on streams:One-Pass summaries for approximate aggregate queries[C].In:Apers PMG,Atzeni P,Ceri S,Paraboschi S,Ramamohanarao K,Snodgrass RT,eds.VLDB 2001,Proc.of the 27th Int'l Conf.on Very Large Data Bases.Roma:Morgan Kaufmann,2001.79-88
    [75]Garofalakis M,Gibbons PB.Wavelet synopses with error guarantees[C].In:Franklin MJ,Moon B,Ailamaki A,eds.Proc.of the 2002 ACM SIGMOD Int'l Conf.on Management of Data.Madison:ACM Press,2002.476-487.
    [76]Bulut,A.and Singh,A.K.SWAT:Hierarchical stream summarization in large networks[C].In Proceedings of the ICDE.2003.
    [77]Gilbert AC,Kotidis Y,Muthukfishnan S,Strauss MJ.Surfing wavelets on streams:One-Pass summaries for approximate aggregate queries[C].In:Apers PMG,Atzeni P,Ceri S,Paraboschi S,Ramamohanarao K,Snodgrass RT,eds.VLDB 2001,Proc.of the 27th Int'l Conf.on Very Large Data Bases.Roma:Morgan Kaufmann,2001.79-88
    [78]Patrick Pantel,Dekang Lin.2001.A Statistical Corpus-Based Term Extractor[C].Canadian Conference.on.AI2001.36-46.
    [79]A.Broder and M.Mitzenmacher.Network applications of bloom filters:A survey[C].Allerton Conference,2002.
    [80]http://www.eecs.harvard.edu/michaelm/NEWWORK/papers.html.
    [81]K.L.Chung,A Course on Probability Theory[M].New York:Academic,1974.
    [82]Yu J X,Chong Z,Lu H,Zhou A.False Positive or False Negative:Mining Frequent Itemsets from High Speed Transactional Data Streams[C].In:Proceedings of the 30th International Conference on Very Large Data Bases.Toronto,Canada:Morgan Kaufmann,2004.204-215.
    [83]Barbara D.Requirements for clustering data streams[C].ACM SIGKDD Explorations Newsletter,2003,3(2):23-27.
    [84]Ester M,Kriegel H P,Sander J,et al.Incremental Clustering for Mining in A Data Warehousing Environment[C].In:Proceedings of the 24th International Conference on Very Large Data Bases.New York,Morgan Kaufmann Publishers Inc.,1998-06:323-333
    [85]Chalaghan LO,Mishra N,Meyerson A,Guha S.Streaming data algorithms for high-quality clustering[C].In:Proc.of the 18th Int'l Conf.on Data Engineering.San Jose,2002.685-694.
    [86]Babcock B,Datar M,Motwani R,Callaghan LO.Maintaining variance and k-medians over data stream windows[C].In:Proc.of the 22nd ACM SIGACT-SIGMOD-SIGART Symp.Principles of Database Systems.San Diego:ACM Press,2003.234-243.
    [87]Donghui Zhang,Dimitrios Gunopulos,Vassilis J.Tsotras,Bernhard Seeger Temporal aggregation over data streams using multiple granularities[C].Advances in Database Technology-EDBT 2002:8th International Conference on Extending Database Technology,Prague,Czech Republic,March 25-27,2002.Proceedings
    [88]Yang J,Widom J.Incremental Computation and Maintenance of Temporal Aggregates[J].Proc.of ICDE,2001.
    [89]丁玉美,高西全.数字信号处理[M].西安电子科技大学出版社.2001.
    [90]C.Faloutsos,M.Ranganathan,and Y.Manolopoulos.Fast Subsequence Matching in Time-Series Databases[C].Proceedings of ACM SIGMOD,Minneapolis,MN,pages 419-429,May,1994.
    [91]Y.Chen,G.Dong,J.Han,B.W.Wah,J.Wang.Multi-Dimensional Regression Analysis of Time-Series Data Streams[C].In Proc.Int.Conf.on VeryLarge Data Bases,2002,pp.323-334.
    [92]A Deshpande,C Guestfin,S Madden Using probabilistic models for data management in acquisitional environments[C]。Proc.Biennial Conf.on Innovative Data Sys.Res.(CIDR),2005。
    [93]Guha S,Gunopulos D,Koudas N.Correlating Synchronous and Asynchronous Data Streams[C].In:Proc of The 9th ACM SIGKDD Intl Conf on Knowledge Discovery and Data Mining,2003.529-534
    [94]Aggarwal C C.A Framework for Diagnosing Changes in Evolving Data Streams[C].in:Proc of the 2003 ACM SIGMOD Intl Conf on Management of Data,2003.575-586
    [95]刘青宝,何勇,邓苏,张维明.基于相对密度的多分辨聚类算法[J].小型微型计算机系统,2007,7.
    [96]Cormode G,Muthukrishnan S.What's hot and what's not:Tracking most frequent items dynamically[C].In:Proc.of the 22nd ACM SIGACT-SIGMOD-SIGART Symp.on Principles of Database Systems.San Diego:ACM Press,2003.296-306.
    [97]Arasu A,Manku G S.Approximate counts and quantiles over sliding windows[C].In:Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.Paris,France:ACM Press,2004,286-296
    [98]Katerina T,Frantzi.Incorporating Context Information for the extraction of terms[J].Proceedings of the 35th annual meeting on Association for Computational Linguistics.1997,501-503.
    [99]朱蔚恒,印鉴,谢益煌.基于数据流的任意形状聚类算法[J].软件学报,2006,17(3):379-387.http://www.jos.org.cn/1000-9825/17/379.htm
    [100]刘青宝,戴超凡,邓苏,张维明.基于网格的数据流聚类算法[J].计算机科学,2007,3.
    [101]Hoeffding,W.Probability inequalities for sums of bounded random Variables[J],JASA,1963,58:13-30..
    [102]O.Maron and A.Moore.Hoeffding races:Accelerating model selection search for classification and function approximating[J].In Advances in Neural Information Processing Systems,pages 59-66,1994.
    [103]Breunig MM,Kriegel H,Ng RT,Sander J.LOF:Identifying density-based local outliers[C].In:Dunham M,Naughton J,Chen W,eds.Proc.of the 2000 ACM SIGMOD Int'l Conf.on Management of Data.Dallas:ACM Press,2000.93-104..

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700