数据流聚类分析算法

英文题名：Algorithms for Clustering over Data Streams
作者：曹锋
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：数据流 ; 聚类分析 ; 进化 ; 窗口 ; 图形处理器
英文关键词：data stream ; clustering ; evolving ; window ; graphic processing unit
学位年度：2006
导师：周傲英
学科代码：081202
学位授予单位：复旦大学
论文提交日期：2006-04-12

摘要

近年来，许多应用中的数据是以流的形式产生的，例如网络流，传感器数据，以及网页点击流等。分析和挖掘这类数据日益成为一个热点问题。作为一种基础的数据挖掘手段，聚类分析在数据流环境下得到了学术界和工业界的广泛关注。与传统数据库不同，数据流具有如下特点：(1)数据总量的无限性；(2)数据到达的快速性；(3)数据到达次序的无约束性；(4)除非可以保存，每个元素均只能被处理一次。
     数据流的上述特点对数据流上的聚类挖掘提出了如下要求：首先，算法必须能够进行实时在线挖掘，快速处理每一个元组，并实时输出挖掘处理结果。其次，相对于无限规模的数据流内存通常是有限的，算法的空间复杂度要低，往往需要在数据量的对数范围内。再次，由于算法实时在线挖掘以及对空间复杂度的限制，算法往往只能得到近似解，且需要具有一定的精确度保证。最后，算法要具有较强的适应性，包括对数据流不断进化的底层模型的适应性，处理离群点的能力，以及挖掘任意形状簇的能力等。
     学术界已经对数据流上的聚类分析问题进行了不少研究工作，但仍存在许多问题尚待研究和解决。本文研究了滑动窗口内的数据流聚类分析问题，数据流中具有任意形状簇的挖掘问题，利用图形处理器加速数据流聚类问题以及分布式数据流的数据聚类问题，旨在为现有的数据流系统提供更为多样的聚类分析功能。本文的主要贡献有如下四个方面：
     1．本文提出了一种新算法CluWin来解决滑动窗口内数据流聚类分析问题。我们设计了一种新的概要结构—聚类特征指数直方图—来保持滑动窗口中簇的统计信息。CluWin算法仪需要维护O(k/∈log(∈[N/k]))个时间聚类特征结构，就能够估算长度为N的滑动窗口中所有记录的聚类结果，且窗口最大相对差不超过∈。此外，它还被扩展用于解决N-n窗口(滑动窗口扩展模型)数据聚类问题。
     2．本文提出了一种新算法DenStream用于挖掘进化数据流中具有任意形状的簇。我们引入一种“密”微簇称为核心微簇(core-micro-cluster)用于描述数据流中任意形状的簇，并提出潜在核心微簇(potential core-microcluster)和离群微簇(outlier micro-cluster)结构分别用于维护并区分数据流中潜在的簇和离群点。DenStream基于这些概念包含了一种新颖的淘汰策略，该策略可利用次线性空间的内存维护并保证各微簇权值的精度。
     3．本文利用性能强大、日趋廉价且在数据流领域尚未引起足够重视的图形处理器(GPU)处理数据流聚类挖掘问题。我们提出一类基于GPU的快速聚类方法，包括基于k-means的基本聚类方法，基于GPU的数据流聚类以及数据流簇进化分析方法。这些方法的共同特点就是充分利用GPU强大的处理能力和流水线特性。与以往具有独立框架的数据流聚类算法不同，基于GPU的聚类算法具有同一框架和多种聚类分析功能，为数据流聚类分析提供了统一平台。
     4．本文提出了一个分布式聚类处理框架CluDistream。该框架可高效地实时处理分布式数据流中海量数据，有噪声、有损或不完整数据记录，以及有交叠的数据集。在CluDistream基于期望最大化(Expectation Maximization)的算法中，每个数据记录可以以不同的隶属度属于不同的簇。这种软聚类方式能较好地反映簇的交叠性。对有噪声、损坏的或不完整的数据记录，算法可通过最大化数据簇的似然度来学习数据流的底层分布。此外，CluDistream算法中测试后聚类的策略可有效地减少算法的平均处理代价，这对分布式数据流的在线实时聚类挖掘非常有效。
     总之，本文研究了数据流聚类分析的四个基本问题并分别提出了新的解决方案。滑动窗口是处理数据流的基本模型之一，如何在滑动窗口内对数据流进行聚类分析是一个基本问题；具有任意形状簇相对于球形簇是更为一般的数据簇模型，如何挖掘任意形状的簇也是一个基本问题；如何提高数据流聚类算法的处理速度是一个基本问题，这是由数据流聚类算法实时在线挖掘的特点所决定的；分布式数据流的数据聚类问题，其基础性在于现实应用中数据流往往是在分布式环境中产生的。本文算法是对现有数据流上的聚类分析技术的有益补充和改进。理论分析和实验结果表明本文算法能够高效地解决相应问题，与现有数据流聚类方法相比，本文算法在存储空间开销、挖掘处理速度以及结果准确性上具有优势。
Recently, various applications generate a large number of streaming data, such as network flow, sensor data, and web page clicks. Mining and analyzing such kind of data has been becoming a hot topic. As a basic method of data mining, clustering in data stream settings has been wildly concerned from academia and industry. Different from traditional data bases, data streams have the following distinguished characteristics: (1)unbounded volume of data; (2)rapid arriving rate of data; (3)uncontrollability of tuples' arriving order; (4) being processed only once for each tuple, except being reserved.
    Above characteristics propose the following requirements on clustering over data streams: Firstly, algorithms should be online mining, be able to fast process each tuple, and output the mining result on time; Secondly, because of the limited memory compared to unlimited data streams, algorithms should have low space complexity, the space complexity should be in the range of log bound of data volume. Thirdly, by the constraint of online mining and low space complexity, algorithms can only get approximate results, however, that results should be guaranteed on precision. At last, algorithms should be adaptive, including the applicability on data streams with evolving underlying model, the abilities on handling outliers and arbitrary-shaped clusters.
    A lot of methods on stream clustering have been proposed, however there are many problems need to be researched and resolved. In this paper, we study the the problem of clustering data streams over sliding windows, the problem of discovery arbitrary-shaped clusters in data streams, the problem of clustering data streams via graphic processors and the clustering problem in distributed data stream settings, in order to provide the existing data stream systems more functions on clustering analysis. The main contributions of this thesis include the following aspects:
    1. We propose a novel method, CluWin, to cluster data streams over sliding windows, and a new synopsis named Exponential Histogram of Cluster Features
    to maintain the statistic information of clusters in sliding windows. CluWin algorithm can provide the clustering results in sliding windows ( window size = N) with relative error no more than e by maintaining O(k/ε log(ε[N/K]) Time Clustering Feature synopses. In addition, it has been extended to deal with clustering problems over the n-of-N sliding window (an extended model of sliding window).
    2. We propose a new approach, DenStream, to discover clusters with arbitrary shapes in evolving data streams, and introduce a dense micro-cluster(called core-micro-cluster) to describe the arbitrary-shaped clusters in data streams. In addition, we propose potential core-micro-cluster and outlier micro-cluster synopses to maintain and distinguish the potential clusters and outliers in data streams. Based on these concepts, DenStream includes a novel pruning strategy, which can maintain and ensure the precision of the weight of micro-clusters by memory space with log complexity.
    3. We utilize the powerful, cheaper and cheaper and not enough considered graphic processing units (GPUs) to cluster data streams. We propose a group of fast clustering methods via GPUs, including the basic k-means method, the stream clustering algorithm, and the evolving analysis method over data streams. The common characteristics of these methods are making use of the strong computational and pipeline power of GPUs. Different from pervious clustering methods with individual framework, our methods share the same framework with multi-function, which provides a uniform platform for stream clustering.
    4. We propose a new framework, CluDistream, for clustering distributed data streams, which can effectively handle a huge volume of data in real-time, the noisy, corrupted or incomplete data records, the overlapping properties of data set in distributed data streams. In CluDistream, the EM-based (Expectation Maximization) algorithms, each data record is assigned to a cluster with certain degree of membership. This soft clustering reflects more naturally the overlapping property of the clusters. In the presence of noisy, corrupted or incomplete data records, our algorithms learn the parameters governing the distribution of underlying data streams by maximizing the likelihood of the data clusters. Moreover, a test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams.
    All in all, we study four principal problems of clustering data streams and propose novel solutions for them in this thesis. Since sliding window is a basic model for processing data streams, how to cluster data streams over sliding windows becomes a basic problem; arbitrary-shaped clusters model is more general than spherical clusters model, therefore discovery arbitrary-shaped clusters in data streams is a basic problem; the characteristic of online mining makes how to improve the processing speed of algorithm a basic issue in clustering data streams; clustering distributed data streams is a basic problem, because in real life applications, the data streams are often generated in a distributed environment. Our methods are great complementarity and improvement to existing stream clustering methods. Theoretic analysis and experimental results show that our methods can solve their corresponding problems efficiently and outperform previous clustering methods in space complexity, processing rate and result quality.

引文

[1] Stream project, http://www-db.stanford.edu/stream.
    [2] P. Agarwal, S. Krishnan, N. Mustafa, and S. Venkatasubramanian. Streaming geometric optimization using graphics hardware. In Proc. of 11th European Symposium on Algorithms, pages 544-555, 2003.
    [3] C. C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proc. of SIGMOD, pages 575-586, 2003.
    [4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. of VLDB, pages 81-92, 2003.
    [5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proc. of VLDB, pages 852-863, 2004.
    [6] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In Proc. of KDD, pages 503-508, 2004.
    [7] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proc. of PODS, pages 10-20, 1999.
    [8] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20-29, 1996.
    [9] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. In Proc. of SIGMOD, pages 49-60, 1999.
    [10] A.P.Dempster, N.M.Laird, and D.B.Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal statistical Society, Series B, 39(1):1-38, 1977.
    [11] A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. In Proc. of PODS, pages 286-296, 2004.
    [12] I. T. Archive. trace dec-pkt-4. http://www.acm.org/sigcomm/ITA/.
    [13] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS, pages 1-16, 2002.
    [14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of PODS, pages 234-243, 2003.
    [15] B. Babcock and C. Olston. Distributed top-k monitoring. In Proc. of SIGMOD, pages 28-39, 2003.
    [16] G. Baciu, W. S.-K. Wong, and H. Sun. Recode: An image-based collision detection algorithm. Visualization and Computer Animation, 10(4):181-192, 1999.
    [17] D. Barbard. Requirements for clustering data streams. 2(3):23-27, 2002.
    [18] S. Bhattacharya and S. Moon. Network performance monitoring and measurement: Techniques and experience. In ACM SIGMETRICS tutorial, 2002.
    [19] P. Bradley, C. Reina, and U. Fayyad. Clustering very large databases using em mixture models. 2:2076-2080, 2000.
    [20] A. Bulut and A. K. Singh. A unified framework for monitoring data streams in real time. In Proc.of ICDE, 2005.
    [21] M. Burl, C. Fowlkes, J. Roden, A. Stechert, and S. Mukhtar. Diamond eye: A distributed architecture for image data mining. In SPIE DMKD, 1999.
    [22] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams-a new class of data management applications. In Proc. of VLDB, pages 215-226, 2002.
    [23] C.Bishop. Neural networks for pattern recognition. Oxford Univerity Press, 1995.
    [24] L. O. Chalaghan, N. Mishra, A. Meyerson, and S. Guha. Streaming data algorithms for high-quality clustering. In Proc. of ICDE, pages 685-694, 2002.
    [25] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. of ICALP, pages 693-703, 2002.
    [26] M. Charikar, L. O'Callaghan, and R. Panigrahy. Better streaming algorithms for clustering problems. In Proc. of STOC, pages 30-39, 2003.
    [27] P. Cheeseman and J. Stutz. Bayesian classification (autoclass): Theory and results. In Advances in Knowledge Discovery and Data Mining, pages 153-180, 1996.
    [28] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query system for internet databases. In Proc. of SIGMOD, pages 379-390, 2000.
    [29] G. Cormode, M. Datar, P. Indyk, and S.Muthukrishnan. Comparing data streams using hamming norms (how to zero in). In Proc. of VLDB, pages 335-345, 2002.
    [30] G. Cormode, M. Garofalakis, S. Muthukrishnan, and R. Rastogi. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proc. of SIGMOD, pages 25-36, 2005.
    [31] G. Cormode and S. Muthukrishnan. Radial histogram for spatial streams. In Technical Report 2003-11, DIMACS, 2003.
    [32] G. Cormode and S. Muthukrishnan. What's new: Finding significant differences in network data streams. In Proc. of IEEE INFOCOM, pages 1534-1545, 2004.
    [33] B. Dai, J. Huang, M. Yeh, and M. Chen. Clustering on demand for multiple data streams. In Proc. of ICDM, pages 367-370, 2004.
    [34] A. Das, S. Ganguly, M. Garofalakis, and R. Rastogi. Distributed set-expression cardinality estimation. In Proc. of VLDB, pages 312-323, 2004.
    [35] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of SODA, pages 635-644, 2002.
    [36] E. Demaine, A. Lopez-Ortiz, and J. Munro. Frequency estimation of internet packet streams with limited space. In Proc. of 10th Annual European Symposium on Algorithms, pages 348-360, 2002.
    [37] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of KDD, pages 71-80, 2000.
    [38] P. Domingos and G. Hulton. A general method for scaling up machine learning algorithms and its application to clustering. In Proc. of ICML, pages 106-113, 2001.
    [39] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of KDD, pages 226-231, 1996.
    [40] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. Incremental clustering for mining in a data warehousing environment. In Proc. of VLDB, pages 323-333, 1998.
    [41] W. Fan. Systematic data selection to mine concept-drifting data streams. In Proc. of KDD, pages 128-137, 2004.
    [42] J. Feigenbaum, R. Kannan, M. Strauss, and M. Viswanathan. An approximate l~l-difference algorithm for massive data streams. In Proc. of FOCS, pages 501-511, 1999.
    [43] G. Fung, J. Yu, P. Yu, and H. Lu. Parameter free bursty events detection in text streams. In Proc. of VLDB, pages 181 192, 2005.
    [44] G. H. G, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. of KDD, pages 97-106, 2001.
    [45] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Record, 34(2): 18-26, 2005.
    [46] V. Ganti, J. Gehrke, and R. Ramakrishnan. Demon: Mining and monitoring evolving data. In Proc. of ICDE, pages 439-448, 2000.
    [47] M. Garofalakis, J. Gehrke, and R. Rastogi. Querying and mining data streams: You only get one look. In Proc. of SIGMOD, page 635, 2002.
    [48] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining frequent patterns in data streams at multiple time granularities. In Proc. of NSF Workshop on Next Generation Data Mining, 2002.
    [49] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. of SIGMOD, pages 331-342, 1998.
    [50] P. B. Gibbons and Y. Matias. Synopsis data structures. In Proc. of SODA, pages 909-910, 1999.
    [51] A. Gilbert, S. Guha, Y. Kotidis, P. Indyk, S.Muthukrishnan, and M. Strauss. Fast, small space algorithm for approximate histogram maintenance. In Proc. of STOC, pages 389-398, 2002.
    [52] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One pass summaries for approximate aggregate queries. VLDB Journal, 79(8):79-88, 2001.
    [53] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proc. of VLDB, pages 454-465, 2002.
    [54] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1), 1997.
    [55] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In Proc. of SIGMOD, pages 215-226, 2004.
    [56] N. K. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proc. of SIGMOD, pages 611-622, 2005.
    [57] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. of SIGMOD, pages 58 66, 2001.
    [58] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. In IEEE Transactions on Knowledge and Data Engineering, pages 515-528, 2003.
    [59] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data stream. In Proc. of FOCS, pages 359-366, 2000.
    [60] S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In Proc. of SIGMOD, pages 73-84, 1998.
    [61] J. Han and M. Kamber. Data mining: Concepts and techniques. Morgan Kaufmann, 2001.
    [62] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Note 1998-011. Digital systems research center, Palo Alto, May 1998.
    [63] J. Hershberger and S. Suri. Adaptive sampling for geometric problems over data streams. In Proc. of PODS, pages 252-262, 2004.
    [64] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. of KDD, pages 58-65, 1998.
    [65] K. E. Hoff III, J. Keyser, M. Lin, D. Manocha, and T. Culver. Fast computation of generalized voronoi diagrams using graphics hardware. In Proc. of SIGGRAPH, pages 277-286, 1999.
    [66] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of FOCS, pages 189-197, 2000.
    [67] P. Indyk. Stream-based geometric algorithms. In Proc. of ACM/DIMACS Workshop on Management and Processing of Data Streams, 2003.
    [68] Y. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In Proc. of SIGMOD, pages 233-244, 1995.
    [69] A. Jain and R. Dubes. Algorithms for clustering data. New Jersey, 1998.
    [70] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. A CM Computing Surveys, 31(3):264-323, 1999.
    [71] E. Januzaj, H.-P. Kriegel, and M. Pfeifle. Towards effective and efficient distributed clustering. In Workshop on Clustering Large Data Sets ICDM2003, 2003.
    [72] C. Jin, W. Qian, C. Sha, J. Yu, and A. Zhou. Dynamically maintaining frequent items over a data stream. In Proc. of CIKM, pages 287-294, 2003.
    [73] R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In Proc. of KDD, pages 571-576, 2003.
    [74] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, and et. al. Vedas: A mobile and distributed data stream mining system for real-time vehicle monitoring. In Proc. of SIAM International Conference on Data Mining, 2004.
    [75] H. Kargupta, B. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. Mobimine: Monitoring the stock market from a pda. 3(2):37-46, 2002.
    [76] R. Karp, S. Shenker, and C. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. A CM Trans. on Database Systems, 28(1):51-55, 2003.
    [77] G. Karypis, E.-H. S. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68-75, 1999.
    [78] M. Khan, Q. Ding, and W. Perrizo. k-nearest neighbor classification on spatial data streams using p-trees. In Proc. of PAKDD, pages 517-518, 2002.
    [79] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proc. of VLDB, pages 180-191, 2004.
    [80] J. Kleinberg. Bursty and hierarchical structure in streams. In Proc. of KDD, pages 91-101, 2002.
    [81] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In Proc. of VLDB, pages 275-286, 2002.
    [82] S. Krishnan, N. H. Mustafa, and S. Venkatasubramanian. Hardware-assisted computation of depth contours. In Proc. of SODA, pages 558-567, 2002.
    [83] D. Lambert and J. Pinheiro. Mining a stream of transactions for customer patterns. In Proc. of KDD, pages 305-310, 2001.
    [84] E. S. Larsen and D. K. McAllister. Fast matrix multiplies using graphics hardware. In Proc. of IEEE Supercomputing, pages 55-55, 2001.
    [85] X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In Proc. of ICDE, pages 362-373, 2004.
    [86] X. Lin, Y. Yuan, W. Wang, and H. Lu. Stabbing the sky: Etticient skyline computation over sliding windows. In Proc. of ICDE, pages 502-513, 2005.
    [87] L. Liu, C. Pu, and W. Tang. Continual queries for internet scale eventdriven information delivery. IEEE Trans. on Knowledge and Data Engineering, 11(4):583-590, 1999.
    [88] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In Proc. of ICDE, pages 767-778, 2005.
    [89] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, pages 346-357, 2002.
    [90] L. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315-323, 1980.
    [91] S. Muthukrishnan. Data streams: Algorithms and applications. In Proc. of SODA (short paper), 2002.
    [92] S. Muthukrishnan, R. Shah, and J. S. Vitter. Mining deviants in time series data streams. In Proc. of the 16th International Conference on Scientific and Statistical Database Management, pages 41-50, 2004.
    [93] O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez. Tecno-streams: Tracking evolving clusters in noisy data streams with a scalable immune system learning model. In Proc. of ICDM, pages 235-242, 2003.
    [94] R. Neal and G. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 1999.
    [95] J. A. Nelder and R. Mead. A simplex method for function minimiztion. The Computer Jouranl, 7(4):308-313, 1965.
    [96] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Proc. of SIGMOD, pages 563-574, 2003.
    [97] C. Ordonez. Clustering binary data streams with k-means. In The 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery(DMKD), pages 12-19, 2003.
    [98] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. of SIGMOD, pages 294-305, 1996.
    [99] P. Rodrigues, J. Gama, and J. P. Pedroso. Hierarchical time-series clustering for data streams. In Proc. of the First International Workshop on Knowledge Discovery in Data Streams, pages 22-31, 2004.
    [100] A. Srivastava and J. Stroeve. Onboard detection of snow, ice, clouds and other geophysical processes using kernel methods. In Proc. of the ICML workshop on Machine Learning Technologies for Autonomous Space Applications, 2003.
    [101] C. J. Stone. A course in probability and statistics. Duxbury Press, 1995.
    [102] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The sequoia 2000 storage benchmark. In Proc. of SIGMOD, pages 2-11, 1993.
    [103] C. Sun, D. Agrawal, and A. E. Abbadi. Hardware acceleration for spatial selections and joins. In Proc. of SIGMOD, pages 455-466, 2003.
    [104] H. Sun, G. Yu, Y. Bao, F. Zhao, and D. Wang. Cds-tree: An effective index for clustering arbitrary shapes in data streams. In RIDE-SDMA, pages 81-88, 2005.
    [105] S.Venkatasubramanian. The graphics card as a stream computer. In Proc. of SIGMOD DIMACS, 2003.
    [106] S. Tanner, M. Alshayeb, E. Criswell, M. Iyer, and et. al. Eve: On-board process planning and execution. In Earth science technology conference, pages 11-14, 2002.
    [107] W. Teng, M. Chen, and P. S. Yu. A regression-based temporal pattern mining scheme for data streams. In Proc. of VLDB, pages 93-104, 2003.
    [108] C. J. Thompson, S. Hahn, and M. Oskin. Using modern graphics architectures for general-purpose computing: A framework and analysis. In Proc. of IEEE/ACM International Symposium on Microarchitectures, pages 306-317, 2002.
    [109] N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. Smem algorithm for mixture models. Neural Computation, 12(9):2109-2128, 2000.
    [110] V.V. Vazirani. Approximation algorithms. Berlin, New York, Springer, 2001.
    [111] J.S. Vitter. Random sampling with a reservoir. 11(1):37-57, 1985.
    [112] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept drifting data streams using ensemble classifiers. In Proc. of KDD, pages 226-235, 2003.
    [113] W. Wang, J. Yang, and R. Muntz. STING+: An approach to active spatial data mining. In Proc. of ICDE, pages 116-125, 1999.
    [114] W. Wang, J. Yang, and R. R. Muntz. Sting: A statistical information grid approach to spatial data mining. In Proc. of VLDB, pages 186-195, 1997.
    [115] J. Yang. Dynamic clustering of evolving streams with a single pass. In Proc. of ICDE, pages 695-697, 2003.
    [116] Y. Yao and J. Gehrke. Query processing for sensor networks. In Proc. of CIDR, 2003.
    [117] J. X. Yu, Z. Chong, H. Lu, and A. Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proc. of VLDB, pages 204-215, 2004.
    [118] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In Proc. of SIGMOD, pages 103-114, 1996.
    [119] A. Zhou, Z. Cai, L. Wei, and W. Qian. M-kernel merging: Towards density estimation over data streams. In Proc. of DASFAA, pages 285-292, 2003.
    [120] Y. Zhu and D. Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In Proc. of VLDB, pages 358-369, 2002.
    [121] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In Proc. of KDD, pages 336-345, 2003.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700