Research on Amnesic-Feature-Based Synopsis Structures for Data Streams and Their Applications
Abstract
With the increasingly widespread use of computer networks and electronic devices, more and more data arrives in the form of continuous streams: network routing information, real-time signals collected by sensor networks, real-time records of securities trading, credit-card transactions, and retail purchases, click streams on Internet sites, call records in telecommunication networks, and real-time text streams from chat rooms and short messages. Such data is called streaming data, or a data stream. Because data streams differ fundamentally from the data handled by traditional database systems, researchers have had to study data stream models and processing methods in depth.
     The key to data stream processing is to build a synopsis of the stream with a single-pass algorithm, so that approximate results can be derived from the synopsis at any time. Data amnesia, the decaying significance of older data, is an important characteristic of data streams and should be fully exploited when constructing stream synopses. This dissertation exploits this amnesic feature and proposes a framework for amnesic stream synopses called Hierarchical Amnesic Synopses (HAS). With the HAS structure, synopsis construction methods that ignore the amnesic feature can be converted into methods that incorporate it. The dissertation applies HAS to the major classes of stream synopses (histograms, sampling, wavelets, sketches, and random projections) and presents several typical applications.
     The main contributions of this dissertation include:
     (1) A general framework for data stream synopses, the HAS structure. The framework embeds the amnesic feature of data streams and can control both the forgetting speed and the reconstruction error. With this framework, many existing stream synopsis structures can be adapted to handle the amnesic feature.
     (2) A wavelet-based HAS structure (W-HAS). A method for merging wavelet synopses is proposed, W-HAS is discussed under both the sum-of-squared-error (sse) and the maximum-absolute-error (max_abs) metrics, and methods for controlling the reconstruction error in W-HAS are given.
     (3) A HAS structure based on weighted random sampling (WS-HAS), with maintenance algorithms designed for weighted random sampling both with and without replacement.
     (4) A structure combining HAS with histogram-based compression (H-HAS). An implementation of H-HAS with equi-width histograms is discussed, and an implementation with optimal histograms is realized via dynamic programming.
     (5) Based on the W-HAS structure, the approximate distance between streams and the computation of cluster centers are discussed, leading to a K-means clustering method for parallel data streams, W-HAS-clustering. In addition, exploiting the amnesic feature and using random projections, a hierarchical stream synopsis based on random projections (RP-HAS) is constructed, together with a maintenance method for the RP-HAS of normalized streams. A clustering method for parallel data streams based on RP-HAS, RP-HAS-clustering, is also proposed.
     (6) An implementation of the HAS structure for high-dimensional data streams, applied to the classification and clustering of data streams.
     (7) A sketch-based stream synopsis, the EFM sketch, used to estimate the similarity of sets. On top of the HAS structure, the EFM sketch is applied to analyze data similarity and evolution in streams.
With the increasingly widespread use of computer networks and electronic equipment, data in many real-life applications appears as dynamic, evolving data streams that are continuous and unbounded in nature. Such applications include stock markets, network traffic monitoring, sensor networks, the Internet, network security, data mining, financial monitoring, and many more. Traditional database techniques can hardly be applied directly to such high-speed, unbounded data streams, so researchers need to work out novel processing techniques for data streams.
     Maintaining a synopsis data structure dynamically over a data stream is vital for a variety of streaming applications, such as approximate query answering and data mining. In many cases, the significance of a data item in a stream decays with age: the item may convey critical information when it first arrives but, as time goes by, becomes less and less important until it is eventually useless. This feature is termed amnesic. The dissertation proposes Hierarchical Amnesic Synopses (HAS), which incorporate the amnesic feature of data streams into the generation of synopses. HAS can provide a better approximate representation of amnesic data streams than conventional stream synopses.
     Our major contributions in the dissertation are as follows.
     1. The dissertation proposes the HAS structure to exploit the amnesic feature of data streams. The HAS structure can control both the forgetting speed and the reconstruction error.
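As a hedged illustration of the hierarchical amnesic idea only (not the dissertation's actual HAS algorithm, whose levels, merge rule, and error control are more elaborate), the sketch below keeps recent values at full resolution and folds older values into exponentially coarser averages, so the summary size stays logarithmic in the stream length:

```python
class AmnesicSynopsis:
    """Toy hierarchical amnesic summary: level i holds averages over
    progressively larger spans, so older data is kept at coarser
    resolution while recent data stays exact."""

    def __init__(self, capacity_per_level=2):
        self.capacity = capacity_per_level
        self.levels = []  # levels[i] is a list of (mean, count), newest first

    def append(self, x):
        self._insert(0, (float(x), 1))

    def _insert(self, i, item):
        if i == len(self.levels):
            self.levels.append([])
        self.levels[i].insert(0, item)
        if len(self.levels[i]) > self.capacity:
            # merge the two OLDEST items of this level into one
            # coarser item and promote it one level up
            (m1, c1) = self.levels[i].pop()
            (m2, c2) = self.levels[i].pop()
            merged = ((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)
            self._insert(i + 1, merged)

    def reconstruct(self):
        # expand every (mean, count) back to count values, oldest first
        out = []
        for lvl in reversed(self.levels):
            for mean, cnt in reversed(lvl):
                out.extend([mean] * cnt)
        return out
```

Feeding the stream 0..7 yields a reconstruction whose oldest four values collapse to one average, while the two newest survive exactly, which is the amnesic behavior in miniature.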
     2. The Discrete Wavelet Transform is often used to construct synopses for streaming data. We propose a Wavelet-based Hierarchical Amnesic Synopses (W-HAS). To maintain W-HAS online over evolving streams, the dissertation first explores how to merge two wavelet decompositions and then, building on that merge, implements the addition of data nodes to the W-HAS structure, which lets W-HAS grow dynamically and hierarchically. Construction methods for W-HAS under the sum-of-squared-error (sse) and maximum-absolute-error metrics are discussed, and W-HAS with error control is also explored.
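The merge of two wavelet decompositions mentioned above can be shown concretely for the unnormalized Haar transform: the decomposition of two concatenated, equal-length segments consists of the per-level details of both segments plus one new coarsest detail and average. This is a minimal sketch under that assumption; the W-HAS merger in the dissertation additionally handles coefficient thresholding and the two error metrics, which are omitted here:

```python
def haar_decompose(signal):
    """Unnormalized Haar decomposition of a power-of-two-length signal.
    Returns (overall_average, details) where details[0] is the finest
    level and details[-1] the coarsest (a single coefficient)."""
    avgs = list(map(float, signal))
    details = []
    while len(avgs) > 1:
        details.append([(avgs[i] - avgs[i + 1]) / 2
                        for i in range(0, len(avgs), 2)])
        avgs = [(avgs[i] + avgs[i + 1]) / 2
                for i in range(0, len(avgs), 2)]
    return avgs[0], details

def haar_merge(left, right):
    """Merge the decompositions of two adjacent, equal-length segments
    into the decomposition of their concatenation, without touching
    the raw data again."""
    (a1, d1), (a2, d2) = left, right
    assert len(d1) == len(d2), "segments must have equal length"
    merged = [x + y for x, y in zip(d1, d2)]  # per-level: left then right
    merged.append([(a1 - a2) / 2])            # new coarsest detail level
    return (a1 + a2) / 2, merged
```

Merging the decompositions of the two halves of a signal reproduces the decomposition of the whole signal exactly, which is what lets a hierarchical synopsis grow by node addition instead of re-transforming the stream.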
     3. We propose Weighted-Sampling-based Hierarchical Amnesic Synopses (WS-HAS). Construction methods of WS-HAS for weighted random sampling with and without replacement are discussed.
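A classical one-pass scheme for weighted random sampling without replacement over a stream is the key-based reservoir method of Efraimidis and Spirakis; the sketch below illustrates that building block, not the WS-HAS maintenance algorithm itself:

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=random):
    """One-pass weighted random sampling without replacement:
    each (value, weight) item draws key u**(1/weight) with u uniform
    in (0,1); the k items with the largest keys form the sample."""
    heap = []  # min-heap of (key, value); heap[0] is the weakest keeper
    for value, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, value))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, value))
    return [v for _, v in heap]
```

An item's inclusion probability grows with its weight, so a single very heavy item is almost always retained, which is easy to check empirically.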
     4. We propose Histogram-based Hierarchical Amnesic Synopses (H-HAS). Construction methods of H-HAS using equi-width and V-optimal histograms are discussed.
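The V-optimal histogram is classically computed with dynamic programming over prefix sums; this is a minimal offline sketch of that standard O(n²·B) recurrence (the H-HAS construction in the dissertation maintains such histograms hierarchically over the stream, which is not shown here):

```python
def v_optimal_histogram(data, num_buckets):
    """Partition data into num_buckets contiguous buckets minimizing
    total squared error against each bucket's mean, via the classic
    dynamic program with O(1) per-interval SSE from prefix sums."""
    n = len(data)
    p = [0.0] * (n + 1)   # prefix sums
    pp = [0.0] * (n + 1)  # prefix sums of squares
    for i, x in enumerate(data):
        p[i + 1] = p[i] + x
        pp[i + 1] = pp[i] + x * x

    def sse(i, j):  # squared error of one bucket covering data[i:j]
        s, ss, cnt = p[j] - p[i], pp[j] - pp[i], j - i
        return ss - s * s / cnt

    INF = float("inf")
    # cost[b][j] = minimum SSE covering data[:j] with b buckets
    cost = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i
    # walk the cut table back to recover bucket boundaries
    bounds, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    return cost[num_buckets][n], bounds[::-1]
```

On data with three constant runs, three buckets recover the runs exactly with zero error.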
     5. Clustering is useful for analyzing parallel data streams. Using W-HAS and RP-HAS (Random-Projection-based HAS), respectively, approximate distances between streams and cluster centers can be computed quickly, and an efficient online version of the classical K-means clustering algorithm is developed.
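The distance computation behind RP-HAS-style clustering rests on Johnson-Lindenstrauss random projections: projecting every stream's window with the same random ±1 matrix approximately preserves pairwise Euclidean distances, so K-means can run on short sketches. A minimal sketch under that assumption (the function names are illustrative, not the dissertation's API):

```python
import math
import random

def random_projection_sketch(series, k, seed=0):
    """Project a length-n series onto k shared random +-1 directions
    (Achlioptas-style), scaled by 1/sqrt(k). Using the same seed for
    every stream means all streams share one projection matrix, so
    Euclidean distances between sketches approximate the originals."""
    rng = random.Random(seed)
    n = len(series)
    return [sum(series[i] * rng.choice((-1.0, 1.0)) for i in range(n))
            / math.sqrt(k)
            for _ in range(k)]

def dist(a, b):
    """Euclidean distance, usable on raw series or on sketches."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

K-means over streams then replaces every distance call on length-n windows with a call on length-k sketches; relative ordering of distances is preserved with high probability, which is all the assignment step needs.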
     6. We propose HD-HAS (High-Dimensional HAS) and apply it to the classification and clustering of data streams.
     7. We introduce a novel sketch, the EFM sketch, which can be used to estimate the similarity of two sets. Based on the HAS structure, we discuss a method for analyzing similarity and evolution in data streams.
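The EFM sketch itself is this dissertation's contribution and is not reproduced here. As a hedged illustration of the Flajolet-Martin idea it builds on, the sketch below estimates distinct counts with OR-mergeable bitmaps; because two sketches OR together into a sketch of the set union, set similarity follows by inclusion-exclusion:

```python
import hashlib

class FMSketch:
    """Flajolet-Martin distinct-count sketch with m bitmaps
    (stochastic averaging). Sketches of two sets OR-merge into a
    sketch of their union, enabling similarity estimation."""
    PHI = 0.77351  # standard FM bias-correction constant

    def __init__(self, num_maps=64):
        self.m = num_maps
        self.bitmaps = [0] * num_maps

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j, rest = h % self.m, h // self.m
        r = (rest & -rest).bit_length() - 1 if rest else 63  # trailing zeros
        self.bitmaps[j] |= 1 << r

    def estimate(self):
        total = 0
        for b in self.bitmaps:
            r = 0                      # position of the lowest unset bit
            while b & (1 << r):
                r += 1
            total += r
        return self.m / self.PHI * 2 ** (total / self.m)

    def union(self, other):
        out = FMSketch(self.m)
        out.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return out

def jaccard(sa, sb):
    """Estimate |A∩B| / |A∪B| from two FM sketches by inclusion-exclusion."""
    da, db, du = sa.estimate(), sb.estimate(), sa.union(sb).estimate()
    return max(0.0, (da + db - du) / du)
```

Identical sets give a similarity of exactly 1 (the union sketch equals each input), while disjoint sets give an estimate near 0, up to the sketch's sampling error.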
