基于小波变换的流数据压缩算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
近几年来,随着网络通信技术的快速发展滋生了大量的流数据。许多实时的应用系统面对的都是在线的、持续的数据流。流数据海量无限的特性决定了我们无法用传统的存储方式将其完全保存,此外不经处理完全传输这些数据会占用大量有限的网络带宽,造成网络阻塞。因此,对流数据进行压缩处理显得尤为重要,具有现实意义。
     本文围绕数据流时间序列错位相似性、聚类压缩、多小波变换三个方面进行了深入研究。主要成果包括:
     (1)基于动态时间弯曲技术的数据流处理方法。将一段时间内采集到的流数据作为一个时间序列来进行处理。由于同一时间段内数据流变化的影响因素基本相同,导致一些数据流变化存在错位相似,具体表现为数据流形状大致相同,但在时间上有所超前或延迟。对于这种错位相似的数据流采用常用的欧几里得测度法是无法识别的,而采用动态时间弯曲技术却可以很好地判断数据流的这种相似性。本文在采用动态时间弯曲路径法得到两个时间序列对应点的基础上提出了用预测法估计两个时间序列的关系,从而确定时间序列最佳匹配点的算法。
     (2)基于多元时间序列相似性聚类压缩算法。首先采用动态时间弯曲距离分析数据流之间的相似关系,根据相似程度进行模糊聚类,接着选取各聚类中心作为特征流时间序列,最后保存每个聚类的数据流编号、特征数据流序列的小波系数和其它数据流序列与特征流序列的匹配点对和关系系数作为压缩数据。之后结合上一章的最佳匹配点算法给出了数据还原的算法。从仿真实验结果可以看出,该算法能有效压缩数据流,较采用欧式距离测度能更好地提高数据压缩的精度。
     (3)基于多小波变换的流数据压缩算法。将多属性数据流进行多小波变换后原数据流被分解为四个不同空间方向和不同分辨率的子数据矩阵,每个子矩阵又可以进一步进行多小波变换分解,流数据能量绝大部分汇聚于低频矩阵。根据这一特点对变换后的小波系数进行编码压缩从而达到压缩数据流的目的。从实验结果看,该算法压缩率高,并且能够很好地保存数据特征,还原后的数据能基本再现原数据流。
Many streaming data are generated as the rapid development of network communication technologies in recent years. Lots of real-time application systems need to deal with online and continuous streaming data. Streaming data can hardly be saved completely in conventional way because of their characteristic of infinite. And transmitting streaming data takes much in the limited network bandwidth, so it is very important to compress streaming data.
     This paper focuses on time series data streams dislocation similarity, clustering compression and multi-wavelet transform, including:
     First, data stream process method based on dynamic time warping technology. The streaming data collected in a certain period is processed as a time series. Since in the same period the factors causing data streams changes are approximately the same, so there exists dislocation similarity among data streams waves. It behaves as, the, data streams waves are similar to each other in shape, but forward or backward in time. Trandition Euclidean distance measure method can not identify the similarity of dislocation data streams, but dynamic time warping technology can do well. Based on the dynamic time warping path method to calculate corresponding points of two time series, an algorithm using prediction to estimate relationship of two time series and then find out their best match point is proposed in this paper.
     Second, clustering compression algorithm based on similarity of multivariate time series. Firstly, analyze the similarity among data streams with dynamic time warping distance; then do fuzzy clustering according to the similarity degree and select cluster center as the characteristic time series; lastly, save the data stream numbers for each cluster, the wavelet coefficients of characteristic time series and the match point and coefficient between every other data stream series and characteristic series as compression data. The algorithm to decompress data streams based on the best match point algorithm is also given. Simulation shows, this compression algorithm is effective, and gains higher precision than Euclidean distance measure method.
     Third, streaming data compression algorithm based on multi-wavelet transform. A multiattribute data stream can be broken into four sub-data matrixes in different dimension and resolution by multi-wavelet transform, and every sub-data matrix can also be decomposition. The energy of data stream mostly concentrated in low-frequency matrix. This feature is used to encode and compress the wavelet coefficients in order to compress streaming data. Experiment results show, this compression algorithm has a high compression rate, and stores characteristics of the data perfectly which ensures the original data stream can be recovered approximately after decompression.
引文
[1]Henzinger M R, Raghavan P, Rajagopalan S. Computing on data streams. In: SRC Technical Note 1998-011. Palo Alto, California:Digital systems research center,1998
    [2]孙玉芬,卢炎生.流数据挖掘综述.计算机科学,2007,34(1):1-5
    [3]Papadiiitriou S, Sun J, Faloutsos C. Streaming Pattern Discovery in Multiple Time-Series. In:Proc of the 31st VLDB Conference.2005:697-708
    [4]Babcock B, Babu S, Datar M, et al. Models and issues in data stream systems. In: Popa L.ed. Proc. of 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Madison:ACM Press,2002:1-16
    [5]Chandrasekaran S, Cooper O, Deshpande A, et al. TelegraphCQ, Continuous Dataflow Processing for an Uncertain World. In:Proc of the 2003 CIDR Conference,2003
    [6]Abadi D J, Carney D, Cetintemel U, et al. Aurora:A New Model and Architecture for Data Stream Manageent. The Intl Journal on Very Large Data Bases,2003,12(2):120-139
    [7]Cai Y D, Clutter D, Pape G, et al. MAIDS:Mining Alarming Incidents from Data Streams. In:Proc of the 2004 ACM SIGMOD Intl Conf on Management of data,2004:919-920
    [8]Muthukrishnan S. Data streams:Algorithms and applications. In:Proc of the fourteenth annual ACM-SIAM symposium on discrete algorithms,2003:413
    [9]金澈清,钱卫宁,周傲英.流数据分析与管理综述.软件学报,2004,15(8):1172-1181
    [10]Terry D, Goldberg D, Nichols D, et al. Continuous queries over append-only databases. In:Proc ACM SIGMOD International Conference on Management of Data. San Diego, CA,1992,21(2):321-330
    [11]Avnur R, Hellerstein J. Eddies:Continuously adaptive query processing. In: Proc. of the 2000 ACM SIGMOD Intl. Conference on Management of Data. Dallas:ACM Press,2000:261-272
    [12]Hellerstein J, Franklin M, Chandrasekaran S, et al. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin,2000,23(2):7-18
    [13]Carney D, Cetinternel U, Cherniack M, et al. Monitoring streams-A new class of DBMS applications. In:Brown Computer Science Technical Report, TR-CS-02-04. Providence:Department of Computer Science, Brown University, 2002
    [14]HE Zeng-you, XU Xiao-fei, DENG Sheng-chun. Squeezer:An efficient algorithm for clustering categorical data. Journal of Computer Science and Technology,2002,17(5):611-624
    [15]Portnoy L, Eskin L, Stolfo S J. Intrusion detection with unlabeled data using clustering. In:Proc of ACM CSS Workshop on Data Mining Applied to Security. Philadelphia, PA,2001
    [16]O'Callaghan L, Mishra N, Meyerson A, et al. Streaming-data algorithms for high-quality clustering. In:Proc of IEEE International Conferrnce on Data Engineering.2002
    [17]Guha S, Mishra N, Motwani R, et al. Clustering data streams. In:Proc of IEEE Symposium on Foundations of Computer Science (FOCS'00).2000:71-80
    [18]Guha S, Meyerson A, Mishra N, et al. Clustering data streams:Theory and practice. Knowledge and Data Engineering, IEEE Transactions,2003,15(3): 515-528
    [19]M.Khan, Q. Ding, W. Perrizo, et al. K-nearest Neighbor Classification on Spatial Data Stream Using P-trees.In:Proc of the PAKDD. Taipei, Taiwan,2002: 517-528
    [20]Zhang T, Ramakrishnan R. BIRCH:A efficient data clustering method for very large database. In:Proc of ACM SIGMOD Conference on Management of Data. Montreal, Canada:1996
    [21]Aggarwal C, Han J, Wang J, et al. A framework for clustering evolving data stream. In:Proc of Int Conf on Very Large Data Base (VLDB'03). Berlin, Germany,2003:81-92
    [22]朱蔚恒,印鉴,谢益煌.基于数据流的任意形状聚类算法.软件学报,2006,17(3):379-387
    [23]Domingos P, Hulten G. Mining high-speed data streams. In:Proc of ACM SIGKDD Int Conf Knowledge Discovery in Databases (KDD'00).2000:71-80
    [24]Hulten G, Spencer L, Domingos P. Mining time-changing data streams.In:Proc of ACM SIGKDD Int Conf Knowledge Discovery in Databases (KDD'01).2001
    [25]Giannella C, HAN Jia-wei, JIAN Pei, et al. Mining frequent patterns in data streams at multiple time granularities.In:Proc of the NSF Workshop on Next Generation Data Mining,2002
    [26]M. Charikar, K. Chen, M. Farach-Colton. Finding Frequent Items in Data Streams. International Colloquium on Automata, Languages and Programming, 2002:508-515
    [27]Charikar, Chen K, Farach-Colton M. Finding frequent items in data streams. Theoretical Computer Science,2004:3-15
    [28]JACOB ZIV, ABRAHAMLEMPEL. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory,1977, IT-23(3): 337-343
    [29]高腾蛟,高军,杨冬青,等.面向XPath执行的XML数据流压缩方法.软件学报,2005,16(5):869-877
    [30]刘向宇,王雅哲,杨晓春等.面向无线传感器网络的流数据压缩技术.计算机科学,2007,34(2):141-143
    [31]Nassar S, Sander J. Effective Summarization of Multi-Dimensional Data Streams for Historical Stream Mining. In:19th International Conference on Scientific and Statistical Database Management. Banff, AB, Canada,2007: 30-39
    [32]Dula S, Kim C, Shim K. XWAVE:Optimal and approximate extended wavelets for streaming data. In:Nascimento MA, Kossmann D, eds. Proc. of the 30th Int1. Conference on Very Large Data Bases. Toronto:Morgan Kaufmann Publishers, 2004:288-299
    [33]Gilbert AC, Kotidis Y, Muthukrishnan S, et al. One-Pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering,2003, 15(3):541-554
    [34]Chen AL, Tang CJ, Yuan CA, et al. Mining correlations between multi-streams based on haar wavelet. In:Sui GL. Vianu V eds. Advances in Computer Science: ASIAN 2005 the 10th Asian Computing Science Conf. Kunming: Springer-Verlag.2005:270-271
    [35]陈安龙,唐场杰,元昌安等.基于小波和偶合特征的多数据流压缩算法.软件学报.2007,18(2):177-184
    [36]冉启文,谭立英.小波分析与分数傅立叶变换及其应用.北京:国防工业出版社,2002:1-5
    [37]S.Mallat. Multi-frequency channel decomposition of images and wavelet model. IEEE Transactions,1989
    [38]I.Daubechies. Ten lectures on wavelets. SLAM, Philadelphia, PA,1992
    [39]A.Cohen, I.Daubechies, J. C. Feauveau. Biorthogonal bases of compactly supported wavelets.Commun on Pure and Appl,1992,4:1-10
    [40]C. Chui, J. Z. Wang. Acandinal spline approach to wavelets. Proc. Amer,1991, 113:785-793
    [41]梁晓鸣.基于小波变换的图像压缩技术研究:哈尔滨理工大学.2008,21-23
    [42]沈兰荪,卓力.小波编码与网络视频传输.北京:科学出版社,2005:46-47
    [43]Berndt D, Clifford J. Using dynamic time warping to find patterns in time series. Working Notes of the Knowledge Discovery in Databases Workshop,1994: 359-370
    [44]温颖均,朱仲英.基于动态时间弯曲的时序数据聚类算法的研究.计算机仿真,2004,21(3):37-40
    [45]Ian H.Witten, Eibe Frank.数据挖掘实用机器学习技术.董琳等译.第二版.机械工业出版社,2006,169-180
    [46]杨纶标,高英仪.模糊数学原理及应用.第四版.华南理工大学出版社,2006,68-77
    [47]Goodman T, Lee S. Wavelets of multiplicity. Trans on Amer Math Soc,1994, 342:307-324
    [48]Geronimo J S, Hardin D P, Massopust P R. Fractal functions and wavelet expansions based on several scaling functions. J of Approx Theory,1994,78: 373-401
    [49]Strang G, Strela V. Short wavelets and matrix dilation equations. IEEE Trans on SP,1995,43:108-115
    [50]Strela V. Multiwavelets:theory and application:[Ph.D].Massachusetts Institute of Technology,1996
    [51]Rieder P, Cotze J, Nossek J A. Multiwavelet transforms based on several scaling functions. In:Proceedings of IEEE Int Symp on Time-Fre quency and Time-Scale Analysis. Philadelphia.1994
    [52]Lebrun J, Vetterli M. Balanced multiwavelets theory and design. IEEE Trans Signal Processing,1998,46(4):1119-1125
    [53]Jiang Q T. Orthogonal multiwavelets with optimum time-frequency resolution. IEEE Trans Signal Processing,1998,46(4):830-844
    [54]Jia, Rongqing Riemenschneider, Zhou Dingxuan. Vector subdivision schemes and multiple wavelets. Math, of Coputation,1998,67(224):1533-1563
    [55]Hwee Huat Tan. Shen Lixin. Tham J Y. New Biorthogonal multiwavelets for image compression. Signal Processing,1999:45-65
    [56]Shen Lixin, Hwee Huai Tan. On a family of orthonormal scalar wavelets and related balanced multiwavelets. IEEE Trans on Signal Processing,2001,49(7): 1447-1453
    [57]黄卓君,马争鸣.CL多小波图像编码.中国图像图形学报,2001,6(7):662-668
    [58]Xia X G. A new prefilter design for discrete multiwavelet transforms. IEEE Trans Signal Processing,1998,46(6):1558-1570
    [59]Miller J T, Li C C. Adaptive multiwavelet initialization. IEEE Trans Signal Processing,1998,46(12):3282-3292
    [60]G.Strang and V. and V. Strela Short wavelets and matrix dilation equations. IEEE Trans Signal Processing,1995,43:108-115
    [61]Jerome M. Shapiro. Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Trans on Signal Processing.1993,41(12):3445-3462