数据流连续查询的自适应降裁策略研究

英文题名：Research of Auto-adapted Load Shedding Algorithm on Data Stream Inquires Continuously
作者：吴亚娟
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据流 ; 数据流预处理 ; 连续查询 ; 降载 ; 算子 ; 阈值
英文关键词：data streams ; data stream pre-processing ; continuous query ; load shedding ; operator ; threshold
学位年度：2010
导师：马瑞民
学科代码：081202
学位授予单位：大庆石油学院
论文提交日期：2010-03-27

摘要

有关数据流连续查询的自适应降载方法研究是近几年数据流研究领域的核心内容之一。虽然,国内外的一些学者在此方面已经做了大量研究和实验工作,也取得了一些研究成果,但也出现了一些新的问题。例如,目前的数据流查询处理技术主要考虑滑动窗口的数据,但这些研究工作都基于滑动窗口的大小固定不变,然而在很多实际应用中却存在内存资源浪费、查询结果达不到精确要求等问题。现有的基于窗口运算的降载方法大多是针对主存的限制展开研究的,需要研究CPU计算能力的限制等问题。
     本文在对以上问题进行了比较深入地分析、研究的基础上,首先提出了一种新的数据流连续查询方法,即一种基于变窗口的数据流连续查询方法。通过在缓存中加入窗口控制器实现变窗口技术。当内存资源不足或者数据流流速过大时,根据用户提出查询具体问题、具体数据情况,窗口分配算子对流入的数据流按照时间序列进行大小不等的划分,匹配算子根据查询计划与不同大小的滑动窗口进行相似模式匹配,使其与查询对象特征相对应。从而解决了内存资源浪费与查询结果的精确性问题。同时,针对CPU过载问题,本文提出了一种新颖的局部降载方法,即通过滑动窗口与运算对数据流上相邻的滑动窗口中输出速率较大的基本窗口进行合并处理,有效地解决了对数据流进行降载处理所面临的降载时机和降载方法等基本问题。最后,将所提出的降载方法进行了初步应用。
     由于数据预处理技术在数据流挖掘中具有十分重要的作用,对数据流进行预处理可以改进数据流的质量,也可以有效地预防数据流过载情况,有助于提高挖掘过程的精度和性能。本文针对数据流具有的特征,借鉴和引用已有静态数据集预处理的方法,提出了采用抽样、滑动窗口模型等方法构造概要数据结构时,将得到的数据流数据的样本进行再处理的思想,以进一步减小挖掘算法的时间和空间复杂度。
The method of auto-adapted load shedding on data stream inquires continuously is becoming one of the core contents in the data stream research area in recent years. In this aspect,some scholars of the domestic and foreign have already done the massive research and experimental work,as well as obtained some research results,but also had some new problems.For example,at present,data stream inquiry processing technology main consideration sliding window data,but these research work all based on sliding window which size fixed invariable,however,it will lead to the memory resources wasted, and the inquiry result can’t match the accuracyin many practical applications.these methods of load shedding which based on the window operation were mainly researched on the limitation of main memory. Need to study the issues of the computing power in CPU.
     Firstly,in this essay ,based on the analysis and research of inquires continuously and load shedding to the data stream combined with related technology,a method of data stream inquires continuously which based on the variable size windows was proposed.That is putting a window controller in the cache,to achieve the variable window technology,When the memory usage is not full.According to the user’s questions and the speed of data stream flow, The assignment operators will change the data streams’size in term of time-lists. at one time,the Matching operator carry through similar pattern matching,to make it correspond with the query characteristics The method solves the issues of memory resources wasted and the accuracy of query result.Secondly,for the sake of solved the CPU overload situation.In this essay,an original method by combine some basic windows with more larger output rate which are included in the sliding windows on the adjacent over data stream can easily solved the problem.Finally,application has been conducted.
     Pre-processing technology play a very important role in data stream mining,it can improve the quality of data stream and effectively prevent data stream overloading,it also help to improve the precision and performance in the process of mining.In this essay, based on the characteristics of the data streams,use the experience of the preprocessing methods of static data sets.A method was proposed to reprocessing the data stream which have gotten by sampling method and sliding window model method.This method will reduce the spatial complexity and temporal complexity.

引文

[1] N.Tatbul,U.Cetintemel,S.Zdonik,et al. Load shedding in a data stream manager.Johann Christoph Freytag,Peter C.Lockemann,Serge Abiteboul,Michael J.Carey,PatriciaG. Selinger Andreas Heuer(Eds).VLDB’03,Berlin Germany,2003:309-320.
    [2] A.Das,J.Gehrke,M.Riedewald.Approximate Join Processing over Data Streams.In ACM SIGMOD Conference,San Diego,CA,June 2003:40-51.
    [3] J.Kang,J.Naughton,S.Viglas.Evaluating Window Joins over Unbounded Streams.In IEEE ICDE Conference,Bangalore,India,March 2003:341-352.
    [4] Motwani,J.Widom,A.Arasu,B.Babcock,S.Babu,M.Datar,G.Manku,C.Olston,J.Rosenstein, R.Varma.Query Processing,Approximation,and Resource Management in a Data Stream Management System.In CIDR Conference,Asilomar,CA,January 2003:245-256.
    [5] M.A.Hammad,M.J.Franklin,WG.Aref and,A.K.Elmagarmid.Schedulingfor shared window joins overdatastreams.JohannChristoph Freytag,Peter C.Lockemann,Serge Abiteboul,Michael J.Carey,PatriciaG..Selinger,And reas Heuer(Eds).VLDB’03,Berlin Germany,2003:297~308.
    [6] B.Babcock,M.Datar, R.Motwani.Load shedding for aggregation query over data streams. The 20th International Conference on Data Engineering (ICDE’04),Boston,MA, USA,30 March-2 April 2004. Boston:IEEE Computer Society 2004:350~361.
    [7] U.Srivastava,J.Widom.Memory-limited Execution of Window Stream Joins.Processsing of the international Conference on Very Large Databases (VLDB),2004:324-335.
    [8]金澈清,钱伟宁,周傲英.流数据分析与管理综述[J].软件学报,2004(8):1172~1181.
    [9] Xuan,HongDong,Wee Keong Ng,Kok Leong Ong.Adaptive Load Shedding for Mining Frequent Patterns from Data Streams.www3.it.deakin.edu.au/~leong/papers/dawak06-1.pdf. 2006.
    [10] Yun Chi,Philip S.Y,Haixun Wang,Richard R.Muntz.Loadstar:A Load Shedding Scheme for Classifying Data Streams.In SIAM Conference,2005:346-357.
    [11]闫莺,金澈清,曹锋,汪恒杰,周傲英.多数据流上共享窗口连接查询的降载策略[J].计算机研究与发展,2004,41(10):1836-1841.
    [12]王栩,李建中,王伟平.基于滑动窗口的数据流压缩技术及连续查询处理方法[J].计算机研究与发展,2004,41(10):1639-1644.
    [13] Arvind Arasu,Brian Babcock,Shivnath Babu,Mayur Datar,Keith It,Itaru Nishizawa,Justin Rosenstein,Jenifer Widow.STREAM:the Stanford Stream Data Manager.In SIGMOD’03,San Diego,California,USA.2003.
    [14] M.Garofalakis,J.Gehrke,R.Rastogi.Querying and Mining Data Streams:You Only Get One Look.In Proc.of the 28th VLDB Conference,HongKong,China.2002.
    [15] D.Abadi,D.Carney,U.Cetintemel,and et al.Aurora:A New Model and Architecture forData Stream Management. In VLDB’03 Journal(12)2:120-139,August 2003.
    [16] Yan-Nei Law,Haixun Wang,Zaniolo.Query Language and Data Models for Database Sequences and Data Streams.In Proc.ofVLDB 2004.
    [17] The STREAM Group.Stanford Data Stream Management System (lastest overview paper).To appear in a book on data stream management edited by Garofalakis,Gehrke, and Rastogi.2005.
    [18] Brian Babcock,Shivnath Babu,Mayur Datar,Rajeev Motwani,Jennifer Widom.Models and issues in data stream system.In PODS’02,Madison,WI,USA.2002.
    [19] Datar M,Gionis A,Indyk P,Motwani R.Maintaining stream statistics over sliding windows.ACM Special Interest Group on Algorithms and Computation Theory and SIAMActivity Group on Discrete Mathematics(Eds).The 13th ACM SIAM Symposium on Discrete Algorithms,San Francisco,USA,2002.San Francisco:Society for Industrial and Applied Mathematics,2002:635~644.
    [20] Datar M,Gionis A,Indyk P,Motwani R.Maintaining stream statistics over sliding windows.ACM Special Interest Group on Algorithms and Computation Theory and SIAM Activity Group on Discrete Mathematics(Eds).The 13th ACM SIAM Symposium on Discrete Algorithms,San Francisco,USA,2002.San Francisco:Society for Industrial and Applied Mathematics,2002:635~644.
    [21] Babcock B,Datar M,Motwani R.Sampling from a moving window over streaming data. ACM Special Interest Group on Algorithms and Computation Theory and SIAM Activity Group on Discrete Mathematics(Eds).The 13th ACM SIAM Symposium on Discrete Algorithms,San Francisco,USA,2002.San Francisco:Society for Industrial and Applied Mathematics,2002:633~634.
    [22] Chris Giannella,Jiawei Han,Jian Pei,Xifeng Yan,Philip S.Yu.Mining frequent patterns in data streams at multiple time granularities.Hillol Kargupta and Anupam Joshi(Eds).National Science Foundation Workshop on Next Generation Data Mining(NGDM’02),Marriott,Inner Harbor,Baltimore,USA,2002.Baltimore:NSF,November2002.http://www.cs.umbc.edu/NGDM02/schedule.html.
    [23]李建中,张冬冬.滑动窗口规模的动态调整算法[J].软件学报,2004,15(12):1800~1814.
    [24] Vitter JS.Random Sampling with a Reservoir.ACM Trans.on Mathematical Software, 1985,11(1):37~57.
    [25] GibbonsPB,MatiasY.New sampling-based summary statistics for improving approximate Query answer.Laura M.Haas,Ashutosh Tiwary(Eds)ACM SIGMOD International Conference on Management of Data (SIGMOD’98),Seattle,Washington,USA,June2-4,1998:331~342.
    [26] GibbonsPB,Matias,Y,PoosalaV.Fastin cremental maintenance of approximate his tograms.MatthiasJarke,MichaelJ.Carey,KlausR.Dittrich,FrederickH.Lochovsky,PericlesLoucopoulos,ManfredA.Jeus feld(Eds).The 23rd International Conferenceon Very Large Data Bases(VLDB'97),Athens,Greece.August25-29,1997.Athens:MorganKaufmann1997:466~475.
    [27] Poosala V,Ioannidis Y,Haas P,Shekita E.Improved histograms for selectivity estimation of range predicates.SIGMOD Record,1996,25(2):294~305.
    [28] Gilbert A,Guha S,Indyk P,Kotidis Y,Muthukrishnan S,Strauss M.Fast,small-space algorithms for approximate histogram maintenance.The ACM special interest group on algorithms and computation theory(Ed.).The 34th ACM Symposium on Theory of Computing(STOC’02),Montreal,Quebec,Canada,May 2002.Montreal:SIGACT(ACM Special Interest Group on Algorithms and Computation Theory),2002.http://www.can kao wen xian.com.umontreal.ca/STOC02/STOC02.pdf,page10.
    [29] Charikar M,Chen K,Farach-Colton M.Finding frequent items in data streams.Theoretical Computer Science,2004,312(1):3~15.
    [30] Vitter,JS,Random Sampling with a Reservoir.ACM Trans.on Mathematical Software, 1985,11(1):37~57.
    [31] Cohen S,Matias Y.Spectral Bloom filters.Alon Y.Halevy,Zachary G.Ives,AnHai Doan (Eds.)ACM,SIGMOD,Conference,2003,SanDiego,California,USA,2003.SanDiego:ACM,2003:241~252.
    [32] Ganguly S,Garofalakis M,Rastogi R.Processing set expressions over continuous update streams.AlonY.Halevy,Zachary,G.Ives,AnHai,Doan(Eds)ACM,SIGMOD,Conference2003,San Diego,California,USA,2003.San Diego:ACM,2003:265~276.
    [33] Garofalakis M,Gibbons PB.Wavelet synopses with error guarantees.ACM SIGMOD(Ed) The 2002 ACM SIGMOD international conference on Management of data(SIGMOD’02), Madison,Wisconsin,USA,2002.Madison:ACMPress 2002,USA.476~487.
    [34] Chi Yun ,Yu P S ,Wang Haixun ,etal.Loadstar:A load shedding scheme for classifying data streams.In:SIAM International Con2ference on Data Mining (SDM),2005.
    [35] Jiang Qing-chun,Chakravarthy,S.Load shedding in a data stream management system [EB/OL].[2007-02].http://berlin.uta.edu/~qingchun/Papers/Load Shedding.pdf.
    [36]郭庆平,欧阳琳提出了一种分布式数据流连接查询算法[J].武汉理工大学学报2009:31(3).
    [37] Cheqing.J,Weining.Q,Chaofeng.S,Jeffrey.X,Dynamically Maintaining Frequent Items overData Stream.CIKM,New Orleans,Louisiana,USA.2003.
    [38] GomesJS,Heyong-AH choi.Finding Optimal Join Tree for Multi-Join Stream Queries in a Production System[A].Distributed Computing Systems Workshops.2006.ICDCS Workshops 2006.26th IEEE International Conference on,2006:27-27.
    [39]郭景峰,贺春亮.数据流滑动窗口聚集查询降载策略研究[J].计算机应用研究.2009.26(7).
    [40] Sakurai.Y,Faloutsos.C,Yamam M.Stream monitoring of multi-variate time series[C]//IC DE,2007.
    [41] TATBULN,ZDONIKS.Window-aware load shedding for aggregation queries over data streams[C]//Proc of the 32nd InternationalConference on Very Large Data Bases.Seoul:[s.n.], 2006:799-810.
    [42] REISS F, HELLERSTIN J.Data triage: an adaptive architecture for load shedding in telegraph CQ[C] //Proc of IEEE ICDE Conference.Tokyo:[s.n],2005:155-156.
    [43]苏亮邹鹏等.海量数据流上快速Top-K子序列匹配算法研究[J].计算机工程与科学.2009:31(6).
    [44] Golab L,Bijay K.On Concurrency Control in Sliding Window Queries over Data Streams [C]//In Proceeding10th International Conference on Extending Database Technology.Munich ,Germany:[s.n.],2006:608-626.
    [45] Arasu.A.Approximate Count and Quantile over Sliding Windows.In PODS 2004.
    [46]王金栋,周良,张磊.基于分支路径分析的连续查询降载算法[J].应用科学学报.(1)2007.
    [47]张素智,魏艳.数据流技术及查询算法研究[J].微型电脑应用.23(12)2007.
    [48] Mehmed Kantardzic著.闪四清,陈茵,程雁等译.数据挖掘概念、模型、方法和算法[M].北京:清华大学出版社,2003.
    [49] Jiawei.Han,Michseline Kamber著.范明,孟小峰等译.数据挖掘概念与技术[M].北京:机械工业出版社,2007:70-91.
    [50] Lukasz Golab.Sliding Window Query Processing over Data Streams.A thesis presented to the University of Waterloo.2006.www.cs.uwaterloo.ca/research/tr/2006/CS-2006-27.
    [51] A.Arasu,S.Babu,J.Widom. The CQL Continuous Query Language:Semantic Foundations and Query Execution[J] Execution[J].International Journal on Very Large Data Bases(VLDB Journal).2006,15(2):121-142.
    [52] JAESEOK,P.HAENGRAEC.Semantic load shedding for prioritized continuous queries over data streams [C]//Proc of International Symposium on Computer and Information Sciences.Berlin:Springer,2005:813-812.
    [53] S,Babu,K.Munagala,J.Widom,and R.Motwani.Adaptive caching for continuous queries.In Proc.Int.Conf.on Data Eng.(ICDE),pages 118(129,2005).
    [54]冯卫兵,李战怀.数据流的连续查询优化技术[J].计算机应用研究.2008,1(25).
    [55]曲吉林,李敏强等.基于滑动窗口的数据流反向查询方法[J].计算机工程与应用. 2006,(30):171~173.
    [56] Tatbul N,Cetintemel U,Zdonik S.Load shedding in a data stream manager.In:Freytag JC,Lockemann PC,Abiteboul S,eds.Proc.of the 29th Int’l Conf.on Very Large Data Bases,Berlin:MorganKaufmann,Publishers 2003:309~320.
    [57] Babcock,Brian and Datar,Mayur and Motwani.Load shedding for aggregati on queries over data streams In:Rundensteiner E.Proc of the 20th In Conference on Data Engineering. Boston:IEEE Computer Society.2004:350~361.
    [58]陈为满,苏亮,高春鸣.数据流上快速子序列匹配[J].计算机工程与应用.2008,44(36).
    [59] Lukasz Golab.Sliding Window Query Processing over Data Streams.A thesis presented to the University of Waterloo.2006.www.cs.uwaterloo.ca/research/tr/2006/CS-2006-27.
    [60]林锦贤,林钦仙.数据流滑动窗口连接的自适应降载策略[J].福州大学学报.2007,35(3):381-386.
    [61]翁颖钧,朱仲英.基于动态时间弯曲的时序数据聚类算法的研究[J].计算机仿真,2004,21 (3):37241.
    [62] Chen S2C,Kashyap R L.A Spatio Temporal Semantic Model for Multimedia a Presentati ons and Multimedia Database Systems[J].IEEE Trans on Knowledge and Data Engineering, 2001,13 (4)6072622.
    [63]安镇宙,杨鉴.一种新的基于并行分段裁剪的DTW算法[J].计算机工程与应用,2007,43(15):35238.
    [64]陈当阳,贾素玲.时态数据的趋势序列分析及其子序列匹配算法研究[J].计算机研究与发展,2007,44(3):5162520.
    [65]傅鹂,鲁先志,蔡斌.一种基于数据流驱动的数据流连续查询模型[J].重庆工学院学报(自然科学),2008,22(10):107-111.
    [66]马瑞民,王小龙.一种挖掘和监测数据流上的变化的方法[J].计算机科学,2005(32)7:496-499.
    [67]王小龙,马瑞民.一种挖掘数值型数据流上的分类方法[J].计算机应用,2006,26: 165-168.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700