数据流上Skyline查询处理算法研究

英文题名：Research on Algoritms to Process Skyline Query over Data Stream
作者：孙圣力
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：数据流 ; 数据挖掘 ; Skyline查询 ; 分布式数据流 ; 子空间 ; 概率数据流
英文关键词：Data Stream ; Data Mining ; Skyline Query ; Distributed Data Steam ; Subspace ; Probabilistic Data Stream
学位年度：2008
导师：朱扬勇
学科代码：081202
学位授予单位：复旦大学
论文提交日期：2008-04-12

摘要

数据流是连续、实时、有序的数据项序列。数据流广泛存在于因特网与传感器网络、交通与环境监控、工业控制、金融股市与电子商务交易等应用中。流数据挖掘与管理是近年来学术界和工业界所共同关注的问题。作为一种重要的数据挖掘技术,Skvline查询是近年来的研究热点。Skvline是定义在一个多维对象集上的集合,它由所有不被其它对象所支配的对象组成。Skyline对于数据挖掘可视化、用户偏好查询及多约束决策等问题具有重要意义。自从Skyline查询在学术界被提出以来,对该课题的研究迄今为止都非常活跃。
     数据流的特点是数据实时到达、规模宏大、次序独立以及数据往往只能一次读取。这就要求Skvline查询处理算法必需高效地处理数据流中到来的每一个对象,并且要求算法具有较低的时间复杂度。学术界已经出现了一些有关该课题的研究成果,但这些成果仅涉及数据流上全空间Skyline的查询处理,并且数据形式也仅限于集中式、确定性数据流。此外,现存算法没有很好地解决求影响时间与淘汰被新来对象所支配的对象这两个关键问题,导致算法性能较低。本文在对现有技术之不足进行了彻底改进的基础上,进一步解决了一些新的重要实际应用与需求。用户对数据对象各属性的关注往往是有差异的;实际应用中的数据流往往以分布式的形式出现,例如:金融股票交易、环境监测以及网络通讯等应用;最近两年以来,一种全新的被称为概率数据流的数据形态逐步引起了研究者们的关注,针对概率数据流的挖掘与分析方兴未艾。这些新的应用需求为数据流上的Skyline查询处理带了新的挑战。本文对数据流上的Skyline查询处理算法进行了系统地研究,主要成果概括为以下几个方面:
     (1).提出了一个高效地处理滑动窗口上Skyline持续查询的算法CridSky。对于数据流上的Skvline查询处理,决定其性能的主要因素是计算新到达对象的影响时间以及淘汰被新到达对象所支配的对象这两个关键过程的效率。对这两个关键过程,现有工作中只是简单地依靠R树上的窗口查询机制来实施,直接导致了算法性低下。GridSky算法采用更适合数据流这种频繁更新环境的基本网格作为索引结构,并在此基础上开发了基于时间戳的剪枝策略、基于网格的的剪枝策略以及分层遍历策略等优化措施,大幅度地提高了算法的性能。大量的对比实验表明,在空间复杂度略低的情况下,GridSky与其竞争对手相比时间性能优势在1个数量级以上。此外,GridSky算法对不同的数据分布、数据规模与维度具有很强的可扩展性。
     (2).研究了分布式数据流上的Skyline查询问题,提出了一个通信最优的分布式算法BOCS。近年来随着大规模分布式应用的涌现,分布式数据流上的查询处理与数据挖掘越来越引起了研究者们的关注。此前相关工作局限于集中式数据流上的Skyline计算,没有人考虑吏具挑战性的分布式数据流上的Skyline查询问题。本文围绕着降低系统反应延迟与最小化通信负荷的目标,在对GridSky进行重大适应性改造的基础上,提出了一个两阶段渐进求解的分布式算法BOCS。并对BOCS的关键实现环节,如:远程站点与协调站点间的通信协议、Skyline增量的计算等进行了优化,使算法达到了通信效率与反应延迟的合理均衡。理论分析证明在所有基于非共享策略的算法中,BOCS算法通信最优。大量的对比实验也表明,BOCS算法高效、稳定且具有良好的可扩展性。
     (3).提出了一个高效地计算滑动窗口中任意子空间上Skyline的算法StreamSubsky。此前相关工作仅限于维护滑动窗口全空间上的Skyline,未涉及到子空间上Skyline的计算问题。而用户的偏好与倾向性天然不同,这就催生了滑动窗口中子空间上的Skyline查询问题的研究。本文首次提出并较好地解决了该问题,提出的StreamSubsky算法巧妙地利用了全空间与各子空间上Skyline之间的关系,采用自顶向下的方式通过两个阶段增量地返回目标子空间上的计算结果。此外,算法还采用了多个优化策略显著地提高了计算效率。理论分析和实验结果表明,与同类算法相比StreamSubsky以极少的时间开销就能使查询得到响应,算法具有良好的时间与空间性能。
     (4).对概率数据流上的Skyline查询问题进行了深入研究,提出了一个高效的解决方案。数据的内在不确定性存信息集成、RFID以及传感器网络等普适计算环境中普遍存在。对概率数据流进行管理与分析近年来引起了研究者们的广泛关注,而此前尚无解决概率数据流上Skyline持续查询的算法出现。本文基于“可能世界”语义对该问题首次进行了建模,并提出了一个高效的查询处理算法SOPDS。与传统确定性数据流上的Skyline查询处理不同,SOPDS算法主要考虑以下两个基本问题:一是如何高效地确定对象的身份(判断其是否为Skyline对象),即减少身份判断过程中支配测试的次数以降低CPU开销;二是在保证算法正确性的前提下尽可能早地淘汰那些不再有机会加入Skyline的对象,以减少内存开销。围绕着以上两个基本问题,本文先后提出了概率定界、逐步求精、提前淘汰与选择补偿等优化措施对算法从时间与空间上进行了系统地优化。理论分析与详细的对比实验表明,SOPDS算法在时间与空间上具有较高的整体性能,算法高效、稳定。
     本文研究了数据流上Skyline查询的四个重要问题,并分别提出了有效的解决方案。本文提出GridSky算法对现有技术进行了彻底地改进,而提出的BOCS、StreamSubsky和SOPDS算法则进一步填补了一些重要新兴应用的空白。理论分析证明这些算法高效地解决了相应的问题;大量的对于比实验也表明,与现有技术相比本文提出的算法在存储空间、查询处理速度等方面具有明显的优势。
Data stream is a sequence of digitally encoded signals used to represent information in transmission, which is generated from applications such as network routing, traffic and environmental monitoring, stock trading and electronic business activities etc. The study on streaming data is one of the hot topics among the database circle all over the world recently. As an important data mining technique, skyline query has been receiving considerable attention among database community. Skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Skyline query is of great significance for urban navigation, data mining visualization, multi-criteria decision making and preference query. As skyline query is proposed, techniques to process skyline query mushroom.
     Item within data stream can only be treated once because streaming data is of the nature of rapid arriving rate, large size and independence. Novel algorithm to process skyline query over data stream efficiently in a limited memory is still on demand. As to the topic of skyline query processing over sliding window, there already exist some methods. But some important issues about this topic are still open questions. Skyline query in full-space and only over centric, certain data stream is addressed in existing work. Besides, there exist critical defects about them. Existing approaches are improved fundamentally in this thesis at first. Furthermore, we set forward to solve some emerging issues about this topic. Attributes of object tend to be regarded differently; data stream in reality is always generated by a distributed manner; more and more attention has been attached to probabilistic data stream analyzing recently, but skyline computation over probabilistic data stream is still at large. These emerging requirements entail great challenges to skyline query processing over data stream. We are committed to subdue these challenges effectively and efficiently in this thesis, and our endeavors and contributions can be concluded as following:
     (1).We proposed a novel algorithm GridSky to process skyline query over sliding window efficiently. As to the algorithm processing skyline query over sliding window, its performance is decided by efficiency of the critical procedure to compute influence time of new arriving object and to eliminate the dominated objects. This problem is not well addressed in existing work, hence leads to low performance of existing approaches. More adaptive index basic grid, which is more applicable to data stream environment, is adopted by GridSky. Based on this index structure, pruning rule based on timestamp, pruning rule based grid and optimizing rule based on layered visit are developed to speed up GridSky progressively. Massive comparing experiments demonstrate that, GridSky outperforms its competitors by at least one order of magnitude as to the point of time complexity while consuming less memory space. Besides, GridSky is more scalable over data distribution, dimensionality and cardinality than its competitor.
     (2).We studied the issue of skyline query over distributed data streams, and proposed a communication-optimal algorithm BOCS. Query processing and data mining over distributed data streams have received considerable attention within database community recently. We are the first to address skyline query processing over distributed data streams where streams derive from multiple horizontally-spit data sources. Based on strategy of progressive refinement, skyline is computed incrementally in BOCS through two phases. In the first phase, local skylines on remote sites are maintained efficiently by GridSky and only skyline increments on remote sites are transmitted to coordinator at each time. Some skillful measures are developed to compute delta skyline efficiently in this phase. In the second phase, global skyline is obtained by integrating remote increments with the latest global skyline. Theoretical analysis is provided and shows that BOCS is communication optimal among all algorithms which are out of share-nothing strategy. Extensive experiments demonstrate that our proposal is efficient, scalable and stable.
     (3). We proposed an efficient algorithm StreamSubsky to handle subspace skyline computation over sliding window. Streaming data mining and skyline computation within subspaces have recently received much attention in data mining community. Previous work only sought to maintain full space skyline, we are the first to handle the problem of subspace skyline computation over sliding window. Some useful rules between full-space and subspace skyline are found by us in StreamSubsky, we propose an efficient top-town method to incrementally output the skyline objects within specified subspaces henceforth. Moreover, we present a set of optimizing heuristics to speed up critical procedures. Theoretical analysis and extensive experiments demonstrate that our algorithm is able to output the first result only taking a very short time compared with previous approaches, and our method are both efficient and scalable.
     (4).Inherent uncertainty and unreliability of data exist widely in emerging applications like information integration, RFID and sensor network based ubiquitous computation etc. The management of uncertain, probabilistic data stream has recently attracted considerable attention among researchers. Although previous work has well addressed skyline computation over static data or traditional certain data stream, skyline query over probabilistic data stream is still at large. To the best of our knowledge, we are the first to model this issue formally based on "possible worlds" semantics, furthermore an effective and efficient algorithm SOPDS is proposed to handle this issue. Compared with skyline query over traditional data stream, there are two basic different problems within this topic: how to efficiently confirm the status of new arrival and how to knock out unpromising objects as early as possible. Based on more adaptable index structure, a set of heuristic rules like probability bounding, progressive refinement, pre-elimination and selective compensation are developed to improve the comprehensive performance of SOPDS from the point of reducing both CPU overhead and memory cost. Massive back to back experiments demonstrate that our algorithm is of high overall performance, SOPDS is efficient, stable and scalable.
     In one word, four important issues about skyline query over data stream are well studied in this thesis and efficient algorithms to handle these issues are proposed respectively. Our approaches are great complement and improvement to existing skyline algorithm. Theoretical analysis shows our approaches are effective and efficient, extensive comparing experiments validate the theoretical observation.

引文

[AHW03]Aggarwal C.,Han J.,Wang J.,and Yu P.S.A Framework for Clustering Evolveing Data Streams[A].In:Proc.of VLDB[C].2003,pages 81-92.
    [AM04]Arasu A.,Manku G.S.Approximate Counts and Quantiles over Sliding Windows[A].In:Proc.of the ACM PODS[C].2004,pages 286-296.
    [AMS99]Alon N.,Matias Y.,and Szeedy M.The Space Complexity of Approximating the Frequency Moments[J].Journal of Computer and System Sciences,1999,58(1):137-147.
    AY[08]Aggarwal C.,Yu P.A Framework for Clustering Uncertain Data Streams[A].In:Proc.of ICDE[C].2008.
    [Blo70]Bloom B.H.Space/Time Trade-offs in Hash Coding with Allowable Errors[J].Communications of the ACM,1970,13(7):422-426.
    [Buc89]Buchta C.On the Average Number of Maxima in a Set of Vectors[J].Informaton Processing Letters,1989,32(2):63-65.
    [BBD+02]Babcock B.,Babu S.,Datar M.,Motwani R.and Widom J.Models and Issues in Data Stream Systems[A].In:Proc.of the ACM PODS[C].2002,pages 1-16.
    [BDJ+05]Burdick D.,Deshpande P.M.,Jayram T.S.,Ramakrishnan R.,and Vaithyanathan S.OLAP Over Uncertain and Imprecise Data[A].In:Proc.of VLDB[C].2005,pages 970-981.
    [BGZ04]Balke W.T.,Guntzer U.,and Zheng J.X.Efficient Distributed Skylining for Web Information Systems[A].In:Proc.of the EDBT[C],2004,pages 256-273.
    [BKS01]Borzsonyi S,Kossmann D.and Stocker K.The Skyline Operator[A].In:Proc.of ICDE[C].2001,pages 421-430.
    [BKS+78]Bentley J.L.,Kung H.Y.,Schkolnick M.and Thompson C.D.On the Average Number of Maxima in a Set of Vectors and Applications[J].Journal of ACM,1978,25(4):536-543.
    [BO03]Babcock B.,and Olston C.Distributed Top-k Monitoring[A].In Proc.of the ACM SIGMOD[C],2003,pages 28-39.
    [Ca006]曹锋.数据流聚类分析算法[D].上海:复旦大学,2006.
    [CCC04]Charikar M.,Chen K.,and Colton M.F.Finding Frequent Items in Data Streams[A].In:Proc.of the 29~(th) International Colloquium on Automata,Languages and Programming[C].2002,pages 639-703.
    [CCC+02]Carney D.,Cetintemel U.,Chemiack M.et al.Monitoring Streams:A New Class of Data Management Applications[A].In:Proc.of VLDB[C],2002,pages 215-225.
    [CCD+03]Chandrasekaran S.,Cooper O.,and Deshpande A.et al.YelegraphCQ:Continuous Dataflow Processing for an Uncertain World[A].In:Proc.of the Conferenc on Innovative Data System[C].2003,pages 269-280.
    [CET05]Chan C.Y.,Eng P.K.,Tan K.L.Stratified Computation of Skylines with Partially-ordered Domains[A].In:Proc.of the ACM SIGMOD [C].2005,203-214.
    [CG05]Cormode G.Garofalakis M.Sketching Streams Through the Net:Distributed Approximate Query Tracking[A].In:Proc.of VLDB[C].2005,pages 13-24.
    [CG07]Cormode G.and Garofalakis M.Sketching Probabilistic Data Streams[A].In:Proc.of the ACM SIGMOD[C].2007,281- 292
    [CGG+03]Chomicki J.,Godfrey E,Gryz J.and Liang D.Skyline With Presorting[A].In:Proc.of ICDE[C].2003,Pages 717-719.
    [CGM+05]Cormode G.,Garofalakis M.,Muthukrishnan S.and Rastogi R.Holistic Aggregates in a Networked World:Distributed Tracking of Approximate Quantiles[A].In:Proc.of the ACM SIGMOD[C].2005,pages 25-36.
    [CJT06]Chan C.Y.,Jagadish H.V.,Tan K.L.et al.Finding k-Dominant Skyline in High Dimensional Space[A].In:Proc.of the ACM SIGMOD[C],2006,pages 503-514.
    [CKP04]Cheng R.,Kalashnikov D.,and Prabhakar S..Querying Imprecise Data in Moving Object Environments[J].IEEE Trans.on Knowledge and Data Engineering,2004,16(9):1112-1127.
    [CM08]Cormode G.,and McGregor A.Approximation Algorithms for Clustering Uncertain Data[A].In:Proc.of the ACM PODS[C].2008.
    [CMR05]Cormode G.,Muthukrishnan S.and Rozenbaum I.Summarizing and Mining Inverse Distribution on Data Streams via Dynamic Inverse Sampling[A].In:Proc.of VLDB[C].2005,pages 25-36.
    [COP03]Charikar M.,O'Callaghan L.,and Panigrahy R.Better Streaming Algorithms for Clustering Problems[A].In:Proc.of STOC[C].2003,pages 30-39.
    [DGG04]Das A.,Ganguly S.,Garofalakis M.and Rastogi R.Distributed Set-expression Cardinality Estimation[A].In:Proc.of VLDB[C].2004,pages 312-323.
    [DH00]Domingos P.,and Hulten G.Mining High-speed Data Streams[A].In:Proc.of KDD[C],2000,pages 71-80.
    [DS04]Dalvi N.and Suciu D.Efficient Query Evaluation on Probabilistic Databases[A].In:Proc.of VLDB[C].2004,pages 864-875.
    [DS07] Dellis E., and Seeger B. Efficient Computation of Reverse Skyline Queries[A]. In: Proc. of VLDB[C], 2007, pages 291-302.

    [FCA+98] Fan L., Cao P., Almeida J. and Broder A. Z.. Summary Cache: a Scalable Wide-area Web Cache Sharing Protocol [R]. Madison: Department of Computer Science, University of Wisconsin-Madison, 1998.
    [Gib01] Gibbons P. Distinct Sampling for Highly-accurate Answers to Distinct Values Queries and Event Reports [A]. In: Proc. of VLDB[C], 2001, pages 541-550.

    [Gut84] Guttman A. R-tree a Dynamic Index Structure for Spatial Searching[A]. In: Proc. of the ACM SIGMOD[C]. 1984,47-57.

    [GGR02] Garofalakis M., Gehrke J., and Rastogi R. Querying and Mining Data Steams: You Only Get One Look[A]. In: Proc. of the ACM SIGMOD[C]. 2002, pages 635-646.
    [GHP+02] Giannella C., Han J.W., Pei J. et al. Mining Frequent Patterns in Data Streams at Multiple Time Granularities [A]. In: Proc. of the NSF Workshop on Next Generation Data Mining [C], 2002.
    [GK01] Greenwald M., and Khanna S. Space-efficient Online Computation of Quantile Summaries [A]. In: Proc. of the ACM SIGMOD[C]. 2001, pages 58-66.
    [GKM+01] Gilbert A., Kotidis Y., Muthukrishnan S., and Strauss M. Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries [J]. VLDB Journal, 2001, 79(8):79-88.
    [GL01] Graefe G. and Larson P.A. B-tree Indexes and CPU Caches [A]. In: Proc. of ICDE[C].2001, pages 349-358.

    [GM98] Gibbons P.B., Matias Y. New Sampling-based Summary Statistics for Improving Approximate Query Answer [A]. In Proc. of the ACM SIGMOD[C].1998, pages 331-342.
    [GMM00] Guha S., Mishra N., Motwanis R. and O'Callaghan L. Clustering Data Streams[A]. In: Proc. of FOCS[C]. 2000, pages 359-366.
    [GMM03] Guha S., Meyerson A., and Mishra N. et al. Clustering Data Streams: Theory and Practice [J]. IEEE Transactions on Knowledge and Data Engineering, 2003,15(3):515-528.
    [GSG05] Godfrey P., Shipley R., and Gryz J. Maximal Vector Computation in Large Data Sets [A]. In: Proc. of VLDB[C],2005, pages 229-240.
    [GT01] Gibbons P. and Tirthapura S. Estimating Simple Functions on the Union of Data Streams [A]. In: Proc. of the ACM Symposium on Parallel Algorithms and Architectures[C]. 2001, pages 281-291.
    [HKP01] Hristidis V, Koudas N., Papakonstantinou Y. PREFER: A System for the Efficient Execution of Multi-Parametric Ranked Queries [A]. In: Proc. of the ACM SIGMOD[C], 2001,pages 259-270.
    [HRR98]Henzinger M.,Raghavan R,and Rajagopalan S.Computing on Data Streams[R].Calofornia:Digital systems research center,Compaq,May 1998.
    [HSD01]Hulten G.,Spencer L.,and Domingos P.Mining Time-changing Data Streams[A].In:Proc.of KDD[C].2001,pages 97-106.
    [HWX+08]Haas P.,Wu M.,Xu F.,Jampani R.,Jermaine C.,and Perez L.MCDB:A Monte Carlo Approach to Managing Uncertain Data[A].In:Proc.of the ACM SIGMOD[C].2008.
    [HXD02]He Z.Y.,Xu X.F.,Deng S.C.,Squeezer:An Efficient Algorithm for Clustering Categorical Data[J].Journal of Computer Science and Technology,2002,17(5):611-624.
    [IP95]Ioannidis Y.,and Poosala V.Balancing Histogram Optimality and Practicality for Query Result Size Estimation[A].In:Proc.of the ACM SIGMOD[C].1995,pages 233-244.
    [JHE04]Jin W.,Han J.,Ester M.Mining Thick Skylines over Large Database [A].In:Proc.of KDD[C].2004,pages 255-266.
    [JKM+98]Jagadish H.V.,Koudas N.,Muthukrishnan S.,Poosala V.,Sevcik K.C.,and Suel T.Optimal Histogram with Quality Gurantees[A].In:Proc.of VLDB[C].1998,pages 275-286.
    [JKR+06]Jayram T.S.,Krshnamurthy R.,Raghavan S.,Vaithyanathan S,and Zhu H.Avatar Information Extraction System[J].IEEE Data Engineering Bulletin,2006,29(1):40-48.
    [JKV07]Jayram T.S.,Kale S,Vee E.Efficient Aggregation Algorithms for Probabilistic Data[A].In:Proc.of the ACM-SIAM SODA[C].2007,pages 346-355.
    [JLL05]Jiang S.Y.,Li Q.H.,and Li X.Survey on Data Stream Mining[J].Computer Engineering and Design.2005,26(5):1130-1133.
    [JMM+07]Jayram T.S.,McGregor A,Muthukrishan S.,Vee E.Estimating Statistical Aggregates on Probabilistic Data Streams[A].In:Proc.of the ACM PODS[C].2007,pages 243-252.
    [JQZ04]金澈清,钱卫宁等.流数据分析与管理综述,[J].软件学报,2004,15(10):1172-1181.
    [JS94]Jawerth B.,Sweldens W.An Overview of Wavelet Based Multiresolution Analysis[J].SIAM Review,1994,36(3):377-412.
    [Koc08]Christoph Koch,Linking Uncertainty and Unreliability and Approximating Expressive Queries on Probabilistic Databases[A].In:Proc.of the ACM PODS[C].2008.
    [KBS06]Khoussainova N.,Balazinska M.,and Suciu D.Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints[A].In:Proc.of the 5~(th) ACM international workshop on Data engineering for wireless and mobile access[C].2006,pages 43-50.
    [KGR+06]Keralapura R.,Cormode G.,and Ramamirtham J.Communication-efficient Distributed Monitoring of Threshold Counts[A].In:Proc.of the ACM SIGMOD[C].2006,pages 289-300.
    [KLP75]Kung H.T.,Luccio E Preparata F.P.On Finding the Maxima of a Set of Vectors[J].JACM,1975,22(4):469-476.
    [KMS02]Kom F.,Muthukrishnann S.and Srivastava D.Reverse Nearest Neighbour Aggregates over Data Streams[A].In:Proc.of VLDB[C].2002,pages 814-825.
    [KRR02]Kossman D.,Ramsak F.,and Rost S.Shooting Stars in the Sky:An Online Algorithm for Skyline Queries[A].In:Proc.of VLDB[C].2002,pages 275-286.
    [LC08]Lian X.,and Chen L.Monochromatic and Bichromatic Reverse Skyline Search over Uncertain Databases[A].In:Proc.of the ACM SIGMOD[C].2008.
    [LCT+07]刘莉,蔡军卫,田中彬等.一种基于移动Agent的分布式Skyline 查询算法[J].微电子学与计算机.2007,24(10):46-49.
    [LYW+05]Lin X,Yuan Y,Wang W and Lu H.Stabbing the Sky:Efficient Skyline Computation Over Sliding Windows[A].In:Proc.of ICDE[C].2005,pages 502-513.
    [MBP05]Mouratidis K.,Bakiras S.,and Papadias D.Continuous Monitoring of Top-k Queries Over Sliding Windows[A].In:Proc.of the ACM SIGMOD[C].2005,pages 635-646.
    [MFH03]Madden S.,Franklin M.J.,Hellerstein J.M/,and Hong W.The Design of an Acquisitional Query Processor for Sensor Networks[A].In Proc.of the ACM SIGMOD[C].2003,pages 491-502.
    [MJW+08]Hua M.,Pei J.,Zhang W.,and Lin X.Ranking Queries on Uncertain Data:A Probabilistic Threshold Approach[A].In:Proc.of the ACM SIGMOD[C].2008.
    [MM02]Manku G..S.,and Motwani R.Approximate Frequency Counts over Data Streams[A].In:Proc.of VLDB[C].2002,pages 346-357.
    [MP80]Munro L.and Paterson.Selection and Sorting with Limited Storage[J].Theoretical Computer Science,1980,12:315-323.
    [MPJ07]Morse M.,Patel J.M.,and Jagadish H V.Efficient Skyline Computation over Low-cardinality Domains[A].In:Proc.of VLDB[C].2007,pages 267-278.
    [MSD05]Manjhi A,Shkapenyuk V,Dhamdhere K and Olston C.Finding (recently) Frequent Items in Distributed Data Streams[A].In:Proc.of ICDE[C].2005,pages 767-778.
    [MVW98]Matias Y.,Vitter J.S.,Wang M.Wavelet-Based Histogram for Selectivity Estimation[A].In:Proc.of the ACM SIGMOD[C].1998,pages 448-459.
    [MWA+03]Motwani R.,Widom J.Arasu A.et al.Query Processing,Resource Management,and Approximation in a Data Stream Management System[A].In:Proc.of CIDR[C].2003,pages 245-256.
    [NKC+06]Ngai W.K.,Kao B.,Chui C.K.,Cheng R,Chau M,and Yip K.Y.Efficient Clustering of Uncertain Data[A].In:Proc.of ICDM[C].2006,pages 436-445.
    [PIH+96]Poosala V.,Ioannidis Y.,Haas P.and Shekita E.Improved Histograms for Selectivity Estimation of Range Predicates[A].In:Proc.of the ACM SIGMOD[C].1996,pages 294-305.
    [PJE05]Pei J.,Jin W.,Ester M.,and Tao Y..Catching the Best Views of Skyline:A Semantic Approach Based on Decisive Subspaces[A].In:Proc.of VLDB[C].2005,pages 253-264.
    [PJL+07]Pei J,Jiang B,Lin X and Yuan Y.Probabilistic Skylines on Uncertain Data[A].In:Proc.of the VLDB[C].2007,pages 15-16.
    [PTF+05]Papadias D.,Tao Y.,Fu G.,and Seeger B.Progressive skyline Computation in Database Systems[J].ACM Transations on Database Systems,2005,30(1):41-82.
    [PYL+06]Pei J.,Yuan Y.,Lin X.,Jin W.et al.Towards Multidimensional Subspace Skyline Analysis[J].ACM Transations on Database Systems,2006,31(4):1335-1381.
    [RLB+08]Re C.,Letchner J.,Balazinska M.,and Suciu D.Event Queries on Correlated Probabilistic Streams[A].In:Proc.of the ACM SIGMOD [C].2008.
    [SBH+06]Sarma A.D.,Benjelloum O.,Halevy A.,and Widom J.Working Models for Uncertain Data[A].In:Proc.of ICDE[C].2006.
    [SS06]Sharifzadeh M.,and Shahabi C.The Spatial Skyline Queries[A].In:Proc.of the VLDB[C],2006,pages 751-762.
    [TEO01]Tan K.L.,Eng P.K.,and Ooi B.C.Efficient Progressive Skyline Computation[A].In:Proc.of VLDB[C].2001,pages 301-310.
    [TP06]Tao Y,Papadias D.Maintaining Sliding Window Skylines on Data Streams[J].IEEE Transactions on Knowledge and Data Engineering(IEEE TKDE),2006,18(3):377-391.
    [TS96]Theodoridis Y.and Sellis T.A Model for the Prediction of R-tree Performance[A].In:Proc.of the ACM PODS[C].1996,161-171.
    [TXP06]Tao Y.,Xiao X.and Pei J.SUBSKY:Efficient Computation of Skylines in Subspaces[A].In:Proc.of ICDE[C].2006,pages 65-74.
    [Vit85]Vitter J.S.Random Sampling with a Reservoir[J].ACM Transactions on Mathematical Software(TOMS).1985,11(1):37-57.
    [VDK+07]Vlachou A.,Doulkeridis C.,Kotidis Y.and Vazirgiannis M.SKYPEER:Efficient Subspace Skyline Computation Over Distributed Data[A].In:Proc.of ICDE[C].2007,416-425.
    [WFY+03]Wang H.,Fan W.,Yu ES.and Han J.Mining Concept Drifting Data Streams Using Ensemble Classifiers[A].In:Proc.of KDD[C].2003,pages 226-235.
    [WYM97]Wang W.,Yang J.and Muntz R.STING:a Statistical Information Grid Approach to Spatial Data Mining[A].In:Proc.of VLDB[C].1997,pages 186-195.
    [WZF+06]Wu P.,Zhang C.,Feng Y.et al.Parallelizing Skyline Queries for Scalable Distribution[A].In:Proc.of the EDBT[C],2006,pages 112-130.
    [XWC+07]Xin J,Wang G,Chen L,Zhang X and Wang Z.Continuously Maintaining Sliding Window Skylines in a Sensor Network[A].In:Proc.of the 12th International Conference on Database Systems for Advanced Applications[C].2007,pages 509-521.
    [XZ06]Xia T.and Zhang D.Refreshing the Sky:the Compressed Skycube with Efficient Support for Frequent Updates[A].In:Proc.of the ACM SIGMOD[C].2006,pages 491-502.
    [YCL+04]Yu X.J.,Chong Z.H.,Lu H.G.,and Zhou A.False Positive or False Negative:Mining Frequent Itemsets From High Speed Transactional Data Streams[A].In:Proc.of VLDB[C].2004,pages 204-215.
    [YLQ+05]Yuan Y.,Lin X.,Q.Liu,Wang W.,Yu J.X.and Zhang Q..Efficient Computation of the Skyline Cube[A].In:Proc.of VLDB[C].2005,pages 241-252.
    [Zhou07]周红福.基于索引的Skyline算法研究[D].上海:复旦大学,2007.
    [ZCY+07]Zhou A.,Cao E,Yan Y.,Sha C.,and He X.Distributed Data Stream Clustering:A Fast EM-based Approach.In:Proc.of ICDE.2007,pages 736-745.
    [ZLY+08]Zhang Q.,Li F.,and Yi K.Finding Frequent Items in Probabilistic Data[A].In:Proc.of the ACM SIGMOD[C].2008.
    [ZMC+05]Zhang L.,Mao Y.,Cao C.,and Song W.Research and Development in Data Stream Management System[J].Computer Application Research.2005.12-15.
    [ZR96]Zhang T.,Ramakrishnann R.BIRCH:An Efficient Data Clustering Method for Very Large Databases[A].In:Proc.of the ACM SIGMOD[C].1996,pages 103-114.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700