不确定数据流上Skyline查询处理技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着计算机网络技术的快速发展,在金融信息、气象信息、无线传感器网络等领域产生了大量的数据流。同时,网络环境的复杂化使得数据流中的数据具有不确定的特征,研究不确定数据流处理技术是一个热点问题。Skyline查询处理技术常用于多目标决策,不确定数据流上Skyline查询具有重要的应用价值。不确定数据流的数据不确定、实时响应、单遍处理等特点对Skyline查询带来巨大挑战。本文针对不确定数据流上Skyline查询中的对象建模、结构索引、多数据流来源及多用户查询等问题,深入研究了不确定数据流上Skyline查询方法、分布式Skyline查询方法以及分布式子空间Skyline查询方法。取得的主要研究进展如下:
     不确定数据流上Skyline查询用于解决不确定数据流上的多目标决策问题。本文针对连续概率密度函数模型的不确定数据流上Skyline查询处理技术中Skyline概率计算、不确定对象索引结构等问题,提出了一种高效的基于高斯模型的不确定数据流上Skyline查询方法SGMU,该方法包含两个算法:动态高斯建模算法DGM和基于高斯树的Skyline查询算法GTS。DGM算法对不确定数据流滑动窗口中的数据采样并建立高斯模型,将数据流转化为不确定对象概率密度函数的参数流;GTS算法针对高斯模型的参数流建立R树索引结构,通过对R树进行剪枝以减少计算量。理论分析和模拟测试表明,与无索引结构的不确定数据流Skyline查询方法BNL(Block-Nested-Loop,简称BNL)相比,SGMU方法不仅能够对连续型不确定对象进行有效建模以辅助Skyline查询,而且能够有效地剪枝不确定数据对象,提高Skyline查询效率。
     分布式Skyline查询方法能够应对分布式的数据流上的Skyline查询任务。本文深入研究了已有的数据流上分布式Skyline查询中的数据集分割方案及对应数据集分割方案下的Skyline查询方法。在SGMU方法的基础上,提出了基于水平分割的连续概率密度函数建模的不确定数据流分布式Skyline查询方法SHUCpdf和基于垂直分割的连续概率密度函数建模的不确定数据流分布式Skyline查询方法SVUCpdf。两种方法均在分布式节点先获取局部Skyline结果集合,并在局部Skyline结果并集上再次进行Skyline查询,得出全局Skyline结果集合。两种方法的区别在于采取不同的数据集合分割方式和不同的Skyline概率计算方式。理论分析和模拟测试表明,与集中式的不确定数据流Skyline查询相比,SHUCpdf方法和SVUCpdf方法结构简单,能够快速返回部分数据集合的查询结果,精确连续的获取全局Skyline查询结果。
     子空间Skyline查询用于解决多用户Skyline查询问题。不同用户关注的数据对象属性可能不同,由此产生多个Skyline查询子空间。针对不确定数据流上的多用户Skyline查询问题,提出了基于垂直分割的连续概率密度函数建模的不确定数据流分布式子空间Skyline查询方法SSVUCpdf。SSVUCpdf方法以SVUCpdf方法为基础,通过分布式节点计算单个维度的Skyline查询结果,便于不同维度组合形成子空间;设置用户偏好空间队列保存用户偏好子空间查询结果,快速返回部分用户请求,减少高维度空间的子空间数目。理论分析与模拟测试表明,SSVUCpdf方法能够有效减少查询开销,避免子空间Skyline查询中“维度爆炸”的问题。
With the rapid development of computer network technology, a large number of data streams are generated in the field of financial information, weather information and wireless sensor networks. The complexity of the network environment brings uncertain characteristics of data streams, which makes the technology of processing uncertain data stream become increasingly important. Skyline query processing techniques are applied to make multi-objective decisions. Skyline queries on uncertain data streams have great value in many applications. The characteristics of uncertain data stream such as uncertainty of data, real-time response and single-pass, make a huge challenge for skyline query processing technology. According to the problem of object modeling method, object index structure, multi-source of data stream and multi-user queries, a skyline query method on uncertain data stream, a distributed skyline query method on uncertain data stream and a distributed sub-skyline query method on uncertain data stream are proposed in this paper.
     Skyline query methods on uncertain data streams are used to solve multi-objective decision making on uncertain data stream. To solve the problem of continuous probability density function modeling method for uncertain data, skyline probability computation and uncertain object index structure, an effective skyline query method on Gaussian model uncertain data streams (SGMU) is proposed in this paper. The method SGMU contains two algorithms: dynamic Gaussian modeling algorithm (DGM) and skyline query algorithm based Gaussian tree (GTS). The DGM algorithm samples data object and builds Gaussian model for uncertain objects in the sliding window of uncertain data stream. The data stream is transferred into probability density function parameters stream of uncertain objects by DGM. GTS algorithm establishes an R-tree index structure by the Gaussian model parameters. The R-tree structure is used to prune data object and reduce the load of computation. Theoretical analysis and simulation tests show that, compared to BNL (Block-Nested-Loop, short for BNL) which is a kind of skyline query method on uncertain data stream with no index structure, SGMU method can not only effective in modeling to support skyline query, but also effective in pruning uncertain data objects to improve the efficiency of skyline queries.
     Distributed skyline query methods can deal with the problem of skyline queries on distributed data stream. In the existing distributed skyline queries on data stream, the methods of dataset dividing and the skyline computation are researched in this paper. According to the type of dataset dividing, two skyline query methods based on SGMU are proposed. One is on horizontal divide uncertain data stream modeling by continuous probability density function (SHUCpdf), the other is on vertical divide uncertain data stream modeling by continuous probability density function (SVUCpdf). Both methods compute the skyline results of partial dataset in the distributed nodes, then query from the local skyline results. The schemes of dataset partitioning and skyline probability computation are the main differences between SHUCpdf and SVUCpdf. Theoretical analysis and simulation results show that SHUCpdf and SVUCpdf can get skyline results quickly with simple data structures comparing to SGMU. Meanwhile, both methods can get the global skyline results accurately and continuously.
     Subspace skyline query is proposed to solve queries with multi-user. Users may concern about different attributes of data objects, which brings multiple subspace for skyline queries. A subspace skyline query method on vertical divide uncertain data stream modeling by continuous probability density function (SSVUCpdf) is proposed to solve the problem of multi-user skyline query. Based on SVUCpdf, SSVUCpdf computes the skyline results of dataset on the single dimension in distributed nodes. Then the different subspaces can be constructed from various single dimension datasets easily. At the same time, SSUCpdf saves the user preferred subspaces by the user sub-skyline query queue, which can reduce the number of subspaces in high dimensional skyline queries, and returns the requests of most users rapidly. Theoretical analysis and simulation results show that, SSVUCpdf method can effectively reduce the query cost, and avoid the "dimension explosion" problem in the subspace skyline queries.
引文
[1] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom, Models and Issues in Data Stream Systems. Department of Computer Science, Stanford University,2002.
    [2] Chen J, DeWitt D J, Tian F, et al. NiagraCQ: scalable continuous query system for internet databases[C].Proc of the ACM SIGMOD Intl Conf on Management of Data. Dallas,Texas,USA:ACM Press.2000:379-390.
    [3] Gibbons PB, Matias Y. Synopsis data structures for massive data sets. In: Tarjan RE, Warnow T, eds. Proc. of the 10th AnnualACM-SIAM Symp. on Discrete Algorithms. Baltimore: ACM/SIAM, 1999: 909-910.
    [4] Greenwald M, Khanna S. Space-Efficient online computation of quantile summaries. In: Proc. of the 2001 ACM SIGMOD Int’l Conf. on Management of Data. Santa Barbara: ACM Press, 2001: 58-66.
    [5] Jeffery S R, Garofalakis M, Franklin M J. Adaptive Cleaning for RFID Data Streams[C]. In: VLDB. Seoul: 2006. 163-174.
    [6] Sarma A D, Benjelloum O, Halevy A, et al. Working models for uncertain data[C]. In: ICDE.Washington: IEEE Computer Society, 2006.
    [7] Klir G J, Folger T A. Fuzzy sets, uncertainty and information[M]. New Jersey: Pemtice-Hall, 1988.
    [8] Cormode G, Garofalakis M. Sketching probabilistic data streams[C]. In: the 2007 ACM SIGMOD international conference on Management of data.ACM, 2007. 281-292.
    [9] Pei J, Jiang B, Lin X, et al. Probabilistic skylines on uncertain data[C]. In: the 33rd international conference on Very large data bases.VLDB Endowment, 2007. 15-26.
    [10] S J T, Kale S, Vee E. Efficient aggregation algorithms for probabilistic Data[C]. In: the ACM-SIAM SODA.Louisiana: 2007. 346-355.
    [11] Cormode G, Mcgregor A. Approximation Algorithms for Clustering Uncertain Data[C]. In: 27th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.Vancouver: 2008. 191-200.
    [12] Kung H T, Luccio F, Preparata F P. On finding the maxima of a set of vectors[J]. Journal of the ACM (JACM). 1975, 22(4): 469-476.
    [13] Bentley J L, Kung H T, Schkolnick M, et al. On the average number of maxima in a set of vectors and applications[J]. Journal of the ACM (JACM). 1978, 25(4): 536-543.
    [14] Bentley J L, Clarkson K L, Levine D B. Fast linear expected-time algorithms for computing maxima and convex hulls[C]. In: SODA.1991.
    [15] Borzsonyi S, Kossmann D, Stocker K. The skyline operator[C]. In: the 17th Int'l Conf. on Data Engineering.Heidelberg: IEEE Computer Society Press, 2001. 421-430.
    [16] Balke W T, Guntzer U, Zheng J X. Efficient distributed skyline for Web information systems[C]. In: EDBT.Heraklion, Greece,: 2004. 256-273.
    [17] Chen L, Cui B, Lu H, et al. iSky: Efficient and progressive skyline computing in a structured P2P network[C]. In: Int. Conf. on Distributed Computing Systems.2008. 160-167.
    [18] Zhu L, Tao Y, Zhou S G. Distributed skyline retrieval with low bandwidth consumption[J]. IEEE Transactions on Knowledge and Data Engineering. 2009, 21(3): 384-400.
    [19] Huang Z, Jensen C S, Lu H, et al. Skyline queries against mobile lightweight devices in manets[C]. In: ICDE.2006.
    [20] Wenjie Zhang, Xuemin Lin, Ying Zhang et al. Probabilistic Skyline Operator over Sliding Windows.
    [21] Christian B?hm, Frank Fiedler, Annahita Oswald. et al. Probabilistic Skyline Queries. CIKM 2009: 651-660
    [22] Zhu L, Guan J H, Zhou S G. Skyline computation: Survey[J]. Computer Engineering and Applications. 2008, 44(6): 160-165.
    [23] Papadias D, Tao Y, Fu G, et al. An optimal and progressive algorithm for skylinequeries[C]. In: SIGMOD.2003.
    [24] Kossmann D, Ramsak F, Rost S. Shooting stars in the sky: An online algorithm for skyline queries[C]. In: the 28th international conference on Very Large Data Bases.VLDB Endowment, 2002. 286.
    [25] Sullivan M, Heybey A. Tribeca: A system for managing large databases of network traffic[C]. In: the annual conference on USENIX Annual Technical Conference.USENIX Association, 1998. 2.
    [26] Antonin Guttman ,R-trees: a dynamic index structure for spatial searching. In: ACM SIGMOD international conference on Management of data.1984.
    [27] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider. The R*-tree: an efficient and robust access method for points and rectangles. In:ACM SIGMOD international conference on Management of data,1990.
    [28] Huang Z, Lu H, Ooi B, et al. Continuous skyline queries for moving objects[J]. IEEE Transactions on Knowledge and Data Engineering. 2006, 18: 1645-1658.
    [29] Lin X, Yuan Y, Wang W, et al. Stabbing the sky: Efficient skyline computation over sliding windows[C]. In: ICDE.Tokyo, Japan: 2005. 502-513.
    [30] Tao Y, Papadias D. Maintaining sliding window skylines on data streams[J]. IEEE Trans on KDE. 2006, 18(2): 377-391.
    [31] Wang B, Yang X C, Wang G R. Outlier Detection over Sliding Window for Probabilistic Data Stream[J]. Journal of Computer Science and Technology. 2010, 25(3): 389-400.
    [32] Pei J, Fu A W, Lin X, et al. Computing compressed multidimensional skyline cubes efficiently[C]. In: the 23rd Int’l Conf. Data Eng.(ICDE’07).Citeseer, 2007. 96-105.
    [33] Atallah M J, Qi Y. Computing all skyline probabilities for uncertain data[C]. In: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.ACM, 2009. 279-287.
    [34] Zhang W J, Lin X M, Zhang Y, et al. Probabilistic Skyline Operator over Sliding Windows[C]. In: ICDE.2009.
    [35] Christian B, Frank F, Annahita O. Probabilistic Skyline Queries[C]. In: CIKM'09.Hong Kong, China: ACM, 2009.
    [36] Zhu L, Tao Y, Zhou S G. Distributed skyline retrieval with low bandwidth consumption[J]. IEEE Transactions on Knowledge and Data Engineering. 2009, 21(3): 384-400.
    [37] Huang Z, Jensen C S, Lu H, et al. Skyline queries against mobile lightweight devices in manets[C]. In: ICDE.2006.
    [38] Wang S, Ooi B C, Tung A K H, et al. Efficient skyline query processing on peer-to-peer networks[C]. In: ICDE.2007. 1126-1135.
    [39] Wu P, Zhang C, Feng Y, et al. Parallelizing skyline queries for scalable distribution[C]. In: EDBT.Munich, Germany: 2006. 112-130.
    [40] Hose K, Lemke C, Satter K U, et al. Processing relaxed skyline in PDMS using distributed data summaries[C]. In: CIKM.Arlingtion, USA: 2006. 425-434.
    [41] Vlachou A, Doulkeridis C, Kotidis Y, et al. Skypeer: Efficient subspace skyline computation over distributed data[C]. In: ICDE.Citeseer, 2007.
    [42] Xiao Y, Chen Y. Efficient distributed skyline queries for Mobile applications[J]. Journal of computer science and technology. 2010, 25(3): 523-536.
    [43] Balke W T, Guntzer U, Zheng J X. Efficient distributed skyline for Web information systems[C]. In: EDBT.Heraklion, Greece,: 2004. 256-273.
    [44] Eric L, Kevin Y Y, Lin K, et al. Progressive skylining over Web-accessible databases[J]. Data & Knowledge Engineering. 2006, 57(2): 122-147.
    [45] Xin J, Wang G, Chen L, et al. Continuously maintaining sliding window skylines in a sensor network[C]. In: the 12th International Conference on Database Systems for Advanced Applications.Berlin: Springer-Verlag, 2007. 509-521.
    [46] Sun S L, Li J J, Zhu Y Y. Efficient Processing of Continuous Skyline Query over Distributed Data Streams[J]. Journal of Software. 2009, 20(7): 1839-1853.
    [47] Yuan Y, Lin X, Liu Q, et al. Efficient computation of the skyline cube[C]. In: the 31st international conference on Very large data bases.VLDB Endowment, 2005. 252.
    [48] Pei J, Jin W, Ester M, et al. Catching the best views of skyline: A semantic approach basedon decisive subspaces[C]. In: the 31st international conference on Very large data bases.VLDB Endowment, 2005. 264.
    [49] Tao Y, Xiao X, Pei J. Subsky: Efficient computation of skylines in subspaces[C]. In: the 22nd International Conference on Data Engineering.Citeseer, 2006. 65.
    [50] Xia T, Zhang D. Refreshing the sky: the compressed skycube with efficient support for frequent updates[C]. In: the 2006 ACM SIGMOD international conference on Management of data.ACM, 2006. 502.
    [51] Chan C Y, Jagadish H V, Tan K L, et al. Finding k-dominant skylines in high dimensional space[C]. In: the 2006 ACM SIGMOD international conference on Management of data.ACM, 2006. 514.
    [52] Evangelos Dellis, Akrivi Vlachou, Ilya Vladimirskiy, et.al. Progressive Computation of Constrained Subspace Skyline Queries.
    [53] Chan C Y, Jagadish H V, Tan K L, et al. On high dimensional skylines[J]. EDBT2006,LECTURE NOTES IN COMPUTER SCIENCE. 2006, 3896: 478-495.
    [54] Lin X, Yuan Y, Zhang Q, et al. Selecting stars: The k most representative skyline operator[C]. In: ICDE.Citeseer, 2007. 86-95.
    [55] Pei J, Yuan Y, Lin X, et al. Towards multidimensional subspace skyline analysis[J]. ACM Transactions on Database Systems (TODS). 2006, 31(4): 1335-1381.
    [56] Christian B?hm, Alexey Pryakhin, Matthias Schubert, The Gauss-Tree: Efficient Object Identification in Database of Probabilistic Feature Vectors, ICDE, 2006.
    [57] Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis et.al. SKYPEER: Ef?cient Subspace Skyline Computation over Distributed Data.
    [58] Arasu A, Babcock B, Babu S, et al. STREAM: the stanford stream data manager (demonstration description)[C]. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data.ACM New York, NY, USA, 2003. 665-665.
    [59] Chandrasekaran S, Cooper O, Deshpande A, et al. TelegraphCQ: continuous dataflow processing[C]. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data.ACM New York, NY, USA, 2003. 668-668.
    [60] Carney D, Etintemel U, Cherniack M, et al. Monitoring streams: A new class of data management applications[C]. In: Proceedings of the 28th international conference on Very Large Data Bases.VLDB Endowment, 2002. 226.
    [61] Sarvjeet Singh, Chris Mayfield, Sagar Mittal, et.al. Orion 2.0: Native Support for Uncertain Data. In Proc. of the ACM Special Interest Group on Management of Data (SIGMOD 2008), Vancouver, Canada, June 2008.
    [62] ME Khalefa, MF Mokbel, Levandoski, J.J. Skyline Query Processing for Incomplete Data. Data Engineering, 2008. ICDE, IEEE 24th International Conference on Issue Date, April 2008.
    [63] [63] Hua M, Pei J, Zhang W, et al. Ranking queries on uncertain data: A probabilistic threshold approach[C]. In: ACM SIGMOD Int. Conf. on Management of Data.2008.
    [64] Lian X, Chen L. Top-k dominating queries in uncertain databases[C]. In: EDBT.Saint Petersburg, Russia: ACM, 2009. 660-671.
    [65] Beskales G, Soliman M A, Iiyas I F. Efficient search for the top-k probable nearest neighbors in uncertain databases[J]. the VLDB Endowment archive. 2008, 1(1): 326-339.
    [66] Soliman M A, Ilyas I F, Chang K C. Probabilistic top-k and ranking-aggregate queries[J]. ACM Transactions on Database Systems (TODS). 2008, 33(3): 13.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700