基于维层次数据立方体存储技术的研究

英文题名：Research on the Storage Technique of Data Cube Based-on Dimension Hierarchy
作者：彭志鹏
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：OLAP ; 数据立方体 ; 维层次 ; 维层次编码 ; 基本单元组
英文关键词：OLAP ; Date Cube ; dimension hierarchy ; Dimension hierarchy encoding ; Base single tuple
学位年度：2008
导师：蒋外文
学科代码：081203
学位授予单位：中南大学

摘要

数据立方体是数据仓库和联机分析处理的核心概念。为了加速响应联机分析处理系统中的复杂多维查询,通常需要预先计算并保存数据立方体,然而数据立方体的巨大尺寸却给其计算和存储带来诸多难题。因此,降低磁盘空间成本和提高查询性能成为数据立方体研究两个重要却又相互制约的目标。为了从根本上解决这些问题,需要探索有效的数据立方体组织方法。
     本文首先改进实现了计算维层次数据立方体的ICODH方法,该算法在给定的维顺序下,自底向上逐层递归计算;当具体到某一个维,是从维的粗粒度层到细粒度层方向循环计算聚集;通过共享排序来减少磁盘的读写操作,以减小维层次数据立方体的计算时间。另一方面研究了维层次编码技术,提出了一种对维表能有效进行层次编码的方法,保存了原有数据立方体的语义信息。通过这两方面来加快数据立方体的计算速度,提高其查询性能。
     浓缩数据立方是一种有效缩小数据立方尺寸的机制,但仍然存在大量的前缀冗余,如小方内的前缀冗余和小方间的前缀冗余。对此,本文扩展实现了一种基于维层次的数据立方组织结构IDHC,它结合基本单元组的浓缩和小方内的前缀共享技术,利用维层次的特点,将具有相同聚集维集(或单值维集)的立方元组聚簇,同一簇内的元组以共享前缀的形式组织来进一步减小立方体的压缩尺寸。同时在物理存储这些元组时为了减小因共享前缀而进行大量元组之间的比较,又提出了批处理生成元组的算法。该算法消除了仅包含单个聚集维的数据小方内元组间的比较,并以批处理模式计算IDHC。
Data cube is the kernel conception of data warehouse and on-line analytical processing (OLAP). It usually needs to pre-compute and save the data cube in disk in order to promptly answer complex multidimensional queries in the OLAP applications. But the large size brings about a lot of trouble when they are computed and saved. To decrease disk storage cost and improve queries performance are very important but contradictive goals of data cube research. For the sake of resolving these problems, it needs to explore the effective structures of data cube.
     A new approach named ICODH is improved here, which computes tuples in recursion from bottom to top. When one dimension is computed, it computes from the coarse granularity to the fine granularity in circle. By sharing the sorting costs, it decreases the reading and writing operations of the disk in order to reduce the dimension hierarchy cube's computation time. On the other side, basing on the research of dimension hierarchy encoding technique, this paper also proposes an effective method to encode the dimension table. This approach preserves the semantic relations by virtue of the compressing mechanism. Through two sides, the data cube speeds up the computation and improves the performance of query.
     Condensed Data Cube has been proposed as an effective approach for reducing data cube's size, but there are still lots of prefix-redundancy in the data cube, such as intra-cuboid prefix redundancies and among-cuboid prefix redundancies. For this, a data cube structure named IDHC is extended here. It combines two techniques—BST condensing and intra-cuboid prefix-sharing. According to the character of dimension hierarchy, it clusters the cube tuples which has the same grouping dimension set (or the single dimension set), reduces the size of cube because the tuples in same cluster can share the prefix. Meantime, when these tuples are preserved in disk, the algorithm which tuples generated inthe same batch is proposed in order to eliminate tuple comparisons. Thisapproach eliminates comparisons among tuples in cuboid which containsonly one grouping dimension, and computes IDHC in batch mode wasproposed.

引文

[1]W.H.Inmon.Building the data warehouse.Canada:Johnwiley & Sons Inc,1992
    [2]Codd E F,et al.Providing OLAP(on-line analytical processing)to user-analysts:an IT mandate.IBM Research Lab,Technical Report,1993
    [3]王珊等编著.数据仓库技术与联机分析处理.科学出版社,1999
    [4]岳丽华,杨冬青编著.数据仓库系统全书.机械工业出版社,2003
    [5]Gray J,Bosworth A,Layman A,et al.Data cube:A relational aggregation operator generalizing group-by,cross-tab,and sub-totals.In:Su SYW,ed.Proceedings of the 12th International Conference on Data Engineering.New Orleans:IEEE Computer Society,1996.152-159
    [6]Chan C Y,Ioannidis Y E.Hierarchical cubes for range-sum queries.In:Atkinson M P,Orlowska M E,Valduriez P,eds.Proceedings of 25th International Conference on Very Large Databases.Edinburgh:Morgan Kaufmann Publishers,1999.675-686
    [7]Geffner S,Agrawal D,Abbadi A El.The dynamic datacubes.In:Zaniolo C,ed.Proceedings of the 7th International Conference on Extending Database Technology.Konstanz,2000.237-253
    [8]Mistry H,Roy P,Sudarshan S,et al.Materialized view selection and maintenance using multi-query optimization.In:Aref W G,ed.Proceedings of the ACM SIGMOD 2001.Santa Barbara:ACM Press,2001.307-318
    [9]Ross K A and Srivastava D..Fast approximate answers to aggregate queries on a data cube.In:Proceedings of the 24th Conference on VLDB,1997,116-185
    [10]Beyer K,Ramakrishnan R..Bottom-up computation of sparse and iceberg cubes.In:Delis A,Faloutsos C,Ghandeharizadeh S,eds.Proceedings of the ACM SIGMOD International Conference on Management of Data.Philadelphia:ACM Press,1999.359-370
    [11]彭木根编著.数据仓库技术与实现.电子工业出版社,2002
    [12]T.Tsuji,A.Isshiki,T.Hoehi,K.Higuchi.An Implementation Scheme of Multidimesional Arrays for MOLAP.In:Makoto Takizawa,eds.Dexa'02,13~(th)International Workshop on Database and Expert Systems Application Aix-en-Province,France:IEEE Computer Society,September 2002,2-6
    [13]冯建华,蒋旭东,周立柱.用于数据仓储的一种改进的多维存储结构.软件学报,2002,13(8):1423-1429
    [14]林杰斌,刘民德.数据挖掘与OLAP理论与实践.北京:清华大学出版社,2003
    [15]Sin Yeung Lee,Tok Wang Ling,Hua Gang Li.Hierarchical Compact Cube for Range-Max Queries.School of computing National University of Singapore,2003:59-62
    [16]D.Barbara,W.DuMouchel,C.Faloutsos,P.J.Haas,J.M.Hellerstein,Y.Ioannidis,H.V.Jagadish,T.Johnson,R.Ng,V.Poosala,K.A.Ross,K.C.Sevcik.The New Jersey Data Reduction Report.In IEEE Data Engineering bulletin 20,1997.3-45
    [17]I.Lazaridis,S.Mehrotra.Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure.In:Proc.ACM-SIGMOD.2001
    [18]D.Margaritis,C.Faloutsos,S.Thrun.NetCube:A Scalable Tool for Fast Data Mining and Compression.In:Proceedings of the 28~(th)Conference on VLDB.2001
    [19]Daniel BARBARA,Xintao WU.Loglinear-Based quasi cubes.Journal of Intelligent Information Systems,2001,16,255-276
    [20]李红松,黄厚宽.无须附加空间的数据立方体联机聚集.软件学报,2006,17(04):806-813
    [21]Nick Roussopoul,Yannis Kotidis et al.Cubetree:Organization of and bulk incremental updates on the data cube.In:Proceedings of the ACM SIGMOD International Conference on Management of Data,Tucson,Arizona,1997.89-99
    [22]T.Johnson and D.Shasha.Some Approaches to Index Design for Cube Forests.Data Engineering Bulletin,1997.20(1):27-35
    [23]Y.Sismanis et al.Dwarf:Shrinking the petacube.In:Franklin M J,Moon B,Ailamaki A,eds.Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data.Madison:ACM Press,2002.464-475
    [24]Y.Sismanis,N.Roussopoulos.The Polynomial Complexity of Fully Materialized Coalesced Cubes.in:M.A.Nascimento,M.T.Ozsu,D.Kossmann,R.J.Miller,J.A.Blakeley,K.B.Schiefer eds..Proceedings of the 30~(th)International Conference on Very Large Data Bases.Toronto,Canada.2004.San Francisco:Morgan Kaufmann,2004.540-511
    [25]Y.Sismanis et al.Hierarchical dwarfs for the rollup cube.In:Proceedings of the 6th ACM international workshop on Data warehousing and OLAP.New York,NY,USA:ACM Press,2003.17-24
    [26]胡孔法,董逸生,徐立臻.基于维层次的压缩Cube.东南大学学报,2004, 34(05):599-602
    [27]杨科华,胡孔法.基于维层次性的Data Cube存储优化方法.东南大学学报,2005,35(04):524-527
    [28]叶德谦,王桂兰,张杰.基于层次聚簇的CUBE存储结构的研究.计算机工程与应用.2005,9,174-176
    [29]梁作鹏,胡孔法,董逸生等.数据仓库系统中一种改进的维层次聚集Cube 存储结构.计算机研究发展,2005,42(08):1362-1368
    [30]Harinarayan V,et al.Implementing data cubes efficiently.In:Jagadish HV,Mumick IS,eds.Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.Montreal:ACM Press,1996.205-216
    [31]A Shukla,PM Deshpande,JF Naughton.Materialized view selection for multidimensional datasets.In:Gupta A,Shmueli O,Widom J,eds.Proceedings of the 24th International Conference on Very Large DataBase.New York:Morgan Kaufmann,1998.488-499
    [32]K.A.Ross,D.Srivastava,S.Sudarshan.Materialized view maintenance and integrity constraint checking:trading space for time.In:Proc.of the 1996ACM-SIGMOD Conf.,1996,.447-458.
    [33]J.Yang,K.K.Kadapalem,Q.Li.Fast computation of sparse databases algorithms for materialized view design in data warehousing.ln:Proc.of the 23rd Int'l Conf.VLDB,1997,136-145.
    [34]H.Gupta,I.S.Mumick.Selection of views to materialize under a maintenance cost constraint,in:Proceedings of the International Conference on Extended Database Theory,1999,453-470.
    [35]Y.Kotidis,N.Roussopoulos.DynaMat:a dynamic view management system for data warehouses.In:Proc.of the 1999 ACM SIGMOD Conf.,1999,pp.371-382.
    [36]W.F.Liang,H.Wang,M.E.Odowska.Materialized view selection under the maintenance time constraint.Data&Knowledge Engineering.2001(37):203-216.
    [37]李盛恩,王珊.Star Cube—一种高效的数据立方体实现方法.计算机研究与发展,2004,41(04):587-593
    [38]张柏礼,孙志挥,孙翔.物化视图选择的预处理算法.计算机研究与发展,2004,41(10):1645-1651
    [39]谭红星,周龙骧.多维数据实视图的动态选择.软件学报,2002,13(6):1090-1096
    [40]W.Wang et al.Condensed cube:An effective approach to reducing data cube size.In:Proceedings of the 18th International Conference on Data Engineering.San Jose:IEEE Computer Society,2002.155-165
    [41]J.Feng,Q.Fang,H.Ding.PrefixCube:prefix-sharing condensed data cube.In:Proceedings of ACM International Workshop on Data Warehousing and OLAP.Washington,D.C.,USA.2004.New York:ACM Press,2004.38-47
    [42]聂晶,冯剑琳,王元珍.一种新的前缀立方索引机制.小型微型计算机系统,2007,28(3):451-455
    [43]L.V.S Lakshmanan,J.Pei,J.Han.Quotient cube:How to summarize the semantics of a data cube.In:Bressan S,Chaudhri AB,Lee ML,Yu JX,Lacroix Z,eds.Proceedings of the 23rd International Conference on Very Large Data Bases.Hong Kong:Morgan Kaufmann,2002.778-789.
    [44]L.V.S Lakshmanan,J.Pei,Y Zhao.QC-Trees:An efficient summary structure for semantic OLAP.In:Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data.New York:ACM Press,2003.64-75
    [45]李盛恩,王珊.封闭数据立方体技术研究.软件学报,2004,15(08):1165-1171
    [46]Y.Feng,D.Agrawal,A.E.Abbadi,et al.Range CUBE:Efficient Cube Computation by Exploiting Data Correlation.in:Proceedings of the 20th International Conference on Data Engineering.Boston,MA,USA.2004.Washington,D.C.:IEEE Computer Society,2004.658-670
    [47]吴杰.数据立方体优化技术的研究:[硕士学位论文].长沙:中南大学图书馆,2007
    [48]武小荣,武庆华.数据仓库技术的研究现状和未来方向.现代电子技术,2002(6):6-8
    [49]周丽娟.数据仓库中的实视图的选取.计算机工程与应用,2003,39(34):94-96
    [50]Lijuan Zhou.Selecting Materialized Views in a Data Warehouse.IS&T/SPIE's 15th Annual Symposium,Storage and Retrieval for Media Databass.Jan.California,USA,2003.316-318
    [51]李晓君.基于数据仓库的多维数据模型设计和实现:[硕士学位论文].中国科学院计算机技术研究所,2000
    [52]Y.Zhao,P.Deshpande,J.F.Naughton:An Array-Based Algorithm for Simultaneous Multidimensional Aggregates.In:Proc.ofthe 1997 ACM-SIGMOD Conf.,1997,159-170
    [53]胡孔法,蒋蜂,宋爱波等.OLAP中聚集函数的更新.第十八届全国数据库会议论文集,2001.54-58
    [54]Frcderic Gingras,L.V.S.Lakshmanan.nD-SQL:A multi-dimensional language for interoperability and OLAP.In:Proceedings of the 25th Conference on VLDB.N,York,1998.134-145
    [55]孙惠琴,熊璋.立方体实体化的遗传算法设计与实现.北京航空航天大学学报,2004,30(7):610-613
    [56]冯玉才,向隆刚.维上带层次数据立方的自底向上计算.小型微型计算机系统,2004,25(08):1477-1481
    [57]骆吉洲,李建中,赵锴.大型压缩数据仓库上的Iceberg Cube算法.软件学报,2006,17(08):1743-1752
    [58]冯玉才,方琼,李曲,冯剑琳.PefixCube计算的优化.计算机科学,2004,31(12):81-85
    [59]Gray J,Barldaatov A.DBGen Synthetic Data Generator for SQL Tables and Text files on Windows Platforms.Gray J,Sundaresan P,Englert S.Quickly Generating Billion-Record Synthetic Databases.Proceedings of the SIGMOD,Minneapolis:ACM Press,1994.243-252http://research.microsoft.com/～Gray/DBGen/
    [60]C Hahn,S Warren,J London.Edited Synoptic Cloud Reports from Ships and Land Stations over the Globe.http://cdiae..esd.ornl.gov/ediac/ndps/ndp026b.html http://cdiac.esd.ornl.gov/ftp/ndp026b/SEP85L.DAT.Z.1996

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700