Research on the Theory and Methods of Data Lineage Tracing in Data Warehouses
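Abstract
In a data warehouse system, the precise history of a warehouse data item, that is, the description and information covering its entire course from acquisition through transformation and integration to its current state, is called its data lineage. Data lineage has two parts: the initial data set and the data processing applied to that data set. The process of obtaining data lineage is called data lineage tracing. Data lineage tracing is a recent frontier topic in data warehouse research: it supports broader and deeper data analysis, and it helps technical staff verify the correctness of source data, cleaning rules, and transformation processing, thereby improving the quality of the data warehouse.
    Starting from the definition of the derivation set, the author identifies general laws of derivation sets, proves theorems about them, proposes a derivation-set tracing method based on weak inversion and verification of attribute mappings, gives a series of algorithms for derivation-set tracing, and designs the basic process of data lineage tracing, forming a systematic theory and method of data lineage tracing. The main work and contributions of the thesis are as follows.
    The author first completes and refines the concepts related to data lineage, gives a formal definition of the derivation set, and introduces the concepts of supplementary-set independence and supplementary-set dependence. These definitions and concepts are the basis for tracing derivation sets and the criteria for checking tracing results. On this basis, the author proves five theorems about derivation sets, which establish the relationships between transformations and attribute mappings, between derivation sets and attribute mappings, and between derivation sets and contribution sets, and which show that several classes of transformations are supplementary-set independent. These theorems give the basic grounds and guiding ideas for constructing and verifying weak derivation sets according to the invertibility of attribute mappings, and they enrich the basic theory of data lineage tracing.
    Based on the ideas of invertibility and weak invertibility, the author proposes Wivem (Weak Inversion and VErification of attribute Mapping), a method for computing the (attribute-level) derivation set of an attribute mapping. On this basis, the author analyzes the invertibility of transformations, gives a formal definition of weakly invertible transformations, and computes the (tuple-level) weak derivation set of a transformation by applying one-dimension merging and multi-dimension merging to the weak derivation sets obtained from the weak inverse mappings of a weakly invertible transformation.
    The author proves a uniqueness theorem and a solution theorem for the derivation sets of basic operations. The uniqueness theorem guarantees that the derivation set computed for a basic operation is correct, and the solution theorem supplies solution formulas. With these formulas, the exact derivation set of a basic operation can be computed directly, without verification and generally without accessing the input data set, so computation is efficient.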
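As a rough illustration of what a direct solution formula for a basic operation can look like, the sketch below uses the lineage definitions commonly given for relational selection and projection. These are assumed stand-ins, not the thesis's own solution formulas, and every function name and the sample data are hypothetical. Selection shows the case where no access to the input is needed; projection is included as a contrasting case that does scan the input.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Relational selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def selection_derivation(output_row: Row) -> List[Row]:
    """Assumed formula: a selected row is copied unchanged, so it is its own
    derivation set. The input data set is never read."""
    return [dict(output_row)]


def project(rows: List[Row], attrs: List[str]) -> List[Row]:
    """Relational projection with duplicate elimination."""
    result: List[Row] = []
    for r in rows:
        projected = {a: r[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


def projection_derivation(rows: List[Row], output_row: Row) -> List[Row]:
    """Assumed formula: every input row that agrees with the output row on
    the projected attributes. This case does scan the input."""
    return [r for r in rows
            if all(r[a] == v for a, v in output_row.items())]


# Hypothetical sample data, for illustration only.
orders = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 20},
    {"id": 3, "region": "east", "amount": 30},
]

big_orders = select(orders, lambda r: r["amount"] > 15)
regions = project(orders, ["region"])
east_sources = projection_derivation(orders, {"region": "east"})
```

Under these assumed definitions no verification pass is required, because each formula returns exactly the contributing rows by construction.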
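    Based on derived relations, the author gives a formal definition of the derivation set of a transformation diagram and proves the transitivity theorem for derivation sets. On this basis, the basic process for tracing the data lineage of a transformation diagram is designed. For the stage of constructing weak derivation sets, the concept of continuing traceability is introduced, together with an algorithm for deciding continuing traceability and an algorithm for filtering weak inverse mappings that are continuing traceable; for the stage of verifying weak derivation sets, verification algorithms are given for the different types of transformations and attribute mappings.
    To validate the theory and methods proposed in the thesis, the author ran data lineage tracing experiments on two representative relational queries from the TPC-H benchmark, Q2 and Q12, which confirmed the effectiveness of the derivation-set theory and methods, and compared the results in detail with the method of tracing the query process based on transformation properties proposed by Dr. Cui. The experimental results show that, judged by the main indicators of tracing response time, storage requirements, and precision of results, the tracing performance of the Wivem method is better overall than that of Dr. Cui's method.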
The exact history of a given warehouse data item, including the complete description of its acquisition, transformation, and integration, is termed its data lineage. Data lineage has two parts: (1) the set of source data items that exactly produce the warehouse data item; (2) the processes applied to that set of source data items. Identifying the data lineage of a given warehouse data item is termed data lineage tracing. As one of the newest research problems in data warehouse systems, data lineage tracing can play an important role in in-depth data analysis and can help validate the source data, cleaning rules, and transformation rules, thereby improving the quality of the data warehouse.
    Beginning with the formal definition of the derivation set, this thesis identifies general laws of derivation sets, proves theorems about them, proposes an approach based on weak inversion and verification of attribute mappings to trace data lineage, gives a series of algorithms for data lineage tracing, and describes the basic process of data lineage tracing, forming a systematic theory and approach. The primary work and contributions of this thesis are as follows.
    First, the concepts related to data lineage tracing are completed and refined, and formal definitions of the derivation set and the supplementary set are provided. These definitions form the basis for derivation set tracing and serve as the criterion for verifying tracing results. The thesis then proves five theorems about derivation sets, which establish the relationships between transformation and attribute mapping, between derivation set and attribute mapping, and between derivation set and contribution set, as well as the supplementary-set correlation of transformations. These theorems are the basis and guideline for constructing and verifying the weak derivation set according to the invertibility of attribute mappings, and they enrich the basic theory of data lineage tracing.
    Next, this thesis presents a data lineage tracing approach, Wivem (Weak Inversion and VErification of attribute Mapping), which computes the (attribute-level) derivation set of an attribute mapping. The thesis then analyzes the invertibility of transformations, presents the formal definition of a weakly invertible transformation, and computes the (tuple-level) derivation set of a transformation by one-dimension merging and multi-dimension merging of the weak derivation sets obtained from the weak inverse attribute mappings. The thesis also proves the uniqueness and solution theorems for the derivation sets of basic relational operators.
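The two-phase idea of weak inversion followed by verification can be sketched as follows. This is only one possible reading, under the assumption that a weak inverse returns a superset of candidate source rows and that verification re-applies the forward mapping to discard false candidates; the attribute mapping `to_quarter`, the helper names, and the sample rows are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List

Row = Dict[str, object]


def to_quarter(order_date: str) -> str:
    """Forward attribute mapping (hypothetical): 'YYYY-MM-DD' -> 'YYYY-Qn'."""
    year, month, _day = order_date.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"


def weak_inverse(quarter_label: str, source: List[Row]) -> List[Row]:
    """Weak inversion: recover only the year from the label, so the result
    is a superset of the rows that really produced it."""
    year = quarter_label.split("-")[0]
    return [r for r in source if str(r["order_date"]).startswith(year)]


def verify(candidates: List[Row], quarter_label: str) -> List[Row]:
    """Verification: re-apply the forward mapping and keep only the
    candidates that reproduce the traced value."""
    return [r for r in candidates
            if to_quarter(str(r["order_date"])) == quarter_label]


def trace(quarter_label: str, source: List[Row]) -> List[Row]:
    """Weak inversion followed by verification."""
    return verify(weak_inverse(quarter_label, source), quarter_label)


# Hypothetical source rows, for illustration only.
source_orders = [
    {"id": 1, "order_date": "2002-02-14"},
    {"id": 2, "order_date": "2002-07-01"},
    {"id": 3, "order_date": "2002-03-30"},
    {"id": 4, "order_date": "2001-03-09"},
]

candidates = weak_inverse("2002-Q1", source_orders)
traced = trace("2002-Q1", source_orders)
```

In this toy case the weak inverse narrows four rows to three candidates, and verification removes the one candidate that falls in a different quarter. The split matters when the cheap inverse is loose but the forward mapping is inexpensive to re-evaluate.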
    Then, this thesis presents the formal definition of the derivation set of a transformation diagram, proves the derivation set transitivity theorem, and shows the basic process for tracing a transformation diagram. For the construction of weak derivation sets, the thesis introduces the concept of continuing traceability and provides a decision algorithm for the continuing traceability of a transformation sequence and a filtering algorithm for continuing traceable weak inverse attribute mappings. For the verification of weak derivation sets, the thesis gives a series of verification algorithms suited to the different types of attribute mapping or transformation.
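To show how transitivity lets a trace pass backward through several transformations, the sketch below chains two steps and traces an output row back one step at a time. It assumes each step exposes a forward function and a per-step tracer and that intermediate results are kept; the step functions, the helpers `run_chain` and `trace_chain`, and the data are hypothetical, and the thesis's own tracing procedure may differ.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Forward = Callable[[List[Row]], List[Row]]
Tracer = Callable[[List[Row], Row], List[Row]]
Step = Tuple[Forward, Tracer]


def filter_shipped(rows: List[Row]) -> List[Row]:
    """Step 1: keep shipped orders only."""
    return [r for r in rows if r["status"] == "shipped"]


def trace_filter(inputs: List[Row], out_row: Row) -> List[Row]:
    """A filtered row derives from the identical input row."""
    return [r for r in inputs if r == out_row]


def sum_by_region(rows: List[Row]) -> List[Row]:
    """Step 2: total amount per region."""
    totals: Dict[object, int] = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])
    return [{"region": k, "total": v} for k, v in totals.items()]


def trace_sum(inputs: List[Row], out_row: Row) -> List[Row]:
    """An aggregate row derives from every input row in its group."""
    return [r for r in inputs if r["region"] == out_row["region"]]


def run_chain(source: List[Row], steps: List[Step]) -> List[List[Row]]:
    """Run the steps in order, keeping every intermediate data set."""
    datasets = [source]
    for forward, _tracer in steps:
        datasets.append(forward(datasets[-1]))
    return datasets


def trace_chain(datasets: List[List[Row]], steps: List[Step],
                output_row: Row) -> List[Row]:
    """Trace backward step by step: the derivation set across the chain is
    the union of the derivation sets of the intermediate rows."""
    targets = [output_row]
    for (_forward, tracer), inputs in zip(reversed(steps),
                                          reversed(datasets[:-1])):
        traced: List[Row] = []
        for t in targets:
            for r in tracer(inputs, t):
                if r not in traced:
                    traced.append(r)
        targets = traced
    return targets


# Hypothetical source data, for illustration only.
source = [
    {"id": 1, "region": "east", "status": "shipped", "amount": 10},
    {"id": 2, "region": "east", "status": "open", "amount": 5},
    {"id": 3, "region": "west", "status": "shipped", "amount": 7},
    {"id": 4, "region": "east", "status": "shipped", "amount": 20},
]
steps = [(filter_shipped, trace_filter), (sum_by_region, trace_sum)]
datasets = run_chain(source, steps)
east_lineage = trace_chain(datasets, steps, {"region": "east", "total": 30})
```

Storing every intermediate data set keeps the sketch simple but costs space; a tracer that works from an inverse mapping instead would avoid that storage, which is the trade-off the storage-cost comparison in the experiments is concerned with.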
    Finally, to validate the theory and approach, this thesis conducts data lineage tracing experiments with relational queries Q2 and Q12 of TPC Benchmark H and compares the tracing performance with the query-process tracing approach presented by Dr. Cui. The results show that the Wivem approach performs better overall than Cui's approach in terms of tracing time, storage cost, and precision of the tracing result.
