Research on Key Techniques of Data Cleaning for Data Integration

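English title: Data Cleaning in Data Integration
Author: 刘杰 (Liu Jie)
Degree level: Doctoral
Discipline: Computer Software and Theory
Chinese keywords: 数据集成 (data integration); 数据质量 (data quality); 完整性约束 (integrity constraint); 数据仓库 (data warehouse); 数据清理 (data cleaning); 性能优化 (performance optimization)
English keywords: Data integration; data quality; integrity constraint; data warehouse; data cleaning; optimization
Degree year: 2010
Advisor: 黄涛 (Huang Tao)
Discipline code: 081202
Degree-granting institution: 中国科学技术大学 (University of Science and Technology of China)
Thesis submission date: 2010-10-01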

Abstract

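Data integration is the process of bringing together data from different sources, in different formats and with different semantics, either physically or logically, so as to provide a unified view. Demand for data integration keeps growing, but because integration environments are complex, the completeness, consistency, and accuracy of data are hard to guarantee; data quality problems cause many enterprise data integration projects to be delivered late and greatly increase project cost. Data quality tools have become an indispensable part of enterprise data management, and data quality assurance has long been an important research area in computer science.
     Integrity constraints let users define, in a declarative language, the dependencies that data must satisfy, and they support implication reasoning among constraints; in classical relational database research, integrity constraints have long been used to guarantee the correctness of database schemas. How to build on integrity constraint theory to infer and mine data cleaning rules and to guarantee data consistency is a new hot topic in data quality assurance. This thesis studies that problem in the data integration setting and proposes new methods for detecting and cleaning inconsistent data automatically and efficiently.
     First, the thesis presents original work on how, once a data integration flow has been designed, the quality constraints that the sources must satisfy can be inferred from the quality constraints of the target, so that anomalous data can be detected at the sources. In a data integration flow, source data that has passed through the flow may violate the target's integrity constraints, causing the load to fail or leaving dirty data in the target database. Because data volumes are large and remote data transfer may be involved, locating the problem data by executing and debugging the flow is too costly. The thesis proposes Backwards Constraint Propagation (BCP): the data integration flow is first modeled as a directed acyclic graph, and the integrity constraints of the target database are then automatically propagated against the direction of the data flow toward the sources. The resulting source-side integrity constraints can be used to detect anomalous data, guiding the designer to filter it out or to improve the flow design. Constraint propagation rules for the basic relational algebra operations are defined and proved in first-order logic, and further propagation rules are defined for complex data operations annotated with two abstract operations, attribute mapping and tuple mapping, so that BCP supports most types of data operation. Case studies and experiments show that the method helps capture anomalous data and improves the efficiency of designing data integration flows.
     Second, the thesis proposes a consistent query answering method based on NULL repair, which filters out inconsistent attribute values at query time over inconsistent data sources. After data from multiple sources has been integrated, a large amount of data violating integrity constraints may remain, because there is not enough auxiliary information to clean it. Consistent Query Answering (CQA) studies how to obtain consistent results at query time through virtual repairs, but most existing methods rely on tuple-deletion repair semantics, which can lose information, and for most constraints computing CQA is an NP problem. We restrict constraints to the attribute level, meaning that only the attributes that violate a constraint count as inconsistent information, and propose a NULL-based repair semantics in which every inconsistent attribute value is replaced by NULL to obtain the virtual repair. Because a NULL repair may itself produce new inconsistent attributes, we propose a constraint extension algorithm that locates all potentially inconsistent attributes. On top of the NULL repair semantics we give an SQL rewriting algorithm that implements CQA. Experiments and performance analysis of the inconsistent-attribute location algorithm and the SQL rewriting method show that the computational cost has a linear relationship with the database size, the proportion of inconsistent data, and the type of query.
     Next, the thesis studies how to optimize the performance of data cleaning flows through flow refactoring, and how to extend the approach to web data mashups. As data volumes grow rapidly, performance becomes the bottleneck of data cleaning; optimizing the logical model of a cleaning flow can improve performance without adding resources. The thesis develops a general logical optimization framework for data cleaning flows: semantically equivalent structural transformations of a flow generate candidate flows, the execution cost of each candidate is predicted, and the best flow is selected. Operator components can be annotated with feature attributes describing their operational semantics, domain-specific flow transformation rules can be defined, and a cost partial-order graph built from the relative cost relationships between flows improves the accuracy of flow selection. To demonstrate the applicability and effectiveness of the framework, it is applied to a web data mashup tool in a case study, and experiments show that it effectively reduces mashup response time.
     Finally, the thesis implements OnceDQ, a model-driven development platform for data integration flows, and uses it to implement and apply the proposed data cleaning techniques. The platform builds on the Eclipse plug-in mechanism to make data operation components extensible, supports user-defined operation components and data source interfaces, and uses a code generation tool to turn user-designed flows automatically into platform-independent Java code that can be deployed across platforms.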
Data integration collects data that differ in source, format, and semantics, integrates them physically or logically, and provides a unified view for accessing them. Because of the large volume of data and the increasing complexity of business intelligence application requirements, it is hard to ensure the integrity, consistency, and accuracy of data. Data quality issues make the development of data integration projects error-prone and labor-intensive.
     Integrity constraints give users a declarative way to define data dependencies and thereby ensure consistency, and there is a sound theoretical basis for implication analysis of integrity constraints. Inferring and mining data quality rules on the basis of constraint theory is an active research area. This thesis addresses this problem in the integration scenario and presents new methods to detect and clean inconsistent data automatically and efficiently.
     First, we present an original method to infer data quality constraints for the data sources from the data quality constraints defined on the target database. The quality of the data in a source may be worse than designers expected at design time, when validation and transformation rules were specified; this causes the load of the target database to fail because of constraint violations, or flushes dirty data into the target database. Because the volume of data is large and data may have to be transferred between distributed servers, it is costly to debug a data integration flow (DIF) by executing it. In this thesis, we design a general framework for the problem, called Backwards Constraint Propagation (BCP), which automatically analyzes a DIF, generates data quality rules from the constraints defined in the data warehouse (DW), and propagates them backwards from the target to the sources. The derived data quality rules can be used to detect exceptional data in the data sources and to help designers improve their DIFs. BCP supports most relational algebra operators and data transformation functions by defining constraint propagation rules. Case studies and experiments demonstrate the correctness and efficiency of BCP.
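The idea of propagating a target constraint back to the sources can be illustrated with a minimal Python sketch. It is not the BCP rule set from the thesis: it assumes a linear flow rather than a general graph, handles only NOT NULL constraints, and uses two hypothetical operator types (`Rename` for a simple attribute mapping and `FilterNotNull` for a selection); all class, function, and column names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotNull:
    """Constraint: the named attribute must not be NULL."""
    attr: str


@dataclass
class Rename:
    """Hypothetical attribute mapping: output attribute -> input attribute."""
    out_to_in: dict


@dataclass
class FilterNotNull:
    """Hypothetical selection that drops rows where `attr` is NULL."""
    attr: str


def propagate_back(constraint, op):
    """Constraints the operator's input must satisfy so its output satisfies `constraint`."""
    if isinstance(op, Rename):
        # The constraint now applies to the input attribute feeding the output one.
        return [NotNull(op.out_to_in.get(constraint.attr, constraint.attr))]
    if isinstance(op, FilterNotNull):
        # The filter already guarantees the constraint, so nothing is required upstream.
        return [] if op.attr == constraint.attr else [constraint]
    raise TypeError(f"no propagation rule for {type(op).__name__}")


def bcp(flow, target_constraints):
    """Walk the flow (listed source -> target) in reverse, rewriting the constraints."""
    pending = list(target_constraints)
    for op in reversed(flow):
        pending = [c2 for c in pending for c2 in propagate_back(c, op)]
    return pending


def violations(rows, constraints):
    """Source rows (dicts) that break at least one derived constraint."""
    return [r for r in rows if any(r.get(c.attr) is None for c in constraints)]


flow = [Rename({"customer_id": "cust_no", "email": "mail"}), FilterNotNull("email")]
source_rules = bcp(flow, [NotNull("customer_id"), NotNull("email")])
source_rows = [{"cust_no": 7, "mail": None}, {"cust_no": None, "mail": "a@example.org"}]
print(source_rules)
print(violations(source_rows, source_rules))
```

In this sketch the constraint on `email` is discharged by the filter, so only the renamed `cust_no` constraint reaches the source, and the second source row is flagged without executing the flow.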
     Second, we present a method to automatically filter inconsistent attributes out of data sources, based on virtual repair with NULLs. Although integrity constraints can successfully capture data semantics, the actual data in a database often violates them. When a DIF can be transformed into a relational algebra query, we can apply consistent query answering (CQA) to obtain an answer that is true in every minimal repair of the inconsistent database. It has been proved that, for most constraints and queries, CQA is an NP problem when repairs are made by tuple deletions or tuple insertions. Furthermore, repairing by deleting tuples causes information loss. In this thesis we present a new repair semantics, named repairing with nulls, which replaces the inconsistent attribute values with nulls. To capture all the inconsistent attribute values, we study the transitivity of nulls and provide an algorithm that extends the original constraints. Under repairing with nulls there is only one repair, and CQA can be computed in PTIME by SQL query rewriting. Finally, we study the performance of our new approach to CQA through detailed experiments.
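As a rough illustration of answering a query over a virtual repair with nulls, the sketch below uses Python's standard `sqlite3` module. The table `emp`, the functional dependency dept -> manager, and the rewritten query are assumptions made for this example; the sketch does not reproduce the thesis's rewriting algorithm, and it omits the constraint extension step that handles nulls introduced by the repair.

```python
import sqlite3

# Hypothetical schema: the functional dependency dept -> manager should hold,
# but the two 'sales' rows disagree on the manager.
SETUP = """
CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT, manager TEXT);
INSERT INTO emp VALUES (1, 'sales', 'Ann'), (2, 'sales', 'Bob'), (3, 'ops', 'Cy');
"""

# Rewritten form of "SELECT id, dept, manager FROM emp": a manager value is
# returned as NULL whenever its department has more than one distinct manager.
# The stored data is never modified; the repair exists only in the query.
REWRITTEN = """
SELECT e.id,
       e.dept,
       CASE WHEN (SELECT COUNT(DISTINCT m.manager)
                  FROM emp AS m
                  WHERE m.dept = e.dept) > 1
            THEN NULL
            ELSE e.manager
       END AS manager
FROM emp AS e
ORDER BY e.id
"""


def consistent_answer(conn):
    """Run the rewritten query and return its rows."""
    return conn.execute(REWRITTEN).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)
for row in consistent_answer(conn):
    print(row)
```

Both `sales` rows keep their `id` and `dept` values and lose only the conflicting `manager` value, whereas a tuple-deletion repair would discard the rows entirely.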
     Third, we study how to improve the performance of data cleaning processes by automatically refactoring the structure of their data flows. First, a set of operational semantics features is selected for annotating the operators in data flows, and refactoring rules are defined to generate all candidate data flows that are semantically equivalent. Then a heuristic algorithm is described that searches accurately and quickly for the data flow with minimal execution time by constructing a partially ordered set of data flows based on their cost estimates. To validate the framework, we apply it to mashups. Mashup tools usually let end users build complex mashups quickly and graphically, using pipes to connect web data sources into a data flow. Because end users have varying degrees of technical expertise, the data flows they design may be inefficient, which increases the response time of mashups. A case study shows that the framework is applicable to general mashup data flows without knowledge of the complete operational semantics of their operators, and experiments demonstrate the efficiency improvement.
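A small Python sketch can make the refactoring idea concrete. It is a simplification under stated assumptions: flows are linear, the only refactoring rule is swapping adjacent operators whose annotated read and write sets do not conflict, and candidates are ranked by a plain absolute cost estimate rather than the partially ordered set of relative costs described in the thesis. The operator names, costs, and selectivities are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Operator annotated with the attributes it reads and writes (hypothetical features)."""
    name: str
    reads: frozenset
    writes: frozenset
    cost_per_row: float
    selectivity: float  # fraction of input rows passed on


def commute(a, b):
    """Adjacent operators may swap if neither writes what the other reads or writes."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


def estimated_cost(flow, rows):
    """Sum of per-operator costs, shrinking the row count by each selectivity."""
    total = 0.0
    for op in flow:
        total += rows * op.cost_per_row
        rows *= op.selectivity
    return total


def equivalent_flows(flow):
    """All orderings reachable by repeatedly swapping adjacent commuting operators."""
    start = tuple(flow)
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for i in range(len(cur) - 1):
            if commute(cur[i], cur[i + 1]):
                nxt = cur[:i] + (cur[i + 1], cur[i]) + cur[i + 2:]
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def best_flow(flow, rows):
    """Candidate with the lowest estimated cost."""
    return min(equivalent_flows(flow), key=lambda f: estimated_cost(f, rows))


geocode = Op("geocode", frozenset({"address"}), frozenset({"lat", "lon"}), 5.0, 1.0)
near_me = Op("near_me", frozenset({"lat", "lon"}), frozenset(), 0.1, 0.5)
recent = Op("filter_recent", frozenset({"date"}), frozenset(), 0.1, 0.2)

designed = [geocode, near_me, recent]
print([op.name for op in best_flow(designed, rows=1000)])
```

Here the cheap, selective `filter_recent` can move ahead of the expensive `geocode` because the two touch disjoint attributes, while `near_me` must stay after `geocode` because it reads the coordinates that `geocode` writes.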
     Finally, we study a model-driven development method for data integration processes and implement a development platform. We discuss the details of implementing our research work in this system.
