Research on Framework of Ontology Based Data Cleaning System

Original title: 基于本体的数据清洗系统框架研究
Author: Zhang Lianchao (张联超)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Data Quality; Data Cleaning; Ontology; Cleaning Rule; Task Structure; Framework
Degree year: 2008
Supervisor: Huang Zhiqiu (黄志球)
Discipline code: 081203
Degree-granting institution: Nanjing University of Aeronautics and Astronautics
Submission date: 2008-01-01

Abstract

With the rapid development of database technology and the diversification of data acquisition methods, data resources are becoming richer and data volumes are growing sharply. The value of data lies in its quality, and decision support built on poor-quality data cannot be trusted; the large volume of disorganized, poor-quality data has become a bottleneck for data applications. As the principal technique for addressing data quality problems, data cleaning has therefore become an active research topic. Most existing data cleaning research, however, operates on the textual values of the data and tends to ignore the semantic information the data carries. How to bring semantics into data cleaning has thus become a new research question in the field. This thesis addresses that question through the following work:
     First, against the background of informatization, data quality and data cleaning problems are studied. By analyzing the state of research in China and abroad, the thesis summarizes the shortcomings of existing data cleaning research and argues that ontologies and related technologies are a feasible way to address them.
     Second, knowledge representation and its conventional methods, together with ontologies and related technologies, are reviewed as the theoretical basis of the thesis.
     Third, an ontology-based data cleaning system framework is proposed. According to the characteristics of resource description, the framework is divided into an ontological expression model, which describes static semantic information, and a dynamic processing model, which describes process semantic information. Formal descriptions of each component of the models are given, along with the working principles and implementation mechanisms of the main modules during processing.
     Finally, building on the analysis of the two semantic models, the ontology-based data cleaning system framework is designed and implemented, and UML is used to model its static structure and dynamic behavior. The framework addresses the lack of semantic constraints and of support for automated reasoning in existing data cleaning research.
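
The split between a static ontological model and a dynamic processing model can be illustrated with a minimal sketch. It is not the thesis's implementation: the class and function names (`Ontology`, `canonical`, `clean`), the synonym-based rule, and the sample records are all assumptions chosen for illustration. The point is only that a value such as "NJ" is resolved through semantic knowledge rather than by comparing text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Ontology:
    """Static model (hypothetical): concept names and their known synonyms."""
    synonyms: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, name: str, *alternatives: str) -> None:
        # Store synonyms lower-cased so lookups are case-insensitive.
        self.synonyms[name] = {a.lower() for a in alternatives}

    def canonical(self, value: str) -> Optional[str]:
        """Return the concept name a raw value refers to, or None if unknown."""
        key = value.strip().lower()
        for name, alternatives in self.synonyms.items():
            if key == name.lower() or key in alternatives:
                return name
        return None


def clean(records: List[dict], field_name: str,
          ontology: Ontology) -> Tuple[List[dict], List[dict]]:
    """Dynamic model (hypothetical): one cleaning task applied to every record.

    Values that map to a concept are rewritten to its canonical name;
    values the ontology does not know are set aside for review.
    """
    cleaned, rejected = [], []
    for record in records:
        name = ontology.canonical(record[field_name])
        if name is None:
            rejected.append(record)
        else:
            cleaned.append({**record, field_name: name})
    return cleaned, rejected


# Illustrative data, not taken from the thesis.
ontology = Ontology()
ontology.add("Nanjing", "NJ", "南京")
records = [{"city": " nj "}, {"city": "南京"}, {"city": "Atlantis"}]
cleaned, rejected = clean(records, "city", ontology)
```

A purely text-level comparison would treat "nj", "南京", and "Nanjing" as three distinct values; here the ontology supplies the knowledge that they denote the same concept.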
