Research and Application of Cleaning Methods for Duplicate and Incomplete Data

English title (as published): The Research and Application of Duplicated Records and Incomplete Data's Cleaning Approach
Author: 鲁均云 (Lu Junyun)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: data cleaning; data quality; approximately duplicate records; inner-code sequence value; incomplete data; cleaning system; extensibility
Degree year: 2009
Advisor: 李星毅 (Li Xingyi)
Discipline code: 081203
Degree-granting institution: Jiangsu University
Submission date: 2009-12-01

Abstract

As enterprises adopt information technology, they accumulate ever more data. Hidden in this rapidly growing data is information that is essential for making sound, well-founded decisions and for improving competitiveness. Data warehouses emerged to meet the needs of decision analysis. During warehouse construction, however, duplicate, incomplete, and anomalous data enter the warehouse for a variety of reasons; in other words, the data has quality problems. Because high-quality data is a precondition for decision support, data cleaning is necessary to improve data quality.
     This thesis first reviews data preprocessing, analyzes the need for data cleaning and the state of research in China and abroad, and introduces the theory of data quality and data cleaning, including the definition, principles, basic process, and main techniques of data cleaning. It then studies in depth the detection of approximately duplicate records and the cleaning of incomplete data, improves the relevant algorithms, and on that basis designs a prototype data cleaning system. The main contributions are as follows:
     (1) For duplicate record cleaning, the thesis proposes a method for detecting approximately duplicate records based on clustering by inner-code sequence value. The method first selects a key field, or certain character positions of a field, and uses the inner-code sequence values of the characters to cluster the large dataset into many small datasets. It then computes a weight for each field with a rank-based method and detects and eliminates approximately duplicate records within each small dataset. To avoid missing duplicates because of a poorly chosen key, detection is run in multiple passes. Experiments show that the method achieves good detection precision and time efficiency.
     (2) For incomplete data cleaning, the thesis proposes a method based on WaveCluster and weighted 1-nearest-neighbor (1-NN). The dataset is first split into a complete record set and an incomplete record set. The complete record set is clustered with the WaveCluster algorithm into subclasses. Each incomplete record is then checked for usability, the weighted 1-NN method finds its nearest-neighbor subclass, and the missing attribute values are filled in. Experiments show that the method fills missing values well.
     (3) Building on an analysis of several cleaning frameworks, a prototype data cleaning system is designed. The system has open algorithm, rule, and assessment libraries; it includes a rich set of cleaning algorithms and a large number of cleaning rules and provides multiple quality assessment metrics. An analysis of the main functions of each architectural module and of their application shows that the system has good extensibility, flexibility, and interactivity.
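
The duplicate detection approach summarized in the abstract (block records by the code values of a key field's leading characters, weight the fields by rank, compare records only within a block, and repeat with other key fields) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: Python's `ord()` stands in for the inner code, rank order centroid weights stand in for the rank-based weighting, `difflib.SequenceMatcher` stands in for the field similarity measure, and the function names, sample records, and 0.85 threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_order_centroid(n):
    """Weights for n fields ranked from most to least important (they sum to 1).
    Assumption: one common rank-based scheme, not necessarily the thesis's."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


def block_key(value, prefix_len=1):
    """Code points of the leading characters of a field value.
    Assumption: ord() stands in for the character's inner code."""
    return tuple(ord(ch) for ch in value.strip()[:prefix_len])


def record_similarity(a, b, fields, weights):
    """Weighted sum of per-field string similarities, each in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[f], b[f]).ratio()
        for f, w in zip(fields, weights)
    )


def find_duplicates(records, ranked_fields, key_fields, threshold=0.85):
    """Return index pairs (i, j), i < j, judged to be approximate duplicates."""
    weights = rank_order_centroid(len(ranked_fields))
    pairs = set()
    for key in key_fields:  # one detection pass per key field
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[block_key(rec[key])].append(idx)
        for members in blocks.values():  # compare only inside a small block
            for i, j in combinations(members, 2):
                score = record_similarity(records[i], records[j], ranked_fields, weights)
                if score >= threshold:
                    pairs.add((i, j))
    return pairs


# Hypothetical sample data: record 3 has a typo in the first letter of the name.
records = [
    {"name": "Zhang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Zhang Wej", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Li Na", "city": "Nanjing", "phone": "025-8379002"},
    {"name": "Chang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
]
fields = ["name", "city", "phone"]
single_pass = find_duplicates(records, fields, key_fields=["name"])
multi_pass = find_duplicates(records, fields, key_fields=["name", "city"])
```

In this sample, a single pass keyed on `name` puts record 3 in a different block from records 0 and 1, so it cannot be compared with them; adding a second pass keyed on `city` brings all three into one block. This is the kind of missed record that multiple passes are meant to recover.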

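The incomplete data approach summarized in the abstract (cluster the complete records, check whether an incomplete record is usable, find its nearest subclass by a weighted distance, and fill in the missing values) can be sketched in the same spirit. This is not WaveCluster: the clustering below is a simplified grid-density stand-in that omits the wavelet transform. The usability rule (a maximum ratio of missing attributes), the use of cluster centroids for the nearest-subclass search, filling with the centroid value, and all names and data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def grid_cluster(rows, cell=1.0, min_pts=2):
    """Simplified stand-in for WaveCluster: keep grid cells holding at least
    min_pts rows and join dense cells that touch (diagonals included)."""
    cells = defaultdict(list)
    for row in rows:
        cells[tuple(math.floor(v / cell) for v in row)].append(row)
    dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            current = stack.pop()
            members.extend(cells[current])
            for other in dense - seen:
                if all(abs(a - b) <= 1 for a, b in zip(current, other)):
                    seen.add(other)
                    stack.append(other)
        clusters.append(members)
    return clusters


def centroid(members):
    """Per-attribute mean of the rows in one cluster."""
    return [sum(col) / len(col) for col in zip(*members)]


def impute(record, clusters, weights, max_missing_ratio=0.5):
    """Fill None values from the nearest cluster centroid, or return None
    when too many attributes are missing (assumed usability rule)."""
    missing = {i for i, v in enumerate(record) if v is None}
    if len(missing) / len(record) > max_missing_ratio:
        return None
    observed = [i for i in range(len(record)) if i not in missing]
    centers = [centroid(m) for m in clusters]

    def distance(center):
        # Weighted Euclidean distance over the observed attributes only.
        return math.sqrt(
            sum(weights[i] * (record[i] - center[i]) ** 2 for i in observed)
        )

    nearest = min(centers, key=distance)
    return [nearest[i] if v is None else v for i, v in enumerate(record)]


# Hypothetical complete records: two dense regions plus one isolated point.
complete = [
    (1.0, 1.0), (1.5, 1.5), (2.0, 1.0), (2.5, 1.5),
    (8.0, 8.0), (8.5, 8.5),
    (4.2, 6.9),
]
clusters = grid_cluster(complete)
weights = [0.5, 0.5]
filled_a = impute([2.0, None], clusters, weights)
filled_b = impute([None, 8.0], clusters, weights)
rejected = impute([None, None], clusters, weights)
```

The isolated point falls in a sparse cell and joins no cluster, so it cannot influence the filled values. With only two attributes the weights do not change which cluster is nearest; they matter once several observed attributes compete.
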
