Data Repair Method Based on Currency Rules (基于时效规则的数据修复方法)
  • English title: Data Repair Algorithm Based on Currency Rules
  • Authors: DUAN Xu-Liang (段旭良); GUO Bing (郭兵); SHEN Yan (沈艳); SHEN Yun-Cheng (申云成); DONG Xiang-Qian (董祥千); ZHANG Hong (张洪)
  • Affiliations: College of Computer Science, Sichuan University (四川大学计算机学院); College of Information Engineering, Sichuan Agricultural University (四川农业大学信息工程学院); School of Control Engineering, Chengdu University of Information Technology (成都信息工程大学控制工程学院)
  • Keywords: data quality; data currency; data repairing; data cleaning; personal big data
  • Chinese journal title: RJXB
  • English journal title: Journal of Software
  • Publication date: 2019-03-15
  • Publisher: 软件学报 (Journal of Software)
  • Year: 2019
  • Issue: v.30
  • Funding: National Natural Science Foundation of China (61332001, 61772352, 61472050); Science and Technology Program of Sichuan Province (2019ZDZX0045, 2019ZDZX0010, 2018ZDZX0010, 2017GZDZX0003, 2018JY0182)
  • Language: Chinese
  • Pages: RJXB201903007
  • Page count: 15
  • CN: 03
  • ISSN: 11-2560/TP
  • Classification number: 99-113
Abstract
Data currency is an important factor in data quality: reliable currency is critical to the accuracy of data retrieval and the credibility of data analysis. Imprecise currency and outdated data cause many problems for big data applications and substantially limit the value that can be extracted from the data. For data with missing or inaccurate timestamps, recovering the exact timestamps is difficult, but the temporal order of the records can be restored according to currency rules, which is enough to meet the needs of data cleaning and many applications. Based on an analysis of the application requirements for data currency, this study first clarifies the concepts related to attribute currency rules and defines those rules formally. It then proposes a graph-model-based algorithm for discovering currency rules and an algorithm for repairing the temporal order of data. The algorithms are implemented and tested on real-world data for running efficiency and repair accuracy, and the factors that affect repair accuracy are analyzed. The experimental results show that the algorithms run efficiently and repair data currency effectively.
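The abstract gives no algorithmic detail, so the following Python sketch only illustrates the general idea of graph-based currency rules; it is not the authors' algorithm. It assumes a currency rule is a directed edge "value a is older than value b" for a single attribute, mined from value histories whose order is already known, and that records without timestamps are ordered by reachability in that graph. The names `discover_rules`, `reachable`, `repair_order`, and `min_support`, as well as the job-title data, are hypothetical.

```python
from collections import defaultdict


def discover_rules(histories, min_support=2):
    """Mine currency rules (earlier_value, later_value) from ordered histories.

    `histories` is a list of value sequences already in true time order.
    An edge is kept only if it is seen at least `min_support` times and
    its reverse is never observed (assumed filtering criterion).
    """
    counts = defaultdict(int)
    for seq in histories:
        for earlier, later in zip(seq, seq[1:]):
            if earlier != later:
                counts[(earlier, later)] += 1
    return {
        (a, b)
        for (a, b), n in counts.items()
        if n >= min_support and (b, a) not in counts
    }


def reachable(rules):
    """Map each value to the set of values that the rules place after it."""
    succ = defaultdict(set)
    for a, b in rules:
        succ[a].add(b)

    def visit(v, seen):
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                visit(w, seen)
        return seen

    nodes = {v for edge in rules for v in edge}
    return {v: visit(v, set()) for v in nodes}


def repair_order(values, rules):
    """Order one entity's untimestamped values so every rule is respected.

    Repeatedly emits a value that no other pending value must precede;
    when the rules conflict, it falls back to the input order.
    """
    later = reachable(rules)
    items = list(values)
    pending = list(range(len(items)))
    ordered = []
    while pending:
        ready = [
            j for j in pending
            if not any(
                items[j] in later.get(items[i], ())
                for i in pending if i != j
            )
        ]
        pick = ready[0] if ready else pending[0]
        ordered.append(items[pick])
        pending.remove(pick)
    return ordered


# Invented example: job titles of several people with known history.
histories = [
    ["assistant", "engineer", "manager"],
    ["assistant", "engineer"],
    ["engineer", "manager"],
]
rules = discover_rules(histories, min_support=2)
restored = repair_order(["manager", "assistant", "engineer"], rules)
```

In this sketch the restored sequence is a relative order only; no timestamp is reconstructed, which matches the abstract's point that order can be recovered even when exact times cannot.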
