Similar Duplicate Record Detection of Massive Data Based on Partition (基于划分的海量数据相似重复记录检测)

English title: Similar Duplicate Record Detection of Massive Data Based on Partition
Authors: LI Li (李莉); ZHANG Xiao-Wen (张晓雯)
Affiliation: School of Computer Science and Communication Engineering, Jiangsu University (江苏大学计算机科学与通信工程学院)
Keywords: data quality; data cleaning; similar duplicate records; partition; SNM algorithm
Journal code: XTYY
Journal: Computer Systems & Applications (计算机系统应用)
Publication date: 2019-03-15
Year: 2019
Volume: 28
Issue: 03
Pages: 174-180
Page count: 7
CN: 11-2854/TP
Article ID: XTYY201903025
Language: Chinese

Abstract

To address data redundancy and low query efficiency in the massive data stored in social-engineering databases (社工库), this paper proposes a partition-based sorted neighborhood method (SNM). Social-engineering data collected through different channels and stored in different formats are first integrated into a massive data set that can be stored as a two-dimensional table. The data set is then partitioned into clusters, and an improved SNM is applied to the smaller data set in each cluster to produce the final similar-duplicate-record detection results. Experiments and comparative analysis show that combining partitioning with SNM improves both the time efficiency and the detection accuracy of similar duplicate record detection on massive data.
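The two-stage idea in the abstract (partition first, then run a sorted neighborhood pass inside each cluster) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the partitioning rule or the SNM improvements, so the partition key (a phone-number prefix), the sort key, the window size, the threshold, the field-averaged `difflib` similarity, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean per-field string similarity in [0, 1] (an assumed measure)."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def partition(records, key_func):
    """Group record indices into clusters that share the same partition key."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[key_func(rec)].append(idx)
    return clusters


def snm(records, indices, sort_key, window=3, threshold=0.85):
    """Basic (unimproved) sorted neighborhood pass over one cluster."""
    ordered = sorted(indices, key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        # Compare each record only with the next (window - 1) records.
        for j in ordered[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, key_func, sort_key, window=3, threshold=0.85):
    """Partition the data set, then run SNM independently in each cluster."""
    pairs = set()
    for indices in partition(records, key_func).values():
        pairs |= snm(records, indices, sort_key, window, threshold)
    return pairs


# Hypothetical records: (name, phone, email).
records = [
    ("zhang wei", "13800000001", "zhangwei@example.com"),
    ("zhang  wei", "13800000001", "zhangwei@example.com"),
    ("li na", "13900000002", "lina@example.com"),
    ("li na", "13900000002", "li.na@example.com"),
    ("wang fang", "13700000003", "wangfang@example.com"),
]

duplicates = detect_duplicates(
    records,
    key_func=lambda r: r[1][:3],  # assumed partition key: phone prefix
    sort_key=lambda r: r[0],      # assumed sort key: name
)
print(sorted(duplicates))
```

Because comparisons happen only inside a cluster and only within the sliding window, the number of pairwise comparisons grows with cluster size rather than with the size of the whole data set, which is the source of the time saving the abstract reports. The trade-off in this sketch is that duplicates whose partition keys differ are never compared.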

