The Parallel Entity Resolution Algorithm Based on Learning in Data Warehouse

Original title (Chinese): 数据仓库下基于学习的并行实体解析算法研究
Authors: LIU Ye (刘叶); WU Sheng (吴晟); WU Xing-jiao (吴兴蛟); ZHOU Hai-he (周海河); LI Ying-na (李英娜); ZHANG Jing (张晶)
Keywords: data warehouse; data quality; entity resolution; autonomous learning; parallel computing
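Chinese journal title: RJDK
English journal title: Software Guide
Institution: School of Information Engineering and Automation, Kunming University of Science and Technology
Publication date: 2018-01-23 08:47
Publisher: 软件导刊 (Software Guide)
Year: 2018
Issue: v.17; No.184
Language: Chinese
Page: RJDK201802007
Page count: 5
CN: 02
ISSN: 42-1671/TP
Classification number: 23-26+31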

Abstract

Traditional entity resolution algorithms run on a single machine and rely on manually set attribute weights and thresholds, which makes the recognition result depend heavily on human experience and makes it hard to process massive data quickly and effectively. To address this, the paper uses the MapReduce computing model on the Hadoop framework in a multi-node distributed environment: a network is adjusted iteratively to learn the intrinsic relationships between attributes, together with parameters such as the attribute weights and the threshold, and the resulting model is then validated on a real data set stored in a Hive data warehouse. Experiments were run on 5,000 and on 9,000 records. The results show that the learning-based parallel entity resolution algorithm achieves relatively high accuracy, recall, and F1 values. The authors conclude that the algorithm can process massive data quickly and effectively, reduces the errors introduced by manual experience, and improves both the accuracy and the efficiency of recognition.
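
The general idea in the abstract (learn attribute weights and a threshold instead of setting them by hand, compare records in a map/reduce fashion, then score the result) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not specify the network, so a plain logistic regression stands in for it; the two in-process functions `map_phase` and `reduce_phase` stand in for Hadoop MapReduce jobs; the attribute names `name` and `city`, the first-letter blocking key, and the use of `difflib` string similarity are all hypothetical; and "accuracy" is treated here as precision.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

ATTRIBUTES = ("name", "city")  # hypothetical schema, not from the paper


def similarity_vector(a, b):
    """One string-similarity value in [0, 1] per attribute."""
    return [SequenceMatcher(None, a[k], b[k]).ratio() for k in ATTRIBUTES]


def score(weights, bias, x):
    """Weighted sum of similarities; -bias plays the role of the threshold."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train(vectors, labels, epochs=500, lr=0.5):
    """Learn weights and threshold by logistic regression with SGD
    (an assumed stand-in for the network mentioned in the abstract)."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            p = 1.0 / (1.0 + math.exp(-score(weights, bias, x)))
            grad = p - y  # gradient of the log loss w.r.t. the score
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def map_phase(records):
    """Map step: group records by a blocking key (first letter of name)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["name"][:1]].append(record)
    return blocks


def reduce_phase(blocks, weights, bias):
    """Reduce step: compare pairs inside each block, keep those above threshold."""
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if score(weights, bias, similarity_vector(a, b)) > 0:
                matches.append((a["id"], b["id"]))
    return matches


def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

In this sketch, `train` would be fed similarity vectors of labeled record pairs, and the learned `weights` and `bias` would then be passed to `reduce_phase`; in a real deployment each block would be handled by a separate reducer rather than a local loop.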
