Multi-Truth Finding Algorithms for Data Integration
  • English title: Multi-Truth Finding Algorithms for Data Integration
  • Authors: Chen Liefeng (陈烈锋); Xu Qinglin (许青林)
  • Authors (English): Chen Liefeng; Xu Qinglin; Department of Computer Science and Technology, Guangdong University of Technology
  • Keywords: data integration; data conflict; truth finding; multi-truth; source trustworthiness
  • Keywords (English): data integration; data conflicting; truth finding; multi-truth finding; source trustworthiness
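  • Chinese journal title: SJCJ
  • English journal title: Journal of Data Acquisition and Processing
  • Institution: School of Computers, Guangdong University of Technology
  • Publication date: 2019-05-15
  • Publisher: Journal of Data Acquisition and Processing (数据采集与处理)
  • Year: 2019
  • Issue: v.34; No.155
  • Funding: supported by the Major Special Project of the Guangdong Provincial Department of Science and Technology (2016B030306003)
  • Language: Chinese
  • Page: SJCJ201903008
  • Page count: 11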
  • CN:03
  • ISSN:32-1367/TN
  • Classification number: 74-84
Abstract
In the era of big data, large-scale data are often contributed by many data sources and serve many data-driven applications. Because sources differ in trustworthiness, they often provide conflicting data, making it difficult to determine which information is true. In recent years, truth finding, which resolves such conflicts by identifying the values from multiple sources that best match reality, has become an active research topic. Current truth finding algorithms usually assume that each attribute of an entity has exactly one true value, whereas in practice entities with multiple true values are more common. This paper proposes a multi-truth finding algorithm for multi-valued entities that casts multi-truth finding as a function optimization problem: solving the objective function yields the values with the highest confidence, which are selected as the truths of the entity. When computing the confidence of a claimed value, the algorithm uses an asymmetric support measure and corrects the confidence with the support of similar values. Experiments on several real-world data sets show that the proposed algorithm is more accurate than existing truth finding algorithms.
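The abstract does not give the objective function or the support formula, so the following is only a generic sketch of the ideas it names, not the authors' algorithm. It assumes a simple iterative scheme: a trust-weighted vote share per value, an asymmetric token-containment measure standing in for "support", and a fixed threshold for choosing several truths per entity. The function names `support` and `find_truths`, the parameters `rho` and `threshold`, and the sample data are all hypothetical.

```python
from collections import defaultdict


def support(v, u):
    """Asymmetric support that a claim of v lends to u (assumed measure):
    the share of u's tokens that also appear in v."""
    v_tokens, u_tokens = set(v.lower().split()), set(u.lower().split())
    return len(u_tokens & v_tokens) / len(u_tokens) if u_tokens else 0.0


def find_truths(claims, rho=0.3, threshold=0.5, iterations=10):
    """claims: {source: {entity: set of values}}.
    Returns {entity: set of values whose confidence reaches the threshold}."""
    trust = {s: 0.8 for s in claims}  # assumed uniform initial source trust
    conf = {}
    for _ in range(iterations):
        # 1. Base confidence: trust-weighted share of sources claiming each value.
        votes = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for s, by_entity in claims.items():
            for e, values in by_entity.items():
                total[e] += trust[s]
                for v in values:
                    votes[e][v] += trust[s]
        base = {e: {v: w / total[e] for v, w in vs.items()}
                for e, vs in votes.items()}
        # 2. Correct each value's confidence with support from the other values.
        conf = {
            e: {u: min(1.0, c + rho * sum(support(v, u) * vs[v]
                                          for v in vs if v != u))
                for u, c in vs.items()}
            for e, vs in base.items()
        }
        # 3. Source trust: mean confidence of the values the source claims.
        for s, by_entity in claims.items():
            scores = [conf[e][v] for e, values in by_entity.items() for v in values]
            if scores:
                trust[s] = sum(scores) / len(scores)
    return {e: {v for v, c in vs.items() if c >= threshold}
            for e, vs in conf.items()}


# Hypothetical claims about the authors of one book.
claims = {
    "s1": {"book": {"Ann Lee", "Bo Chen"}},
    "s2": {"book": {"Ann Lee", "Bo Chen"}},
    "s3": {"book": {"Ann Lee", "Bo Chen"}},
    "s4": {"book": {"Lee"}},
    "s5": {"book": {"Cy Diaz"}},
}
truths = find_truths(claims)
print(truths)
```

In this toy input, the two full names claimed by three sources are selected together as truths. The partial value "Lee" gains confidence from "Ann Lee" (support 1.0 in that direction, 0.5 in the other) but stays below the threshold, and the isolated value "Cy Diaz" is rejected.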