Noise-Tolerant Approximate Blocking for Dynamic Real-Time Entity Resolution
  • Authors: Huizhi Liang (23)
    Yanzhe Wang (23)
    Peter Christen (23)
    Ross Gayler (24)
  • Keywords: Entity Resolution; Real-time; Locality Sensitive Hashing; Indexing
  • Publication: Lecture Notes in Computer Science
  • Year: 2014
  • Volume: 8444
  • Issue: 1
  • Pages: 449-460
  • Affiliations:
    23. Research School of Computer Science, The Australian National University, Canberra, ACT, 0200, Australia
    24. Veda, Melbourne, VIC, 3000, Australia
  • ISSN: 1611-3349
Abstract
Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain, for example, wrong pronunciation or spelling errors. Many real-world applications require rapid responses to entity queries on dynamic datasets. This poses challenges for existing approaches, which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach that indexes records based on their distance ranges, using LSH and sorting trees within large hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach.
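
To make the idea of LSH as approximate blocking concrete, the following is a minimal sketch and not the method of the paper. It assumes MinHash signatures over character bigrams with a banding scheme, so that records with similar strings are likely to share at least one block. All names (`bigrams`, `minhash_signature`, `block_keys`, `LshIndex`) and the parameter values are hypothetical choices for illustration. The sorting trees that the abstract describes for large blocks are not implemented here.

```python
import hashlib
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a lower-cased string."""
    s = text.lower()
    # Fall back to the whole string when it is too short to form a bigram.
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def minhash_signature(tokens, num_hashes):
    """MinHash signature: for each seed, keep the minimum token hash value."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return tuple(signature)


def block_keys(record, bands=6, rows=2):
    """Split the signature into bands; each (band index, band values) is a block key."""
    sig = minhash_signature(bigrams(record), bands * rows)
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]


class LshIndex:
    """Dynamic index (assumed design): records can be inserted and queried at any time."""

    def __init__(self, bands=6, rows=2):
        self.bands = bands
        self.rows = rows
        self.blocks = defaultdict(set)  # block key -> set of record ids

    def insert(self, rec_id, record):
        for key in block_keys(record, self.bands, self.rows):
            self.blocks[key].add(rec_id)

    def query(self, record):
        """Return ids of all records sharing at least one block with the query."""
        candidates = set()
        for key in block_keys(record, self.bands, self.rows):
            candidates |= self.blocks.get(key, set())
        return candidates
```

In this sketch, a query string identical to an inserted record always retrieves that record, because both produce the same block keys. A noisy variant (for example, one with a spelling error) is retrieved only with some probability that depends on the bigram overlap and on the assumed `bands` and `rows` settings; more bands with fewer rows raise recall at the cost of larger candidate sets.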
