An Experimental Survey of MapReduce-Based Similarity Joins
  • Keywords: Similarity joins; Big Data systems; Performance evaluation; MapReduce
  • Publication title: Lecture Notes in Computer Science
  • Publication year: 2016
  • Volume: 9939
  • Issue: 1
  • Pages: 181-195
  • Full-text size: 1,389 KB
  • Author affiliations: Yasin N. Silva (16)
    Jason Reed (16)
    Kyle Brown (16)
    Adelbert Wadsworth (16)
    Chuitian Rong (16)

    16. Arizona State University, Glendale, AZ, USA
  • Series title: Similarity Search and Applications
  • ISBN: 978-3-319-46759-7
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
  • Volume sort order: 9939
Abstract
In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches, in part because some were developed in parallel while others were not implemented, even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other or which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
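As a rough illustration of the operation the abstract defines, the sketch below runs a naive similarity join on a single machine. It is not one of the MapReduce-based algorithms the paper surveys. The function names (`jaccard_distance`, `similarity_join`), the threshold parameter `eps`, the choice of Jaccard distance, and the sample data are all assumptions made for this example.

```python
from itertools import product


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def similarity_join(r, s, distance, eps):
    """Return every pair (x, y), x in r and y in s, with distance(x, y) <= eps.

    Naive nested-loop baseline: compares all |r| * |s| pairs.
    """
    return [(x, y) for x, y in product(r, s) if distance(x, y) <= eps]


# Hypothetical sample datasets: each record is a set of tokens.
R = [frozenset({"map", "reduce", "join"}), frozenset({"graph", "search"})]
S = [frozenset({"map", "reduce"}), frozenset({"index", "tree"})]

pairs = similarity_join(R, S, jaccard_distance, eps=0.5)
```

Here only the first record of `R` and the first record of `S` fall within the threshold (Jaccard distance 1/3). The quadratic number of comparisons in this baseline is the cost that distributed SJ techniques aim to reduce or partition across nodes.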