Efficient top-k similarity document search utilizing distributed file systems and cosine similarity
详细信息    查看全文
  • 作者:Mahmoud Alewiwi ; Cengiz Orencik ; Erkay Savaş
  • 关键词:Z ; order ; Document similarity ; MapReduce ; Hadoop ; Cosine similarity
  • 刊名:Cluster Computing
  • 出版年:2016
  • 出版时间:March 2016
  • 年:2016
  • 卷:19
  • 期:1
  • 页码:109-126
  • 全文大小:1,311 KB
  • 参考文献:1.Angiulli, F., Pizzuti, C.: An approximate algorithm for top-k closest pairs join query in large high dimensional data. Data Knowl. Eng. 53(3), 263–281 (2005)CrossRef MATH
    2.Apache Hadoop. http://​hadoop.​apache.​org
    3.Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pp. 918–929. VLDB Endowment (2006)
    4.Baraglia, R., De Francisci Morales, G., Lucchese, C.: Document similarity self-join with MapReduce. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 731–736 (2010). doi:10.​1109/​ICDM.​2010.​70
    5.Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 131–140. ACM, New York (2007). doi:10.​1145/​1242572.​1242591
    6.Brown, R.A.: Hadoop at home: large-scale computing at a small college. SIGCSE Bull. 41(1), 106–110 (2009). doi:10.​1145/​1539024.​1508904 CrossRef
    7.Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, p. 5. IEEE Computer Society, Washington, DC (2006). doi:10.​1109/​ICDE.​2006.​9
    8.Connor, M., Kumar, P.: Fast construction of k-nearest neighbor graphs for point clouds. IEEE Trans. Vis. Comput. Graph. 16(4), 599–608 (2010). doi:10.​1109/​TVCG.​2010.​9 CrossRef
    9.Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.​1145/​1327452.​1327492 CrossRef
    10.Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. HLT-Short ’08, pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008)
    11.Enron Dataset. http://​www.​cs.​cmu.​edu/​~.​/​enron/​
    12.Falchi, F., Perego, R., Lucchese, C., Rabitti, F., Orlando, S.: A metric cache for similarity search. In: LSDS-IR (2008)
    13.Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://​dl.​acm.​org/​citation.​cfm?​id=​1005332.​1005345
    14.Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with mapreduce. In: Du, X., Fan, W., Peng, Z., Sharaf, M.A. (eds.) APWeb. Lecture Notes in Computer Science, vol. 6612, pp. 412–423. Springer, Heidelberg (2011)
    15.Lucene. http://​lucene.​apache.​org/​
    16.Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)CrossRef MATH
    17.Phan, T.C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Cloud-I’13, p. 2 (2013)
    18.Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)
    19.Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp. 743–754. ACM, New York (2004). doi:10.​1145/​1007568.​1007652
    20.Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20:1–20:46 (2010). doi:10.​1145/​1806907.​1806912 CrossRef
    21.Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 495–506. ACM, New York (2010). doi:10.​1145/​1807167.​1807222
    22.Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pp. 131–140. ACM, New York (2008). doi:10.​1145/​1367497.​1367516
    23.Yang, B., Myung, J., Lee, S.G., Lee, D.: A mapreduce-based filtering algorithm for vector similarity join. In: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’13, pp. 71:1–71:5. ACM, New York (2013). doi:10.​1145/​2448556.​2448627
    24.Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp. 38–49. ACM, New York (2012). doi:10.​1145/​2247596.​2247602
    25.Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70(1), 60–83 (2011)CrossRef
  • 作者单位:Mahmoud Alewiwi (1)
    Cengiz Orencik (1)
    Erkay Savaş (1)

    1. Faculty of Science and Engineering, Sabanci University, Istanbul, Turkey
  • 刊物类别:Computer Science
  • 刊物主题:Processor Architectures
    Operating Systems
    Computer Communication Networks
  • 出版者:Springer Netherlands
  • ISSN:1573-7543
文摘
Document similarity has important real life applications such as finding duplicate web sites and identifying plagiarism. While the basic techniques such as k-similarity algorithms have been long known, overwhelming amount of data, being collected such as in big data setting, calls for novel algorithms to find highly similar documents in reasonably short amount of time. In particular, pairwise comparison of documents’ features, a key operation in calculating document similarity, necessitates prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set to find highly similar documents. The proposed filtering technique utilizes Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach, where the phases are near duplicate detection, common important terms and join phase. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data setting. Our experimental results on real data show that the proposed method performs better than the previous work in the literature in terms of the number of joins, and therefore, speed.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700