Research on a Mass Text Similarity Detection Method Based on the Simhash Algorithm
  • Title: Research on Mass Text Similarity Detection Based on Simhash Algorithm (original: 基于Simhash算法的海量文本相似性检测方法研究)
  • Authors: Ren Minshan (任民山); Cai Hongxia (蔡红霞)
  • Keywords: similarity calculation; Simhash algorithm; TF-IDF technique; Hamming distance; fingerprint value
  • Journal: Metrology & Measurement Technique (Chinese journal code: JLYS)
  • Institution: Key Laboratory of Intelligent Manufacturing and Robotics, Shanghai University
  • Publication date: 2018-04-30
  • Publisher: Metrology & Measurement Technique (计量与测试技术)
  • Year: 2018
  • Issue: v.45; No.311
  • Language: Chinese
  • Page: JLYS201804025
  • Page count: 3
  • CN: 04
  • ISSN: 51-1412/TB
  • Classification number: 83-85
Abstract
To recommend documents with semantically similar content more accurately during knowledge-document search, this paper studies document similarity calculation based on the Simhash algorithm. It introduces ICTCLAS word segmentation, uses TF-IDF as the main method for computing term weights, improves the original Simhash algorithm accordingly, and measures the similarity of Simhash fingerprint values by Hamming distance. Experiments on process (work-procedure) data from the civil aircraft development domain show that the improved scheme performs better, overall outperforming both the Shingle algorithm and the original Simhash algorithm, and can accurately detect similarity in large-scale document collections.
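The pipeline outlined in the abstract (segment the text, weight terms with TF-IDF, fold the weighted term hashes into a Simhash fingerprint, compare fingerprints by Hamming distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the record does not describe in detail. Whitespace tokenization stands in for ICTCLAS segmentation, and the smoothed IDF formula, the use of MD5 as the per-term hash, the 64-bit fingerprint width, and all function names are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus_tokens):
    """Return {term: tf * idf} for one tokenized document.

    The smoothed idf, log((1 + N) / (1 + df)) + 1, is an assumed variant;
    it keeps every weight strictly positive.
    """
    n_docs = len(corpus_tokens)
    # Document frequency: number of corpus documents containing each term.
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    term_counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {
        term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in term_counts.items()
    }


def simhash(weights, bits=64):
    """Fold weighted term hashes into one `bits`-wide fingerprint.

    `bits` must be a multiple of 8 and at most 128, because the per-term
    hash is taken from the leading bytes of an MD5 digest (an assumption).
    """
    totals = [0.0] * bits
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            if (term_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical, pre-tokenized process descriptions (not data from the paper).
corpus = [
    "drill rivet holes in wing panel".split(),
    "drill rivet holes in wing skin panel".split(),
    "inspect hydraulic pump pressure".split(),
]
fingerprints = [simhash(tf_idf_weights(doc, corpus)) for doc in corpus]
print(hamming_distance(fingerprints[0], fingerprints[1]))
print(hamming_distance(fingerprints[0], fingerprints[2]))
```

In a sketch like this, two documents would typically be flagged as similar when the Hamming distance between their fingerprints falls below a chosen threshold. The record does not state the threshold used in the paper, so any value would have to be tuned on the data at hand.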
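References
[1] Wu H C, Luk R W P, Wong K F, et al. Interpreting TF-IDF term weights as making relevance decisions[J]. ACM Transactions on Information Systems, 2008, 26(3): 55-59.
[2] Manber U. Finding Similar Files in a Large File System[C]//USENIX Winter Technical Conference. 1994: 1-10.
[3] Chen Chunling, Chen Lin, Xiong Jing, et al. Research and improvement of data deduplication technology based on the Simhash algorithm[J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2016, 36(3): 85-91. (in Chinese)
[4] Liu Keqiang. Analysis and use of the 2009 shared edition of ICTCLAS[J]. Kejiao Wenhui (科教文汇), 2009(22): 271. (in Chinese)
[5] Zhang Guangqing, Ge Weiyi, He Chenglong. A fast search optimization method for massive similar documents based on Simhash[J]. Command Information System and Technology, 2015, 6(2): 61-65. (in Chinese)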
