搜索引擎中的相似网页探测算法研究

英文题名：Research on Near-Duplicates Detection Algorithm of Search Engine
作者：郑鹏
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：搜索引擎 ; 相似网页 ; 相似度 ; 指纹
英文关键词：Search Engine ; Near-Duplicate ; Similarity ; Fingerprint
学位年度：2008
导师：卢正鼎
学科代码：081203
学位授予单位：华中科技大学
论文提交日期：2008-06-01

摘要

相似网页(Near-Duplicate Web Pages)在互联网中的大量存在,给搜索引擎带来了多方面的问题,如爬行程序反复的搜录同样内容的网页给搜索引擎的爬行程序自身及互联网都带来了沉重的负担,由此导致索引的重复与额外存储空间的消耗,并降低了搜索引擎的性能和用户体验。因此,若有效的相似网页探测算法能够去除大量的重复网页,则可以大大减轻搜索引擎的负担和提升搜索引擎的性能与用户体验。
     回顾了搜索引擎的发展历程、分析了搜索引擎的工作原理,并着重研究了搜索引擎的相似网页探测算法的现状,在分析现有算法的优势与不足后,概括了准确有效的相似网页探测算法所应该具备的两个基本条件。在经典的Simhash指纹算法和Shingle算法的的基础上,针对这两个算法的不足,以上述两个基本条件为基准,给出了多种改进方案。针对Simhash算法所缺乏的单词位置信息等,将单词位置信息融入单词权重;针对Shingle算法缺乏词频特征,于是将全部Shingle的指纹进行叠加。为了进一步改善算法性能,考虑两个算法的特点,特将两个算法进行集成,同时进一步挖掘网页内容特征,将词性、词频、单词位置等融入Shingle权重。另外由于Shingle数量级太大,而给出根据单词指纹叠加生成Shingle指纹的方法。
     改进主要集中于提取更加完善的网页内容特征,从而提高网页相似度计算的准确度。根据这些算法的改进,以Manku等人的指纹探测算法为基础,构建了一个原型系统,以实验分析验证了算法改进的有效性,并将该原型系统进行重构加入到一个搜索引擎爬行系统中,实现了有效的相似网页在线探测。
There are a great number of duplicate or near-duplicate web pages in the Internet, and a lot of problems arise from the duplicates, such as it is a great burden for both crawlers and the Internet, and it leads to duplicate indecies and extra space to store files and a great effect on the performance of search engines. Some effective duplicate page detection algorithms can discover and remove most of the duplicates, reduce the burden of the crawlers of search engines and lead to better performance.
     This paper has reviewed the development of search engines, analised the principles of search engines, and then research the situation of different duplicate detection algorithms. On the basis of analysis of the advantages and disadvantages of current duplicate detection algorithms, it came to a conclusion that an excellent duplicate detection technique has to satisfy two necessary conditions. Simhash fingerprinting and Shingling are two classic algorithms of duplicate detection. On the basis of complete analysis of these two algorithms, several improved solutions were proposed, according to the two necessary conditions. In order to get more web page features, the word weight has taken the word position into account, had all the shingle to generate the final fingerprint of a web page, and even integrated the two algorithms to improve the performance further, on the basis of their characteristics. Since the huge magnitude of shingles, a new method, which based on all the words included in the shingle, was proposed to generate the fingerprint of a shingle.
     The proposed improved solutions is based on the web page feature selection, in order to improve the accuracy of the computation of the similarity between two duplicates. Then a prototype was developed on the basis of Manku’s algorithm, which is aimed to detect duplicate web pages efficiently, to confirm the effectiveness and efficiency of the improved solutions. Finally, a crawler system having the prototype as a subsystem was developed to detect duplicate web pages online.

引文

[1]唐铭杰.论搜索引擎的发展概况及发展趋势[J ] .情报杂志, 2001 , (5) : 70-71
    [2] A.Arasu, J.Cho, H.Garcia-Molina et al. Searching the Web. ACM Trans. on Internet Technology 2001,1(1):2-43
    [3] Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998,30(1-7):107-117
    [4] Fetterly D, Manasse M, Najork M. Detecting phrase-level duplication on the World Wide Web. In: Proc. of the SIGIR 2005. New York: ACM Press, 2005. 170?177
    [5] Gurmeet Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling. May. 2007 Proceedings of the 16th international conference on World Wide Web
    [6] Xiubo Geng, Tie-Yan Liu, Tao Qin, PHang Li. Feature selection for ranking. Jul. 2007 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
    [7] J. Dean and M. Henzinger. Finding related pages in the World Wide Web. Computer Networks, 31(11--16):1467--1479, 1999
    [8] M. R. Henzinger. Finding Near-Duplicate Web Pages: a Large Scale Evaluation of Algorithms. In SIGIR 2006,pages 284-291, 2006
    [9] Broder A, Glassman S, Manasse M, Zweig G. Syntactic clustering of the Web. In: Proc. of the 6th Int’l World Wide Web Conf. New York: ACM, 1997. 1157?1166
    [10] Y. B. Cao, J. Xu, T. Y. Liu, H. Li, Y. L. Huang, and H. W. Hon. Adapting Ranking SVM to Document Retrieval. In Proceedings of SIGIR 2006, p.186-193, 2006
    [11]张刚,刘挺,郑实福,车万翔,李生.大规模网页快速去重算法.中国中文信息学会二十周年学术会议论文集. 2001.11
    [12] Bin Cao, Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen. Feature selection in a kernel space. Jun. 2007 Proceedings of the 24th international conference on Machine learning
    [13] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC’02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380–388, New York, NY, USA, 2002. ACM Press
    [14] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB} p. 518--529, 1999
    [15] Robert R. Korfhage. Information Storage and Retrieval. New York: John Wiley&Sons, Inc, 1997:194
    [16] Baeza-Yates R and Ribeiro-Neto B (1999) Modern Information Retrieval. ACM Press, New York
    [17] A. Broder. Some applications of rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143-152. Springer-Verlag, 1993
    [18] Gurmeet Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling. May. 2007 Proceedings of the 16th international conference on World Wide Web
    [19] PRajeev Motwani, PAssaf Naor, Rina Panigrahi. Lower bounds on locality sensitive hashing. Jun. 2006 Proceedings of the twenty-second annual symposium on Computational geometry
    [20] M. O. Rabin. Fingerprinting by random polynomials. Technical Report Report TR-15-81, Center for Research in Computing Techonlogy, Harvard University, 1981
    [21] Calvin Chan and Hahua Lu. Fingerprinting Using Polynomial (Rabin's Method). Faculty of Science, University of Alberta, CMPUT690 Term Project, December 2001
    [22] A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, p. 21--29. IEEE Computer Society, 1997
    [23] Dennis Fetterly, Mark Manasse, Marc Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. Web Congress, 2003. Proceedings. First Latin American, 2003 - ieeexplore.ieee.org
    [24] Cho J, Garcia-Molina H, Page L. Efficient crawling through url ordering. Computer Networks, 1998,30(1-7): 161-172
    [25] Broder AZ, Najork M, Wiener JL. Efficient URL caching for World Wide Web crawling. In: Proc. of the 12th Int’l World Wide Web Conf. New York: ACM Press, 2003. 679?689
    [26] PQingzhao Tan, Prasenjit Mitra, C. Lee Giles. Designing clustering-based webcrawling policies for search engine crawlers. Nov. 2007 Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
    [27] Fetterly D, Manasse M, Najork M, Wiener J. A large-scale study of the evolution of Web pages. In: Proc. of the 12th Int’l World Wide Web Conf. New York: ACM Press, 2003. 669?678
    [28]彭涛.面向专业搜索引擎的主题爬行技术研究.吉林大学博士学位论文,2007
    [29]刘红,邵晓良,胡吉兵.基于页面内容和链接结构的超链接主题预测算法.现代图书情报技术. 2005,(5):41~45
    [30]王晓宇,周傲英,万维网的链接结构分析及其应用综述.软件学报.2003,14(10):1768~1780
    [31] Xue GR, Yang Q, Zeng HJ, Yu Y, Chen Z. Exploiting the hierarchical structure for link analysis. In: Ricardo A, Ziviani N, Marchionini G, et al., eds. Proc. of the 28th Annual Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval. New York: ACM Press, 2005. 186?193
    [32] Donato, D., Leonardi, S., Millozzi, S., and Tsaparas, P. 2005. Mining the inner structure of the Web graph. In Proceedings of the Eighth International Workshop on the Web and Databases (WebDB). 145--150
    [33]熊冬明.汉语自动分词和中文人名识别技术研究.浙江大学硕士学位论文, 2006
    [34]王继成,潘金贵,张福炎. Web文本挖掘技术研究.计算机研究与发展. 2000. Vol.37, No.5, P.513-520
    [35] C. A. Michel Klein, Ubbo Visser. Semantic Web Challenge 2004. Journal of Web Semantics, 2005, 3(2-3): 209~210
    [36] QIN, T., LIU, Y., ZHANG, X., CHEN, Z. and MA, W., A Study of Relevance Propagation for Web Search, In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil, SIGIR’05
    [37] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, S., Tomkins, A., and Wiener, J. 2000. Graph structure in the Web. Comput. Netw. 33, 1-6, 309--320
    [38] Justin Zobel, Alistair Moffat .Inverted Files for Text Search Engines[J].ACMComputing Surveys, 2006, 38(2)
    [39]杨传耀.中文信息检索索引模型及相关技术研究.复旦大学博士学位论文,2007
    [40] Caroline M. Eastman, Bernard J. Jansen. Coverage relevance and ranking The impact of query operators on Web search engine results. ACM Transactions on Information Systems (TOIS), 2003 - portal.acm.org
    [41] Anh, V. N. and Moffat, A. 2005. Inverted index compression using word-aligned binary codes. Kluwer International Journal of Information Retrieval 8, 1, 151--166
    [42] Scholer F, Williams HE, Yiannis J and Zobel J (2002) Compression of inverted indexes for fast query evaluation. In: Beaulieu M, Baeza-Yates R, Myaeng SH and Jarvelin K, Eds., Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August, Tampere, Finland, ACM Press, New York, p. 222-229

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700