The Optimized Design of Parallel Search System Based on Nutch

Original title: 基于Nutch的并行搜索系统的优化设计
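Author: Chen Cheqian (陈车前)
Thesis level: Master's
Discipline: Computer Systems Architecture
Keywords: Nutch; Parallel Search; Index Partition; Cache; Redundant Backup
Degree year: 2011
Advisor: Dong Shoubin (董守斌)
Discipline code: 081201
Degree-granting institution: South China University of Technology
Thesis submission date: 2011-05-25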

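Abstract

The open-source Nutch system has greatly advanced the development of search engines for enterprise, campus, and even personal networks. It provides the basic functions of a complete commercial search engine, including a crawler, an indexer, and a searcher, and users can build dedicated search systems on top of it to suit their needs. To handle larger volumes of data, Nutch offers a multi-machine search solution. It relies on Hadoop for distributed crawling, which greatly increases crawl speed, but its support for indexing and searching is weaker and requires considerable manual intervention. Moreover, with the rapid growth of the Internet and the massive increase in web page data, performance becomes a bottleneck even for multi-machine search. This thesis therefore proposes an optimized parallel search design based on Nutch, offering a high-performance search solution for enterprise-level retrieval applications.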
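     First, this thesis uses Shell scripts and configuration files to control the system's workflow, eliminating most manual operations. This gives Nutch a complete multi-machine solution for indexing and searching and improves the system's automated management and maintainability.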
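     Second, this thesis proposes a simple and efficient index partitioning method: URL-based index partitioning. The hash value of a page's URL is taken modulo the total number of compute nodes to obtain the number of the node to which the page is assigned. On the one hand, the URL alone uniquely determines the node that holds a page, so the method handles dynamic index updates effectively. On the other hand, because URL hash values are highly random, the modulo operation conforms to Bernoulli's law of large numbers, so pages are distributed very evenly across the compute nodes.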
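The URL-based partitioning rule can be sketched as follows. The abstract does not say which hash function the thesis uses, so MD5 is assumed here only because it yields the same value on every machine and in every run (Python's built-in hash() is salted per process for strings). The function name node_for_url, the node count of 4, and the example.edu URLs are illustrative and not taken from the thesis.

```python
import hashlib
from collections import Counter


def node_for_url(url: str, num_nodes: int) -> int:
    """Return the node number in [0, num_nodes) that a URL is assigned to."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # Assumption: MD5 stands in for the unspecified hash function.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take it modulo N.
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical URLs, used only to look at how evenly pages spread over 4 nodes.
urls = [f"http://example.edu/page/{i}.html" for i in range(10000)]
counts = Counter(node_for_url(u, 4) for u in urls)
for node in sorted(counts):
    print(node, counts[node])
```

Because the node number depends only on the URL, a re-crawled page is routed to the same node that already holds its old index entry, which is the property that makes dynamic updates straightforward.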
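     Third, this thesis implements an efficient "static + dynamic" caching mechanism. The static cache is built from user search logs: a set of popular query terms is identified, and their search results are stored in the cache. This thesis proposes selecting popular query terms by their public attention and their stable popularity. The dynamic cache stores current search results under a replacement policy. Each kind of cache has strengths and weaknesses, but these are complementary, so combining them raises the cache hit rate more effectively. Experiments show that with the cache in place, the system effectively reduces the number of times the full search process is executed, which considerably improves performance.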
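A minimal sketch of a combined static and dynamic result cache is given below, under stated assumptions. The abstract does not name the replacement policy, so LRU is assumed for the dynamic part. The static entries are passed in already selected, because the thesis's selection criteria (public attention and stable popularity) are not specified in enough detail here to reproduce. The class name TwoLevelResultCache, the full_search stand-in, and the sample queries are hypothetical.

```python
from collections import OrderedDict


class TwoLevelResultCache:
    """Static cache of pre-selected hot queries plus an LRU dynamic cache."""

    def __init__(self, static_entries, dynamic_capacity):
        self._static = dict(static_entries)  # never evicted after construction
        self._dynamic = OrderedDict()        # least recently used entry first
        self._capacity = dynamic_capacity
        self.hits = 0
        self.misses = 0

    def get(self, query, search_fn):
        """Return cached results, running search_fn(query) only on a miss."""
        if query in self._static:
            self.hits += 1
            return self._static[query]
        if query in self._dynamic:
            self.hits += 1
            self._dynamic.move_to_end(query)  # mark as most recently used
            return self._dynamic[query]
        self.misses += 1
        results = search_fn(query)
        if self._capacity > 0:
            self._dynamic[query] = results
            if len(self._dynamic) > self._capacity:
                self._dynamic.popitem(last=False)  # evict least recently used
        return results


calls = []


def full_search(query):
    """Hypothetical stand-in for the complete search process."""
    calls.append(query)
    return [f"result for {query}"]


cache = TwoLevelResultCache({"nutch": ["cached nutch result"]}, dynamic_capacity=2)
for q in ["nutch", "hadoop", "hadoop", "lucene", "solr", "hadoop"]:
    cache.get(q, full_search)
print(cache.hits, cache.misses, calls)
```

In this run the static entry absorbs the hot query without ever occupying a dynamic slot, while the small LRU part catches the immediate repeat of "hadoop" and later evicts it, which illustrates how the two parts cover different access patterns.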
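     Finally, to improve system stability, this thesis proposes a simple redundant backup mechanism, "one-level cyclic redundant backup": each node backs up the index data of the preceding node, and if the preceding node crashes unexpectedly, the search service on the next node is started in its place. Experiments show that when some nodes have crashed the system still returns correct search results; performance drops slightly, but the accuracy of the search results is preserved.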
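The backup assignment can be sketched as below. Numbering nodes from 0 and having node 0 back up the last node are assumptions inferred from the word "cyclic"; the abstract does not spell out either. The function names backup_node and serving_plan are hypothetical.

```python
def backup_node(primary: int, num_nodes: int) -> int:
    """Node that holds the backup copy of `primary`'s index (the next node, cyclically)."""
    return (primary + 1) % num_nodes


def serving_plan(num_nodes: int, failed: set) -> dict:
    """Map each index partition to the node that serves it, or None if unavailable.

    Partition i lives on node i and is backed up on node (i + 1) % num_nodes.
    With only one level of backup, a partition is lost when both of those nodes fail.
    """
    plan = {}
    for partition in range(num_nodes):
        if partition not in failed:
            plan[partition] = partition
        else:
            standby = backup_node(partition, num_nodes)
            plan[partition] = standby if standby not in failed else None
    return plan


print(serving_plan(4, {1}))     # node 2 also serves partition 1
print(serving_plan(4, {1, 2}))  # partition 1 has lost both of its copies
```

A node that takes over for its predecessor serves two partitions at once, which is consistent with the slight performance drop reported in the experiments.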
The emergence of the open-source Nutch system has greatly promoted the development of enterprise, campus, and even personal web search engines. Nutch has the complete basic functions of a commercial search engine, including a crawler, an indexer, and a searcher, and users can build their own search systems on it. To increase the amount of searchable data, Nutch provides a multi-machine search solution. It uses Hadoop to provide a powerful distributed crawl function, which greatly improves crawling speed, but it does not provide enough support for the indexer and searcher, so human intervention is needed to run those parts on multiple machines. As the Internet develops and the number of web pages grows rapidly, performance will still become a bottleneck even in the multi-machine version of Nutch. In view of this, this paper proposes a Nutch-based optimized design for parallel search, which provides a high-performance solution for the multi-machine version of Nutch.
     First, this paper uses Shell scripts and configuration files to control the running of the system, which eliminates most manual operations and so improves the automated management and maintainability of the system.
     Second, this paper proposes an efficient index partitioning method that partitions by URL. Because the URL is a unique feature of a web page, and the hash code of a URL is very random, this method not only solves the problem of dynamic index updates but also divides web pages evenly among the nodes.
     Third, this paper implements a "static + dynamic" caching method, which effectively reduces the number of times the full search procedure is executed and so greatly improves the performance of the system. Here, a static cache is one whose content does not change, and a dynamic cache is one whose content changes frequently under a cache replacement policy.
     Finally, to improve the stability of the system, this paper presents a simple redundant backup method called "one-level cyclic redundancy". That is, each node backs up the index of the preceding node; when the preceding node suddenly crashes, the next node automatically starts up to replace it, which ensures the accuracy of the search results.

