Research on Result-Ranking Strategies for an Agricultural Search Engine Based on Nutch

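English title (as submitted): Researching on the Sorting Strategy of Agricultural Search Engine Based on Nutch
Author: 王春花 (Wang Chunhua)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords (translated from Chinese): depth-2 link analysis ; compensation for new resources ; agricultural relevance of documents ; scale normalization ; curve-comparison evaluation method
Keywords (English, as submitted): 2 degrees deep ; compensation for new resources ; the relevance of document to agriculture ; the scale of Unification ; evaluation method based on compared curve
Degree year: 2010
Advisor: 朱俊平 (Zhu Junping)
Discipline code: 081202
Degree-granting institution: 西北农林科技大学 (Northwest A&F University)
Submission date: 2010-05-01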

Abstract

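A search engine is a technology for locating information on the Internet quickly and effectively. The component closest to the user is result ranking, because its output directly shapes the user experience; to a certain extent, good ranking makes a good search engine. As computers have spread through rural China, the volume of agricultural information has grown sharply, and agricultural search engines have become an active research topic. This study analyzes search-engine result-ranking strategies in depth, improves the traditional PageRank algorithm, and applies the improved algorithm to an agricultural search engine built on Nutch.
     The thesis first analyzes the workflow of a search engine and examines the factors in page crawling, index construction, and query execution that influence ranking. Second, it analyzes the ranking process and identifies the key factors that affect ranking and their underlying principles. Third, it analyzes classic ranking algorithms and how they are implemented. It then analyzes the open-source search engine Nutch, studies its ranking algorithm, and improves the algorithm in two respects: authority, based on hyperlink analysis, and relevance, based on content analysis. Finally, it builds an agricultural search engine on top of Nutch by controlling the seed URLs used for crawling, and applies the proposed improved ranking algorithm to it.
     The experiments give the concrete procedure for building a Nutch-based agricultural search engine. The improved algorithm is evaluated with the widely used P@n method and with a first-page repetition rate. A quantitative analysis of the results shows that the improved algorithm raises user satisfaction and the first-page repetition rate by about 7% relative to the original algorithm.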
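The P@n measure mentioned above is the fraction of the top n returned results that are judged relevant. The following is a minimal sketch of that standard definition, not the evaluation code used in the thesis; the function name `precision_at_n` and the sample data are illustrative assumptions.

```python
def precision_at_n(ranked_results, relevant, n):
    """Return P@n: the share of the top-n ranked results that are relevant.

    ranked_results: list of document ids, best first.
    relevant: set of document ids judged relevant for the query.
    n: cutoff rank (assumed positive).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_results[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    # Standard P@n divides by n even if fewer than n results were returned.
    return hits / n


# Hypothetical query: five ranked results, three of them judged relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
p_at_3 = precision_at_n(ranked, relevant, 3)  # 2 of the top 3 are relevant
p_at_5 = precision_at_n(ranked, relevant, 5)  # 3 of the top 5 are relevant
```

Comparing P@n for the same queries before and after a ranking change gives one number per query and cutoff, which is the kind of quantitative comparison the abstract describes.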
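     The main contribution is an improvement to the authority component of PageRank's hyperlink analysis, in two parts: an implementation of non-uniform weight transfer from parent pages based on depth-2 link analysis, and a compensation strategy for newly created and isolated resources. For both innovations the thesis presents the basic idea, gives the calculation formulas, and discusses them briefly. For content-based relevance, it introduces the concept of an agricultural topic vector and a method for constructing and computing it, and gives a formula for a document's agricultural relevance. Finally, these are combined into an algorithm that incorporates content analysis and transfers weight non-uniformly according to the relevance between parent and child pages.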
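The abstract does not reproduce the thesis's formulas, so the following is only a generic sketch of the idea of non-uniform weight transfer: instead of splitting a parent page's rank equally among its out-links, as classic PageRank does, each child receives a share proportional to a link weight. The function `weighted_pagerank`, the `weight` callback, the damping value 0.85, and the toy graph are all assumptions made for illustration; the depth-2 analysis and the compensation for new or isolated resources are not modeled here.

```python
def weighted_pagerank(links, weight, damping=0.85, iterations=50):
    """PageRank variant in which a parent splits its rank non-uniformly.

    links: dict mapping each page to a list of pages it links to.
    weight: function (parent, child) -> positive float, e.g. a relevance score.
    """
    pages = set(links)
    for children in links.values():
        pages.update(children)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for parent in pages:
            children = links.get(parent, [])
            if not children:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[parent] / n
                for p in pages:
                    new_rank[p] += share
                continue
            total = sum(weight(parent, c) for c in children)
            for c in children:
                # Each child gets a share proportional to its link weight.
                new_rank[c] += damping * rank[parent] * weight(parent, c) / total
        rank = new_rank
    return rank


# Toy graph: "hub" links to two pages, and the link to "farm" is weighted 3x.
links = {"hub": ["farm", "news"], "farm": ["hub"], "news": ["hub"]}
link_weights = {("hub", "farm"): 3.0, ("hub", "news"): 1.0}
rank = weighted_pagerank(links, lambda p, c: link_weights.get((p, c), 1.0))
```

With equal weights this reduces to ordinary PageRank; with the weights above, "farm" ends up ranked higher than "news" even though both have exactly one in-link from the same parent.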

Abstract (English)

A search engine is a technology for locating information on the Internet quickly and effectively. The part most closely tied to the user is the ranking of search results, since the ranking determines what the user actually sees; to some extent, good ranking makes a good search engine. With the spread of computers in rural China and the growth of agricultural information, agricultural search engines have become a popular research topic. The aim of this research is to analyze search-engine ranking strategies in depth, to improve the traditional PageRank algorithm, and to apply it to a Nutch-based agricultural search engine.
     The main work begins with an analysis of the search-engine workflow and of the factors in web crawling, indexing, retrieval, and other stages that affect ranking. The ranking process is then analyzed to identify the critical factors that affect ranking and their basic principles. Classic ranking algorithms and their implementation are studied, as is Nutch, an open-source search engine, whose ranking algorithm is improved in two respects: authority based on hyperlink analysis, and content relevance. Finally, an agricultural search engine is built on Nutch by controlling the entry addresses used for crawling, and it is improved with the proposed ranking algorithm.
     The experiments set out the specific steps for building the Nutch-based agricultural search engine. The improved algorithm is evaluated with the general P@n method and the first-page repetition rate. The experiments assess the algorithm quantitatively and show that user satisfaction and the first-page repetition rate are about 7% higher than with the original algorithm.
     The main achievement of this thesis is an improvement to the hyperlink-analysis authority computation in PageRank. It covers two aspects: non-uniform transfer of a parent page's weight based on depth-2 hyperlink analysis, and a compensation strategy for new and isolated resources. The thesis explains the basic idea behind each, gives specific formulas, and analyzes them briefly. For content-based relevance, it introduces the concept of an agricultural topic vector, together with methods for computing and constructing it, and gives a formula for a document's degree of agricultural relevance. Finally, these are combined into an algorithm that incorporates content analysis and transfers weight non-uniformly according to parent-child page relevance.
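The abstract names an agricultural topic vector and a document relevance formula without stating either. One common way to score a document against a topic vector is cosine similarity between term-weight vectors; the sketch below uses that as a stand-in and should not be read as the thesis's formula. The function `cosine_relevance`, the topic terms, and their weights are hypothetical.

```python
import math
from collections import Counter


def cosine_relevance(doc_terms, topic_vector):
    """Cosine similarity between a document's term counts and a topic vector.

    doc_terms: list of (already tokenized) terms in the document.
    topic_vector: dict mapping topic terms to non-negative weights.
    Returns a value in [0, 1]; 0.0 if either vector is empty.
    """
    doc_vector = Counter(doc_terms)
    dot = sum(count * topic_vector.get(term, 0.0)
              for term, count in doc_vector.items())
    doc_norm = math.sqrt(sum(c * c for c in doc_vector.values()))
    topic_norm = math.sqrt(sum(w * w for w in topic_vector.values()))
    if doc_norm == 0 or topic_norm == 0:
        return 0.0
    return dot / (doc_norm * topic_norm)


# Hypothetical two-term agricultural topic vector.
topic = {"wheat": 1.0, "soil": 1.0}
on_topic = cosine_relevance(["wheat", "soil"], topic)
partly_on_topic = cosine_relevance(["wheat", "price"], topic)
off_topic = cosine_relevance(["football"], topic)
```

A score of this kind could serve as the link weight when rank is transferred from parent to child pages, which is one way to read the combined algorithm the abstract describes.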

