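Research on Crawling Technology for Vertical Search Engines

Original title: 垂直搜索引擎的抓取技术研究
English title: Crawl Technology Research in Vertical Search Engine
Author: Liu Chi (刘迟)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Vertical Search; Extensible; Hidden Web; Time-effectiveness
Degree year: 2008
Advisor: Chen Gang (陈刚)
Discipline code: 081203
Degree-granting institution: Zhejiang University
Thesis submission date: 2008-05-15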

Abstract

A vertical search engine provides valuable information and related services for a specific industry or domain. It is a subdivision and extension of the general-purpose search engine, and it offers a new kind of information service tailored to how professional users work. This thesis studies crawling technology for vertical search engines, focusing on three problems encountered in vertical search crawling: Hidden Web crawling, time-effectiveness, and performance and efficiency.
     The thesis first introduces the architecture of a vertical search crawl system and proposes a crawl system framework that is distributed and built on extensible plug-ins; both the distributed design and the plug-in model make future extension easier. It then discusses three problems in Hidden Web crawling and, for the problem of eliminating duplicate results, proposes a self-learning method for detecting duplicate Chinese addresses. Next, to address the time-effectiveness problem of vertical search, it proposes a query-driven real-time crawling approach. It also discusses and compares the crawl modes, crawl strategies, and crawl frequencies that affect a vertical search crawl system; the system described here adopts the steady mode, an in-place update strategy, and a combination of real-time crawling and fixed-frequency crawling.
     Finally, experiments on the duplicate-detection and time-effectiveness problems show that the proposed methods achieve better results and a better user experience in practice.

