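Research on Design and Implementation of the Extensible Distributed Vertical Search Engines

Original title (Chinese): 可扩展分布式垂直搜索引擎设计与实现研究
Author: Li Bin (黎斌)
Degree level: Master's
Discipline: Electronics and Communication Engineering
Keywords: vertical search engines ; distributed ; focused crawler ; fuzzy classification
Degree year: 2008
Advisor: Xian Ming (鲜明)
Discipline code: 081001
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2008-04-01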

Abstract

It is well known that the Internet holds a large volume of hidden web resources that, for many reasons, users cannot easily discover. Because these hidden resources surpass ordinary web resources in both quantity and quality, research on discovering them is increasingly important. General-purpose search engines cannot capture this information comprehensively because their crawl depth is limited. Many websites also restrict access and block ordinary crawlers, and the page parsing of general-purpose search engines cannot adapt to the widely varying forms of web pages. By contrast, purpose-built vertical search engines achieve better results in mining hidden information. A vertical search engine uses crawling strategies and parsing methods tailored to the characteristics of its target resources, so it can extract highly precise web information and give users carefully filtered information within a particular domain.
This thesis studies the technologies underlying search engines. After analyzing the crawling strategies of focused crawlers, it proposes a crawling method, based on the tree-shaped network structure, for the resources of foreign military forum websites. Forums typically conform strictly to a tree structure in how their pages are linked, so a link-selection mechanism can be added to make the crawler fetch only the post pages that hold information. For information classification, forum posts contain a large amount of useless content (replies, malicious posts), and statistics show that such content generally shares two characteristics: few words and few paragraphs. Based on this observation, the thesis proposes an information classification method based on fuzzy pattern recognition: the word count and paragraph count of each post are extracted as influence factors, their degree of influence and weights are determined by sample analysis, and the classification membership function is derived from the shape of an S-shaped function, which effectively improves classification quality. For indexing and retrieval, the thesis studies the indexing method of Lucene, the indexing software commonly used in vertical search engines, and proposes a result caching method for user queries, implemented with OSCache, which greatly improves retrieval response time. Building on this overall study, a military information search engine containing part of the content of the Military.com forum was built in Java, implementing the preceding results. Finally, the thesis studies the system structures and operating mechanisms of distributed search engines, proposes a system framework for a distributed vertical search engine based on the distributed meta-search engine system, and proposes a distributed implementation method based on the CORBA model.
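
The fuzzy classification idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual membership formula, weights, or thresholds, so everything numeric below is a hypothetical placeholder: the sketch substitutes the standard Zadeh S-function for the S-shaped membership function, and the names `s_membership`, `usefulness`, and `is_useful`, the weights 0.6 and 0.4, the breakpoints, and the 0.5 cut-off are all assumptions chosen for illustration. In the thesis these values are determined by sample analysis.

```python
def s_membership(x, a, b):
    """Standard Zadeh S-function: 0 at or below a, 1 at or above b, smooth in between.

    Assumption: the thesis derives its own S-shaped formula; this is a stand-in.
    """
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 2.0 * ((x - a) / (b - a)) ** 2
    return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2


def usefulness(text, w_words=0.6, w_paras=0.4,
               word_range=(20, 200), para_range=(1, 5)):
    """Weighted membership degree of a post in the 'useful' class.

    The two factors are the ones named in the abstract: word count and
    paragraph count. Weights and ranges are hypothetical placeholders.
    """
    words = len(text.split())
    # Paragraphs are taken to be blocks separated by blank lines (an assumption).
    paras = len([p for p in text.split("\n\n") if p.strip()])
    return (w_words * s_membership(words, *word_range)
            + w_paras * s_membership(paras, *para_range))


def is_useful(text, threshold=0.5):
    """Classify a post as useful when its membership degree reaches the threshold."""
    return usefulness(text) >= threshold
```

With these placeholder settings, a one-word reply scores 0 and is filtered out, while a post of several substantial paragraphs scores well above the cut-off. Real parameter values would have to be fitted to labeled forum posts.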
