Research and Design of an Intelligent Vertical Search Engine
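Abstract
With the rapid development of the Internet, the information and resources on the Web keep expanding. Faced with this mass of information, how to obtain the needed resources faster and better has become an increasing concern. The result pages returned by general-purpose search engines contain many "noise" pages, so users must pick out the topics they care about by hand. Vertical search engines offer faster, more specialized, and more precise retrieval of network resources.
     A vertical search engine aims to build a repository of Internet information resources for a particular subject area or discipline. It intelligently collects from the Internet the information resources that match a configured subject or meet a discipline's needs. Because it targets one specific topic, it can provide a more focused and more specialized search service. Building on a study of the key technologies of vertical search engines, this thesis studies and designs the focused crawling module, the indexing module, and the retrieval module of a vertical search engine, and finally implements a prototype vertical search engine system. The main work is as follows:
     ① To address "topic drift", a pressing problem for current vertical search engines, this thesis proposes an improved focused crawling model. It mainly comprises a feedback-based topic knowledge base, a topic determination model, and a link analysis model. By continually extracting topic keywords from the topic web page database and feeding them back, the topic knowledge base is enriched and refined, which gives it a degree of learning and adaptive ability. Taking the weights of different HTML tags into account, an improved vector space model algorithm determines the topic similarity of a web page, improving the effectiveness and accuracy of topic determination. Following the idea of the Shark algorithm, HTML documents are parsed into a DOM tree structure and a link context threshold is set, yielding a DOM-based model that determines link topic similarity from link context. This model judges the topic similarity of a URL more reliably and guides the direction of the focused crawl.
     ② Building on a study of the basic principles of full-text retrieval and the organization of inverted indexes, and combining character-based indexing, word-based indexing, and the features of topic web pages, this thesis proposes a hybrid index model based on the topic knowledge base, which improves the efficiency and accuracy of index construction. It also designs the workflow of a searcher based on the hybrid index and, in combination with the vector space model, analyzes and discusses the ranking of retrieval results.
     ③ Finally, a prototype vertical search engine system for the "hardware" (五金) domain is implemented on the Nutch framework. Experimental testing of the prototype shows that the system achieves good precision, is adaptive, and exhibits a degree of intelligence. It alleviates the "topic drift" problem to some extent and largely achieves the research goals of this thesis, while also providing a theoretical and experimental basis for follow-up research.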
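The tag-weighted vector space model mentioned in item ① can be illustrated with a minimal sketch. The abstract does not give the actual tag weights, tokenizer, or decision threshold, so `TAG_WEIGHTS`, `tokenize`, and the `threshold` default below are all hypothetical placeholders. The sketch only shows the general idea: a term found in a heavily weighted tag such as the title contributes more to the page vector before cosine similarity is computed against a topic vector.

```python
import math
import re
from collections import Counter

# Hypothetical tag weights; the thesis does not list its actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5, "body": 1.0}


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real Chinese word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_term_vector(fields):
    """Build a term vector in which each occurrence counts by its tag weight.

    `fields` maps a tag name to the text found inside that tag.
    """
    vector = Counter()
    for tag, text in fields.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags get a neutral weight
        for term in tokenize(text):
            vector[term] += weight
    return vector


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(fields, topic_vector, threshold=0.3):
    """Treat a page as on-topic when its similarity reaches the assumed threshold."""
    return cosine_similarity(weighted_term_vector(fields), topic_vector) >= threshold
```

With a topic vector such as `{"hardware": 1.0, "tools": 1.0}`, a page whose title contains "hardware" scores higher than one that mentions the word only in its body, which is the effect the tag weighting is meant to produce.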
With the rapid development of the Internet, the resources on the Web keep expanding. Facing this mass of information, more and more people are concerned with how to access resources better and faster. The result pages of general search engines contain many "noise" pages, so users have to pick out what they need by hand. Vertical search engines provide faster, more specialized, and more accurate retrieval of network resources.
     A vertical search engine collects Internet information resources that match specific topics and can therefore provide a more specialized search service. This thesis designs a prototype vertical search engine system, including a focused crawling model, an index model, and a retrieval model. The main work is as follows:
     ① The thesis presents an improved focused crawling model that addresses the "topic drift" problem. It comprises a feedback-based subject knowledge base, a topic identification model, and a link analysis model. Through continuous feedback of topic keywords, the subject knowledge base gains a degree of adaptive capacity. Taking the different weights of HTML tags into account, the thesis presents an improved vector space model (VSM) algorithm to determine the topic similarity of a page. By parsing each HTML document into a DOM tree, the thesis proposes a link context model to determine the topic similarity of a URL more accurately.
     ② The thesis studies the principles of full-text search and the structure of the inverted index in depth. On this basis, it presents a hybrid index model based on the subject knowledge base to improve the efficiency and accuracy of indexing. It then designs the search workflow based on the hybrid index and analyzes the ranking of search results in combination with the vector space model.
     ③ Finally, the thesis implements a prototype vertical search engine system for the hardware domain on the Nutch framework. Experiments show that the system achieves good precision and a degree of self-adaptation, alleviates the "topic drift" problem to some extent, and largely achieves the research goals. It also provides a theoretical and experimental basis for follow-up study.
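The link context model in item ① could look roughly like the following sketch, which scores each link from its anchor text and the text of its nearest enclosing block element. This is an illustration under stated assumptions and not the thesis's model: the abstract does not define the link context threshold, so `max_context_tokens` (a cap on how many context tokens are considered), `anchor_weight`, `BLOCK_TAGS`, and the simple term-overlap score are all hypothetical. The parser also assumes well-formed markup with explicit closing tags.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "td", "div"}  # assumed containers that delimit link context


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real Chinese word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(tokens, topic_terms):
    """Fraction of tokens that are topic terms (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic_terms) / len(tokens)


class LinkContextParser(HTMLParser):
    """Collect each link with its anchor text and its enclosing block's text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # stack of text buffers, one per open block element
        self.open_link = None  # record for the <a> element currently being read
        self.links = []        # finished link records

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks.append([])
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Share the block's buffer so text after the link is included too.
                context = self.blocks[-1] if self.blocks else []
                self.open_link = {"href": href, "anchor": [], "context": context}

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.blocks:
            self.blocks.pop()
        elif tag == "a" and self.open_link is not None:
            self.links.append(self.open_link)
            self.open_link = None

    def handle_data(self, data):
        if self.blocks:
            self.blocks[-1].append(data)
        if self.open_link is not None:
            self.open_link["anchor"].append(data)


def score_links(html, topic_terms, max_context_tokens=20, anchor_weight=0.7):
    """Return {href: score}, mixing anchor-text and context overlap with the topic."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    scores = {}
    for link in parser.links:
        anchor_tokens = tokenize(" ".join(link["anchor"]))
        context_tokens = tokenize(" ".join(link["context"]))[:max_context_tokens]
        scores[link["href"]] = (
            anchor_weight * overlap(anchor_tokens, topic_terms)
            + (1 - anchor_weight) * overlap(context_tokens, topic_terms)
        )
    return scores
```

A crawler could then visit the highest-scoring URLs first, so that links surrounded by on-topic text are followed before links in unrelated parts of the page.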
