Research and Application of a Search Engine Based on Focused Relevance Ranking

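English title: Research and Development of Search Engine Based on Focus Relevance Ranking
Author: Wen Quan (温泉)
Degree level: Master's
Discipline: Computer Architecture
Keywords (Chinese): vertical search engine; PageRank; focused relevance; topic crawler; user behavior model
Keywords (English): Vertical Search Engine; PageRank; Focus Relevance; Topic Crawler; User Behavior Model
Degree year: 2010
Advisor: Ding Xiangwu (丁祥武)
Discipline code: 081201
Degree-granting institution: Donghua University
Thesis submission date: 2010-01-01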

Abstract

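Search engines are an important tool for obtaining useful information from massive volumes of web data, and they are a key topic in web information research and applications. With the explosive growth and increasing diversification of online information, retrieving the information one needs quickly and effectively is becoming more and more difficult. General-purpose search engines can no longer meet users' demands for retrieval accuracy, and specialized, topic-oriented vertical search engines are becoming a research focus. Relevance ranking is one of the key technologies in a search engine; it plays a vital role in acquiring topic-relevant data and in providing relevant query result sets.
     This thesis studies relevance techniques in vertical search engines and analyzes their shortcomings. It then proposes improved models and algorithms for topic crawling, link-structure-based ranking, and page-weight-based ranking, in order to raise the quality of relevance ranking and thereby improve vertical search engine performance. Finally, a domain-oriented vertical search engine system is designed and implemented. The main contributions of the thesis are:
     (1) To address the inability of topic crawlers to pass through "dark tunnels", an online learning method together with an auxiliary function is used to improve the topic crawler's crawling strategy, so that it can fetch topic data of higher relevance.
     (2) The PageRank algorithm and its improved variants are studied. By modeling users' page-click behavior, the way PageRank values are passed between links is modified, yielding an improved algorithm. Experiments show that the algorithm effectively avoids topic drift without requiring additional storage space.
     (3) To address the excessively high dimensionality of page-weight feature extraction models, a custom method for defining page weight is proposed: the factors of page weight are defined, and a separability criterion is used to measure the weight of each factor, giving an evaluation function for page weight that effectively reduces the dimensionality of the page feature space.
     (4) The three improvements above are combined into a focused relevance ranking scheme, which is applied in the implementation of the search engine.
     (5) Using the Lucene full-text search engine framework, a vertical search engine system for automobile-topic resources is implemented. Practical use shows that focused relevance ranking improves the relevance, recall, and precision of this vertical search engine to varying degrees.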
Search engines are the most important tool for obtaining useful information from massive web data, and they are a key subject of web information research and development. With the explosive growth and diversification of web information, however, it is becoming increasingly difficult to retrieve the desired information quickly and effectively. Traditional search engines cannot meet users' requirements for high-precision retrieval, and vertical search engines, which are specialized and topic-oriented, are becoming a research hot spot. Relevance ranking is a core technology of the vertical search engine; it plays an important role in retrieving topic data and providing relevant search results.
     This thesis investigates the key issues of relevance ranking in vertical search engines and describes improved models and algorithms for topic crawling, ranking based on link structure, ranking based on page weight, and related areas. Improving the quality of relevance ranking in turn improves the performance of the vertical search engine. Finally, a domain-oriented vertical search engine is designed and developed. The main contributions of the thesis include:
     (1) To address the problem that a topic crawler cannot get through the dark tunnel, an online learning method and an auxiliary function are used to improve the crawler's topic crawling strategy, enabling it to retrieve topic data of high relevance.
     (2) The PageRank algorithm and its improved variants are studied. By modeling users' page-click behavior, the way PageRank values are delivered between links is improved. The resulting algorithm needs no additional storage space and can prevent topic drift.
     (3) To address the high dimensionality of the feature extraction model, a customized page weight method is described that defines the factors of page weight and uses a separability criterion to weigh them, yielding an evaluation function that reduces the dimension of the feature vector.
     (4) A focused relevance ranking strategy is described that integrates the three improvements above, and it is put into practice in the development of the search engine.
     (5) The Lucene full-text search engine framework is used to develop a search engine system oriented to the automobile topic. Practical application shows that the focused relevance ranking strategy improves the search engine's relevance, recall, and precision.
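
The idea in contribution (2), changing how PageRank values are passed between links according to user click behavior, can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes that each page's rank is split among its outlinks in proportion to observed click counts, with add-one smoothing. The function name `click_weighted_pagerank`, the `clicks` dictionary, and the damping value 0.85 are all hypothetical choices made for illustration.

```python
def click_weighted_pagerank(links, clicks, damping=0.85, iterations=50):
    """Illustrative PageRank variant (an assumption, not the thesis's method).

    links:  dict mapping a page to the list of pages it links to
    clicks: dict mapping (source, target) to an observed click count
    """
    pages = set(links) | {dst for outs in links.values() for dst in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            outs = links.get(src, [])
            if not outs:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[src] / n
                for p in pages:
                    new_rank[p] += share
                continue
            # Add-one smoothing so links with no clicks still pass some rank.
            weights = [clicks.get((src, dst), 0) + 1 for dst in outs]
            total = sum(weights)
            for dst, w in zip(outs, weights):
                new_rank[dst] += damping * rank[src] * w / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: users leaving "a" mostly click through to "b".
    demo_links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    demo_clicks = {("a", "b"): 9}
    for page, score in sorted(click_weighted_pagerank(demo_links, demo_clicks).items()):
        print(page, round(score, 4))
```

With an empty `clicks` dictionary the sketch reduces to standard PageRank with uniform splitting, so the click counts change only how rank is divided among existing links and add no per-page state beyond the rank vector.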
