Research on Focused Crawling Combining Synthetic Web-Page Information and Domain Ontology

Author: Guan Xin (关鑫)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Focused Crawling; Ontology; Anchor Text; Term Location
Degree year: 2010
Advisor: Ouyang Dantong (欧阳丹彤)
Discipline code: 081202
Degree-granting institution: Jilin University
Submission date: 2010-04-01

Abstract

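Focused crawling, guided by background knowledge, uses a web-page analysis algorithm to filter out off-topic pages and to predict and fetch on-topic pages. It is important both for extracting needed information from massive amounts of data and for searching within a specific domain.
     The main work of this thesis is to study the use of an ontology as background knowledge to guide the focused crawling strategy, combining the synthetic information of a URL with the ontology in order to improve crawling efficiency. Building on the traditional crawling framework, the thesis analyzes web-page content in detail and points out that information at certain positions in a page is highly significant for revealing the page's topic. The algorithm extracts a feature vector from the web document, applies document-position weight factors to it, and matches it against the concepts of the ontology to obtain the page's topic relevance; it uses extended anchor text to predict the topic relevance of hyperlinks. A crawling strategy is designed by combining the computed page relevance with the predicted link relevance, and it is compared with existing ontology-based crawling strategies.
     Experiments show that the harvest rate of the proposed strategy is clearly better than that of the other strategies in the comparison. Analysis of a large amount of experimental data indicates that combining synthetic web-page information with a domain ontology to guide focused crawling can effectively raise the harvest rate.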
Since 1994, web search engines have developed significantly. They solve the problem of indexing and quickly locating massive resources on the web, and they play an increasingly important role in people's lives. However, as data grows ever more rapidly, traditional search engines can no longer meet user needs; a search engine with poor semantic processing ability cannot meet users' demand for accuracy.
     Focused crawling is a technique for improving search engines and an intelligent application within the search domain. Its aim is to find web pages on a predefined topic. It classifies web pages using text categorization and hyperlink prediction techniques to achieve good search results.
     If focused crawling is integrated with semantic techniques, the crawler behaves during the search as if it were guided by a domain specialist. The search engine then not only returns search results but also supplies resources related to the topic. A focused crawling strategy is designed on top of a general search engine; it is in effect an extension of the traditional search engine. Under the direction of background knowledge, the crawler fetches as many relevant web pages as possible. A focused crawler covers a smaller range than a general crawler but obtains more precise results. It filters out irrelevant pages so that, with limited resources, it can fetch and store many relevant ones. The central questions of focused crawling are how to filter off-topic pages and how to obtain more on-topic pages.
     Marc Ehrig proposes an approach to document discovery built on a framework for ontology-focused crawling of web documents. An ontology is a description of concepts and their properties. It can describe background knowledge precisely and is a tool for knowledge representation, so ontology-focused crawling yields more satisfactory search results. In studying how to compute the relevance of a web page, we find that combining the location of terms in the document with the ontology gives a more precise relevance score. Traditional methods pay little attention to links. Our approach predicts the topic relevance of a link using extended anchor text and the relationships between links. The whole algorithm centers on these two points.
     The main work of this thesis builds on ontology-focused crawling. First, we extend this approach by analyzing the text of the web page, and point out that information at specific locations in a page plays an important role in determining its topic. Second, the approach analyzes the topic relevance of the links contained in the page.
     Anchor text is the text of a hyperlink; it summarizes the information behind the link. Because anchor text usually appears on other web pages, it represents the intention of those pages' authors, who want to tell users the subject of the target page and guide them to visit the URL with a brief description. Compared with text from a randomly selected web page, anchor text describes the target page much better. Predicting the topic relevance of a web page from anchor text is therefore an active research topic.
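
     The idea of predicting a link's relevance from its anchor text and nearby words can be illustrated with a minimal Python sketch. It is not the thesis's actual formula: the vocabulary `ONTOLOGY_TERMS`, the function `link_score`, and the blending factor `anchor_weight` are all hypothetical, and the score is simply the share of tokens that match ontology labels.

```python
import re

# Hypothetical finance vocabulary standing in for labels taken from an ontology.
ONTOLOGY_TERMS = {"stock", "bond", "fund", "bank", "interest", "market"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def hit_ratio(text):
    """Return the share of tokens in the text that match an ontology term."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in ONTOLOGY_TERMS for token in tokens) / len(tokens)


def link_score(anchor_text, context_text, anchor_weight=0.7):
    """Blend the hit ratio of the anchor text with that of its surrounding text.

    anchor_weight is an assumed value; the source does not state how the
    anchor text and its extended context are weighted.
    """
    return (anchor_weight * hit_ratio(anchor_text)
            + (1 - anchor_weight) * hit_ratio(context_text))
```

     With these assumed values, an anchor such as "stock news" with no surrounding text scores half of `anchor_weight`, since one of its two tokens matches the vocabulary.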
     The idea of the algorithm is as follows. When a web page is fetched, unimportant tags are deleted. The system then extracts the text from the page, counts high-frequency terms, and converts the text into a vector. To compute the topic score of the page, it checks whether each term of the vector maps to the concepts, properties, or instances of the ontology, and assigns the vector a topic score by combining location weights with the ontology. If the topic score of the page is higher than the threshold, all hyperlinks on the page are extracted and each is checked to see whether it has already been crawled. For each hyperlink that has not been crawled, the algorithm predicts a topic score, and hyperlinks enter different queues according to that score.
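
     The steps above can be sketched in Python under loudly stated assumptions. The vocabulary `ONTOLOGY_TERMS`, the weights in `LOCATION_WEIGHTS`, both thresholds, and the two-queue layout are hypothetical illustrations; the source gives neither concrete weights nor the number of queues, and exact string matching stands in for real ontology mapping.

```python
import re
from collections import Counter

# Hypothetical labels of ontology concepts, properties, and instances.
ONTOLOGY_TERMS = {"stock", "bond", "fund", "bank", "interest", "market"}

# Hypothetical location weights: terms in the title count more than body terms.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def page_score(sections):
    """Return the location-weighted share of term occurrences matching the ontology.

    sections maps a location name (e.g. "title") to the text found there.
    """
    matched = 0.0
    total = 0.0
    for location, text in sections.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for term, freq in counts.items():
            total += weight * freq
            if term in ONTOLOGY_TERMS:
                matched += weight * freq
    return matched / total if total else 0.0


def enqueue_links(topic_score, links, crawled, high_queue, low_queue,
                  page_threshold=0.3, link_threshold=0.5):
    """Route uncrawled links into queues if the page passes the threshold.

    links is a list of (url, predicted_score) pairs; both thresholds are
    assumed values chosen only for illustration.
    """
    if topic_score <= page_threshold:
        return  # page is off-topic, so its links are not followed
    for url, predicted in links:
        if url in crawled:
            continue  # skip links that were already crawled
        if predicted >= link_threshold:
            high_queue.append(url)
        else:
            low_queue.append(url)
```

     A crawler built this way would drain `high_queue` before `low_queue`, so links predicted to be on-topic are fetched first.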
     We study this problem in depth and propose a strategy based on ontology background knowledge and anchor-text information to improve accuracy. The thesis integrates the search engine with Semantic Web techniques such as the Resource Description Framework (RDF), ontologies, and reasoning. We construct a finance-domain ontology to realize the search strategy and, to test the advantage of the ontology-based strategy, run experiments on finance information.
     The most important criterion for measuring the effect of focused crawling is how well it selects relevant web pages and filters off-topic ones. The harvest rate is the fraction of crawled web pages that are relevant to the target topic.
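
     As a small illustration of this definition, the harvest rate can be computed from a list of per-page relevance judgments. The function name `harvest_rate` and the boolean input format are assumptions made for this sketch; how a page is judged relevant is outside its scope.

```python
def harvest_rate(relevance_flags):
    """Return the fraction of crawled pages judged relevant to the topic.

    relevance_flags holds one boolean per crawled page (True = relevant).
    An empty crawl yields 0.0 rather than raising a division error.
    """
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)
```

     For example, a crawl of four pages of which three are relevant has a harvest rate of 0.75.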
     The thesis designs three groups of experiments. The first computes the topic score of web pages by combining term-location weights with the ontology. The second predicts the relevance scores of hyperlinks. The third combines the first two and compares the four strategies.
     We conclude from the experimental results that our approach achieves higher efficiency and a higher harvest rate. The strategy uses a domain ontology as background knowledge and combines it with term-location weights to compute the topic score of web pages. It also uses anchor text and the dependency structure of the HTML to predict relevant hyperlinks. This strategy can be put to effective use in focused crawling research.
     Research on focused crawling has both theoretical value and broad application prospects. The thesis discusses several issues of focused crawling. The future of the web is promising, and our work is only a starting point for the research that remains to be done. How to turn focused crawling research into web applications and how to provide services according to different users' demands are the directions of our future research.

