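Research on an RSS-Based Focused Web Crawler for University Website Groups

English title: Research of Focused Crawler about Group of University Website Based on RSS
Author: 张睿涵 (Zhang Ruihan)
Thesis level: Master's
Discipline: Computer Application Technology
Chinese keywords: 聚焦网络爬虫 ; RSS ; PageRank算法 ; TF-IDF算法 ; 增量式抓取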
English keywords: Focused Web crawler ; RSS ; PageRank algorithm ; TF-IDF algorithm ; Incremental crawl
Degree year: 2012
Supervisor: 林振荣 (Lin Zhenrong)
Discipline code: 081203
Degree-granting institution: 南昌大学 (Nanchang University)

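Abstract

The web is growing rapidly and the number of pages keeps increasing. To find the information they need, people often have to browse a large number of pages, which costs time and effort and still does not guarantee that they get the latest and most complete information. Publishers of online information, for their part, want more users to read their content as soon as it appears. Much research has grown out of this need, for example search engines backed by web crawlers and RSS information push. Each approach has its limitations, however. Suppose we want the latest notices from all the websites of a university, grouped by category, such as all of its latest research-related notices. Searching with a search engine gives unsatisfactory results. RSS can push the latest information by category, but only for websites that provide an RSS feed. For targets such as university website groups, which were built early on without RSS push, it cannot help.
     This thesis therefore studies a focused web crawler based on RSS to solve the above problem and applies it to a university website group, with good results. The principle is to use a focused web crawler to crawl, analyse and process the data of the target website group and then provide RSS push. In this way, users can subscribe by category, through an RSS reader, to the latest information even from websites that provide no RSS feed. This removes the need to browse many pages in search of information and avoids missing information through oversight.
     The main research contents of this thesis are:
     (1) A new focused web crawler based on RSS is proposed, so that users can subscribe to and read, in an RSS reader, the latest information from websites that provide no RSS feed. Useless content such as advertisements is filtered out, and users no longer have to hunt for information.
     (2) The text of crawled pages is classified with the TF-IDF algorithm. The step in which TF-IDF extracts the feature vector of each category is improved to take the characteristics of web pages into account, so that the extracted feature vectors represent their categories better and classification is more accurate.
     (3) Incremental crawling is improved. Building on the traditional incremental crawling algorithm, a new algorithm for computing the predicted update time is proposed, which brings the prediction closer to the actual update time, reduces system overhead and improves efficiency.
     (4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of such website groups to raise the crawler's recall.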
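The final step of this principle, publishing crawled items as an RSS feed, could be sketched roughly as follows. This is only an illustrative sketch under assumptions: the function name `build_rss` and the item keys (`title`, `link`, `category`, `published`) are hypothetical and do not come from the thesis, which does not describe its feed format in this abstract. The output follows the general RSS 2.0 layout of a `channel` containing `item` elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, items):
    """Build an RSS 2.0 document (as a string) from crawled notices.

    `items` is a list of dicts with the hypothetical keys
    title, link, category and published (a timezone-aware datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Latest notices collected by the crawler"
    # Emit the newest notices first so readers see recent items at the top.
    for item in sorted(items, key=lambda i: i["published"], reverse=True):
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "category").text = item["category"]
        # RSS dates use the RFC 822 style produced by format_datetime.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")
```

Because each item carries a `category` element, a reader or a per-category feed endpoint can filter on it, which matches the idea of subscribing by category.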
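
For orientation, item (2) can be illustrated with plain TF-IDF classification. The sketch below shows only the standard technique (term frequency times log inverse document frequency, compared by cosine similarity); it does not include the thesis's web-page-specific improvement, which this abstract does not detail. The function names and the sample tokens are hypothetical, and tokenization (including Chinese word segmentation) is assumed to happen elsewhere.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, category_vectors):
    """Pick the category whose feature vector is most similar to the text."""
    query = Counter(tokens)
    return max(category_vectors, key=lambda c: cosine(query, category_vectors[c]))


# Hypothetical training text: one token list per category.
names = ["research", "teaching"]
training = [
    ["research", "project", "grant", "notice"],
    ["course", "exam", "schedule", "notice"],
]
category_vectors = dict(zip(names, tf_idf_vectors(training)))
label = classify(["grant", "application", "notice"], category_vectors)
```

Here "notice" appears in both categories, so its inverse document frequency is zero and it contributes nothing to the decision, while "grant" pulls the sample text towards the research category.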

Abstract (English)

The Internet is developing rapidly and the number of web pages keeps growing, so people who want to find the information they need have to read through a large number of pages. This wastes time and energy, and it still may not yield the latest and most complete information. Publishers of online information, in turn, hope that more users can read their content in real time. A lot of research has emerged to meet this demand, such as search engines supported by web crawlers and RSS information push technology. Both have limitations. For example, we may need the latest notices from all the sites of a university by category, such as the latest notices in the research category. A typical search engine cannot return a satisfactory result. RSS can push the latest information by category, but only from websites that provide an RSS feed, so it cannot serve websites that never offered one, such as a university website group.
     Therefore, this study focuses on a focused crawler based on RSS that solves the above problem, and applies it to a university website group, where it achieved good results. Its principle is to use the focused web crawler to crawl, analyse and process the data of the site group, and then offer an RSS feed. In this way, even for websites without RSS feeds, people can use an RSS reader to subscribe to their latest information by category. This saves the time spent flipping through pages to find the latest information and reduces the chance of overlooking information.
     The main contents of the study are as follows:
     (1) A new focused crawler based on RSS is proposed, so that users can use an RSS reader to subscribe to and read the latest information from sites that do not provide an RSS feed. It filters out unwanted ads and spam and removes the trouble of searching for information.
     (2) The TF-IDF algorithm is used to classify the text of the pages. The extraction of category feature vectors is improved based on the characteristics of web pages, which makes the feature vectors more representative and the classification more accurate.
     (3) The incremental crawling of the web crawler is improved. A new algorithm for predicting update times is proposed on the basis of the traditional incremental algorithm, bringing the prediction closer to the actual update time, reducing system overhead and improving efficiency.
     (4) The focused crawler based on RSS is applied to a university website group, and the PageRank algorithm is improved based on the characteristics of the university website group to raise the recall rate of the web crawler.
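
Item (4) builds on PageRank. The abstract does not say how the thesis adapts the algorithm to university website groups, so the following is only a minimal sketch of standard PageRank by power iteration, offered as background. The function name, the damping factor of 0.85, the iteration count and the sample link graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a {page: rank} dict whose values sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph of a small site group.
site_links = {
    "home": ["research", "teaching"],
    "research": ["home"],
    "teaching": ["home"],
    "notice": ["home"],
}
ranks = pagerank(site_links)
top_page = max(ranks, key=ranks.get)
```

A crawler can use such scores to order its URL queue, visiting highly ranked pages first. In this sample graph every other page links to "home", so it receives the highest rank.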

