Research of Searching Strategy in Candidate Link Introducing Topic Link Blocking Factor

Original title (Chinese): 引入主题链接块因子的候选链接搜索策略研究
Authors: ZHOU Xue (周雪); LIU Naiwen (刘乃文)
Keywords: web page segmentation; Shark-search algorithm; link structure; topic link block
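Journal: Computer & Digital Engineering (计算机与数字工程)
Journal code: JSSG
Affiliations: School of Information Science and Engineering, Shandong Normal University; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology
Publication date: 2018-05-20
Publisher: Computer & Digital Engineering
Year: 2018
Volume and issue: Vol. 46, issue 05 (No. 343)
Language: Chinese
Record ID: JSSG201805006
Pages: 29-33 (5 pages)
CN: 42-1372/TP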

Abstract

During focused web crawling, the weight of every URL found on a page must be computed so that the crawl queue is continually refilled and the crawl conditions are met. The key problem is how to find the links most relevant to the topic without causing "topic drift". Because a link's anchor text is short, it cannot clearly indicate how relevant the linked page is to the topic. To address this, the paper builds on the Shark-search algorithm by introducing a weight for relevant link blocks, computed from the anchor texts of the child links inside each block. Comparative experiments verify that the improved algorithm better differentiates the relevance scores of links on the same page, improves the crawler's precision, and mitigates topic drift.
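The abstract does not give the paper's scoring formula, so the following is only a minimal sketch of the general idea of a link-block factor. It assumes that a page has already been split into blocks of `(url, anchor_text)` pairs, that relevance is measured by a simple bag-of-words cosine similarity, and that a block's weight is the mean similarity of its anchor texts. The function names and the mixing parameter `alpha` are hypothetical and are not taken from the paper or from Shark-search itself.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two lower-cased, whitespace-tokenised strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def block_weight(topic: str, anchors: list[str]) -> float:
    """Assumed block weight: mean topic similarity of the block's anchor texts."""
    if not anchors:
        return 0.0
    return sum(cosine(topic, a) for a in anchors) / len(anchors)


def score_links(topic, blocks, alpha=0.5):
    """Blend each link's own anchor similarity with the weight of its block.

    alpha is a hypothetical mixing parameter, not a value from the paper.
    """
    scores = {}
    for block in blocks:
        w = block_weight(topic, [anchor for _, anchor in block])
        for url, anchor in block:
            scores[url] = alpha * cosine(topic, anchor) + (1 - alpha) * w
    return scores


# Two links share the uninformative anchor "more", but only "b" sits in a
# block that also contains a topic-relevant anchor.
blocks = [
    [("a", "python web crawler"), ("b", "more")],
    [("c", "more")],
]
scores = score_links("web crawler", blocks)
```

In this toy input, links `b` and `c` would be indistinguishable by anchor text alone. Under the stated assumptions, the block factor gives `b` a higher score than `c`, which is the kind of separation between links on the same page that the abstract describes.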

