Implementing an Anti-Anti-Crawler Mechanism and Data Ordering under the Scrapy Framework

Published English title: Implementation of Anti-reptile and Data Ordering Under Scrapy Framework
Authors: XIANG Yang (向洋); DONG Linlu (董林鹭); SONG Hong (宋弘)
Keywords: Scrapy framework; web crawler; data ordering; anti-anti-crawler mechanism
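Journal code: YBSG
Journal: Journal of Yibin University (宜宾学院学报)
Affiliation: School of Automation and Information Engineering, Sichuan University of Science and Engineering
Publication date: 2019-03-27 11:40
Publisher: Journal of Yibin University
Year: 2019
Volume and number: v.19; No.245
Language: Chinese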
Article ID: YBSG201906011
Page count: 5
Issue: 06
CN: 51-1630/Z
Pages: 48-52

Abstract

When a crawler fetches data too quickly, it is easily blocked by a website's anti-crawler mechanism, so a way of evading such defenses is needed. Drawing on the anti-crawler techniques in common use today, this paper proposes an anti-anti-crawler mechanism. First, the URLs of the pages to be crawled are identified, and the number of times the crawler is blocked is observed with no anti-anti-crawler measures enabled. An anti-anti-crawler mechanism is then designed against the target site's anti-crawler defenses. Experimental results show that a mechanism that picks the User-Agent and the IP address at random, tracks the target page's Referer, and disables cookies evades anti-crawler blocking: the crawler retrieves more target pages, and both its success rate and its efficiency improve.
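
The four measures named in the abstract map onto standard Scrapy extension points. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `RandomIdentityMiddleware`, the module path `myproject.middlewares`, the User-Agent strings, and the proxy addresses are all hypothetical placeholders. It relies only on documented Scrapy behavior: a downloader middleware exposes `process_request(request, spider)`, a proxy is selected per request through `request.meta["proxy"]`, and `COOKIES_ENABLED` and `REFERER_ENABLED` are built-in settings.

```python
import random

# Hypothetical pools; a real crawl would load these from configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0",
]
# Placeholder proxies in the documentation-only 192.0.2.0/24 range.
PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]


class RandomIdentityMiddleware:
    """Downloader middleware sketch: random User-Agent and proxy per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, rng=None):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a randomly drawn one.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads meta["proxy"].
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # returning None lets Scrapy continue with the request


# Values that would go in settings.py (the module path is hypothetical).
SETTINGS = {
    "COOKIES_ENABLED": False,   # do not store or send cookies
    "REFERER_ENABLED": True,    # built-in middleware fills in the Referer header
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.RandomIdentityMiddleware": 543,
    },
}
```

The priority 543 is an assumption chosen so that the middleware runs after Scrapy's default User-Agent middleware (500) and before its proxy middleware (750). The class imports nothing from Scrapy because Scrapy calls middleware methods by name, which also makes it easy to exercise with a stand-in request object.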
