Implementation of an Anti-anti-crawler Mechanism and Data Ordering Under the Scrapy Framework
  • English title: Implementation of Anti-reptile and Data Ordering Under Scrapy Framework
  • Authors: XIANG Yang; DONG Linlu; SONG Hong
  • Affiliation: School of Automation and Information Engineering, Sichuan University of Science and Engineering
  • Keywords: Scrapy framework; web crawler; data ordering; anti-anti-crawler mechanism
  • Journal: Journal of Yibin University (宜宾学院学报; journal code YBSG)
  • Publication date: 2019-03-27
  • Year: 2019
  • Issue: v.19; No.245
  • Language: Chinese
  • Document ID: YBSG201906011
  • Pages: 48-52 (5 pages)
  • CN: 51-1630/Z
Abstract
When a crawler fetches data too quickly, it is easily intercepted by a website's anti-crawler mechanism, so a way of circumventing such mechanisms is needed. Drawing on the countermeasures currently in common use, an anti-anti-crawler mechanism is proposed. First, the URLs of the pages to be crawled are identified, and the number of times the crawler is intercepted is observed with no countermeasures enabled; an anti-anti-crawler mechanism is then designed against the target site's specific anti-crawler mechanism. Experimental results show that randomly selecting the user agent and IP address, tracking the target page's referer, and disabling cookies allow the crawler to evade anti-crawler blocking, increase the number of target pages successfully crawled, and improve both the success rate and the efficiency of crawling.
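The paper does not include source code; the countermeasures named in the abstract can be sketched as a Scrapy-style downloader middleware. This is a minimal illustration, not the authors' implementation: the class name, user-agent strings, proxy addresses, and referer URL below are all placeholders, and only the middleware interface (`process_request(request, spider)`) follows Scrapy's convention.

```python
import random

# Pools to rotate through on each request (all values are illustrative).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
]
PROXIES = [
    "http://10.0.0.1:8080",  # placeholder proxy addresses
    "http://10.0.0.2:8080",
]

class RandomUAProxyMiddleware:
    """Downloader-middleware sketch: random user agent, random proxy IP,
    and a fixed Referer, per the countermeasures the abstract describes."""

    def process_request(self, request, spider):
        # Pick a fresh user agent and proxy at random for every request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.meta["proxy"] = random.choice(PROXIES)
        # Present the target site itself as the referring page
        # (the real target URL is not given in the abstract).
        request.headers.setdefault("Referer", "https://example.com/")
        return None  # returning None lets Scrapy continue processing

# In settings.py one would additionally disable cookies and register
# the middleware (module path is hypothetical):
# COOKIES_ENABLED = False
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUAProxyMiddleware": 543,
# }
```

Disabling cookies (`COOKIES_ENABLED = False`) prevents the target site from correlating requests via session state, which complements the per-request rotation of user agent and proxy shown above.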
