Abstract
Crawlers that fetch pages too quickly are easily blocked by a site's anti-crawler mechanisms, so a way to circumvent those mechanisms is needed. Drawing on the anti-crawler techniques in common use today, this paper proposes an anti-anti-crawler mechanism. The URLs of the pages to be crawled are identified first, and the number of times the crawler is blocked is recorded with no countermeasures in place; an anti-anti-crawler mechanism is then designed against the target site's anti-crawler measures. Experimental results show that a mechanism combining a randomly selected User-Agent, random IP addresses, tracking of the target page's Referer, and disabled cookies evades anti-crawler blocking, increases the number of target pages retrieved, and improves both the success rate and the efficiency of the crawl.
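The four measures named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation, which the abstract does not show. The pools `USER_AGENTS` and `PROXIES` and the helper `build_request` are hypothetical names introduced here, and the proxy addresses are placeholders. The sketch only constructs the request and the opener; it does not send anything.

```python
import random
import urllib.request

# Hypothetical pools (assumptions for illustration, not from the paper).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = [
    "http://127.0.0.1:8001",  # placeholder proxy endpoints
    "http://127.0.0.1:8002",
]


def build_request(url, previous_url=None, rng=random):
    """Prepare one fetch with a random User-Agent, a random proxy,
    a Referer taken from the previously visited page, and no cookie handling.

    Returns (request, opener, proxy); call opener.open(request) to send it.
    """
    # Random User-Agent for every request.
    headers = {"User-Agent": rng.choice(USER_AGENTS)}

    # Referer tracking: present the page that led to this URL, if any.
    if previous_url is not None:
        headers["Referer"] = previous_url

    request = urllib.request.Request(url, headers=headers)

    # Random IP: route this request through a randomly chosen proxy.
    proxy = rng.choice(PROXIES)
    proxy_handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})

    # Cookies disabled: build_opener() installs no HTTPCookieProcessor
    # unless one is passed in, so Set-Cookie responses are never stored
    # or replayed by this opener.
    opener = urllib.request.build_opener(proxy_handler)
    return request, opener, proxy


if __name__ == "__main__":
    req, opener, proxy = build_request(
        "http://example.com/page/2", previous_url="http://example.com/page/1"
    )
    print(req.get_header("User-agent"))
    print(req.get_header("Referer"))
    print(proxy)
```

In a Scrapy project the cookie part would normally be handled by the `COOKIES_ENABLED = False` setting, with header and proxy rotation placed in a downloader middleware; the standard-library version above is used here only so that the example is self-contained.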