Abstract
Based on the Scrapy framework, this paper designs and implements a crawler for Wenzhou rental-housing listings. The crawler adopts a distributed architecture, applies a de-duplication algorithm to filter out duplicate URLs and improve crawling efficiency, and proposes several countermeasures to common anti-crawler strategies. The collected data are stored in a MongoDB database. Testing and analysis of the data show how location and orientation affect rental prices in Wenzhou and identify cost-effective rental options.
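The abstract does not name the de-duplication algorithm used; one common choice in distributed crawlers is a Bloom filter, which tests URL membership in constant time and bounded memory at the cost of a small false-positive rate. The following is a minimal, self-contained Python sketch of that idea (the class and function names are illustrative, not taken from the paper):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sketch)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)  # bit array backing store

    def _positions(self, url):
        # Derive k bit positions from salted MD5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True may rarely be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


def dedupe(urls, bloom=None):
    """Yield each URL only the first time it is seen."""
    bloom = bloom or BloomFilter()
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            yield url
```

In a Scrapy deployment this logic would typically live in a custom duplicate filter (configured via the `DUPEFILTER_CLASS` setting) so that repeated request URLs are dropped before scheduling.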