Abstract
Based on the Scrapy framework, this paper designs and implements a crawler for Wenzhou rental-housing listings. The crawler adopts a distributed architecture, applies a de-duplication algorithm to filter out duplicate URLs and improve crawling efficiency, and proposes several countermeasures to common anti-crawler strategies. The collected data are stored in a MongoDB database. Testing and analysis of the data show how location and orientation affect rental prices in Wenzhou and identify cost-effective rental options.
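The abstract does not name the de-duplication algorithm used; one common choice in distributed crawlers is a Bloom filter, which tests URL membership in constant time and bounded memory at the cost of a small false-positive rate. The following is a minimal, self-contained Python sketch of that idea (the class and function names are illustrative, not taken from the paper):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sketch)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)  # bit array backing store

    def _positions(self, url):
        # Derive k bit positions from salted MD5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True may rarely be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


def dedupe(urls, bloom=None):
    """Yield each URL only the first time it is seen."""
    bloom = bloom or BloomFilter()
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            yield url
```

In a Scrapy deployment this logic would typically live in a custom duplicate filter (configured via the `DUPEFILTER_CLASS` setting) so that repeated request URLs are dropped before scheduling.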