How can parsing and follow-up crawling logic be abstracted and encapsulated so that multiple spiders can call it?

Source: 8-7 Implementing an IP proxy pool with Scrapy - 2

慕尼黑7546459

2020-02-20

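Hi teacher,
I'm crawling Douban movie data with scrapy-redis, and I ran into a problem while implementing crawl priority:
First, I designed two URL queues: queue 1 holds high-priority URLs, queue 2 holds low-priority URLs.
Second, I run two spiders: spider1 handles the high-priority URLs and spider2 handles the low-priority URLs.

Expectation: the only difference between spider1 and spider2 is the redis_key; everything else is identical, so I'd like to reuse the code.

My current idea is for spider1 to implement the complete logic, and for spider2 to call spider1's parse function so that the parsing and follow-up crawling logic is reused.
However, when spider2 calls spider1 and execution reaches the follow-up request in spider1, yield Request(url=xxx, callback=self.parse_xxx), it fails with this error: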

builtins.ValueError: Function <bound method Spider1.parse_celebrities
of <Spider1 'movie1' at 0x111f3d7f0>> is not a method of: <Spider2
'movie2' at 0x110fa8320>

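The code in spider2 that calls spider1:

def parse(self, response):
    # Delegating to a separate Spider1 instance is what triggers the ValueError above
    spider = Spider1()
    yield from spider.parse(response)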

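How should this be handled? More generally, when parsing and follow-up crawling logic needs to be encapsulated and reused by multiple spiders, what is a good way to do it?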
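For context on the error above: it most likely comes from request serialization. scrapy-redis stores each request in Redis, and to do that the callback is looked up by name on the spider that is running, so a method bound to a separate Spider1 instance is rejected when Spider2 is the one scheduling it. One common way around this is to put the shared logic in a base class that both spiders inherit from. The following is a minimal sketch under assumptions, not the course's solution: `Request` and `callback_name` are simplified stand-ins so the example runs without Scrapy, and `MovieSpiderBase` and both `redis_key` values are hypothetical names. In a real project the base class would subclass `scrapy_redis.spiders.RedisSpider` and use `scrapy.Request`.

```python
import inspect


class Request:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class MovieSpiderBase:
    """Hypothetical shared base: all parsing and follow-up logic lives here."""

    redis_key = None  # each subclass reads from its own queue

    def parse(self, response):
        # `response` is a plain dict here; a real spider would use selectors
        for url in response["celebrity_urls"]:
            # self is whichever spider is running, so the callback belongs to it
            yield Request(url=url, callback=self.parse_celebrities)

    def parse_celebrities(self, response):
        yield {"name": response["name"]}


class Spider1(MovieSpiderBase):
    name = "movie1"
    redis_key = "movie:high_priority_urls"  # hypothetical key


class Spider2(MovieSpiderBase):
    name = "movie2"
    redis_key = "movie:low_priority_urls"  # hypothetical key


def callback_name(spider, callback):
    """Simplified stand-in for the ownership check made when a request is serialized."""
    for name, method in inspect.getmembers(spider, predicate=inspect.ismethod):
        if method.__func__ is getattr(callback, "__func__", None):
            return name
    raise ValueError(f"Function {callback} is not a method of: {spider}")


spider2 = Spider2()
requests = list(spider2.parse({"celebrity_urls": ["https://example.com/celebrity/1"]}))
print(callback_name(spider2, requests[0].callback))
```

With this layout the two spiders differ only in `name` and `redis_key`, and every callback is a method of the spider that yields it.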

1 answer

bobby

2020-02-22

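I've already sent you the code for your question, but to help other students I'll record it here as well. The simple approach is to change three things:

1. Modify the scrapy-redis scheduler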

[Image: //img.mukewang.com/szimg/5e50ace50935b15c12240526.jpg]

2. Set the following in settings

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
3. Write your own script to add records to redis

import redis
import json

rd = redis.Redis("127.0.0.1", decode_responses=True)

# Seed the start URL, then queue (url, priority) pairs as JSON arrays
rd.lpush("cnblogs:start_urls", "https://news.cnblogs.com/")
urls = [
    ("https://news.cnblogs.com/n/656059/", 3),
    ("https://news.cnblogs.com/n/656053/", 5),
    ("https://news.cnblogs.com/n/656060/", 8),
    ("https://news.cnblogs.com/n/656052/", 20),
]
for url in urls:
    rd.lpush("cnblogs:urls", json.dumps(url))
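
To make the record format concrete, here is a rough sketch of how a consumer might decode the JSON records pushed by the script and order them by priority. It is an in-memory illustration only and is not the modified scheduler itself; `order_by_priority` is a hypothetical helper, and it assumes the second element of each pair is a priority where a larger number means "crawl earlier", as with Scrapy's `Request.priority`.

```python
import heapq
import json


def order_by_priority(raw_records):
    """Decode JSON [url, priority] records and return URLs, highest priority first."""
    heap = []
    for index, raw in enumerate(raw_records):
        url, priority = json.loads(raw)
        # heapq is a min-heap, so negate the priority;
        # index keeps insertion order for records with equal priority
        heapq.heappush(heap, (-priority, index, url))
    ordered = []
    while heap:
        _, _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


# Same record shape as the script above: json.dumps turns each tuple into a JSON array
records = [
    json.dumps(("https://news.cnblogs.com/n/656059/", 3)),
    json.dumps(("https://news.cnblogs.com/n/656052/", 20)),
    json.dumps(("https://news.cnblogs.com/n/656053/", 5)),
]
for url in order_by_priority(records):
    print(url)
```

In a real setup the ordering would be done by the Redis-backed priority queue rather than in Python; the sketch only shows what the `[url, priority]` records carry.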
