Setting scrapy.Request.priority has no effect?
Source: 2-1 Installing and basic usage of PyCharm
Max_wen
2022-07-26
According to the documentation's explanation of priority, this attribute controls the order in which the scheduler issues requests, but in my actual tests it has no effect. Could the instructor please help explain?
Spider code:
import datetime

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 1
    }

    def start_requests(self):
        urls = {
            10: 'https://www.baidu.com/s?wd=1',
            20: 'https://www.baidu.com/s?wd=2',
            30: 'https://www.baidu.com/s?wd=3'
        }
        for index, url in urls.items():
            yield scrapy.Request(url,
                                 headers={
                                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
                                 },
                                 callback=self.parse,
                                 priority=index,
                                 meta={'priority': index},
                                 dont_filter=True)

    def parse(self, response, **kwargs):
        self.log(datetime.datetime.now().strftime('%H:%M:%S'))
        self.log(response.request.url)
        title = response.xpath('//title/text()').get()
        self.log(title)
Execution log:
2022-07-26 16:05:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=1> (referer: None)
2022-07-26 16:05:39 [test] DEBUG: 16:05:39
2022-07-26 16:05:39 [test] DEBUG: https://www.baidu.com/s?wd=1
2022-07-26 16:05:39 [test] DEBUG: 1_百度搜索
2022-07-26 16:05:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=2> (referer: None)
2022-07-26 16:05:45 [test] DEBUG: 16:05:45
2022-07-26 16:05:45 [test] DEBUG: https://www.baidu.com/s?wd=2
2022-07-26 16:05:45 [test] DEBUG: 2_百度搜索
2022-07-26 16:05:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=3> (referer: None)
2022-07-26 16:05:52 [test] DEBUG: 16:05:52
2022-07-26 16:05:52 [test] DEBUG: https://www.baidu.com/s?wd=3
2022-07-26 16:05:52 [test] DEBUG: 3_百度搜索
2022-07-26 16:05:52 [scrapy.core.engine] INFO: Closing spider (finished)
The official documentation says:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
So a higher priority value should execute earlier, yet in my code the priority-30 request still executes last. I have tested this many times with the same result.
1 Answer
bobby
2022-07-27
priority only takes effect under two preconditions. First, Scrapy's scheduler must be using a priority queue, so that a newly enqueued request can be placed ahead of the others. Second, no matter how high your priority is, it can only jump ahead of requests that are still sitting in the queue: if the downloader is fast enough, the earlier URL will already have been taken by the downloader before your later request even enters the queue, so the later request still runs afterwards.
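The second precondition can be sketched with a toy scheduler: a plain heapq min-heap keyed on the negated priority (a simplified model for illustration, not Scrapy's actual scheduler code). When the downloader pulls each request as soon as it is enqueued, which is what happens here since start_requests is consumed lazily and the downloader is idle between yields, priority never gets a chance to reorder anything:

```python
import heapq

def run(requests, lazy):
    """Toy model of a priority-queue scheduler.

    requests: list of (priority, url) tuples.
    lazy=True models an idle downloader grabbing each request
    immediately after it is enqueued; lazy=False models all
    requests being enqueued before any download starts.
    """
    heap, order = [], []
    for prio, url in requests:
        # negate priority so the min-heap pops the highest priority first
        heapq.heappush(heap, (-prio, url))
        if lazy:
            # downloader takes the only queued request right away
            order.append(heapq.heappop(heap)[1])
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order

reqs = [(10, 'wd=1'), (20, 'wd=2'), (30, 'wd=3')]
print(run(reqs, lazy=True))   # ['wd=1', 'wd=2', 'wd=3'] -- matches the log above
print(run(reqs, lazy=False))  # ['wd=3', 'wd=2', 'wd=1'] -- priority becomes visible
```

In other words, priority only reorders requests that coexist in the queue; with CONCURRENT_REQUESTS at 1 and only three start URLs, each request is alone in the queue when the downloader pulls it.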