scrapy.Request.priority has no effect?

Source: 2-1 Installing PyCharm and basic usage

Max_wen

2022-07-26

According to the documentation, the priority attribute controls the order in which the scheduler dispatches requests, but in my tests it has no effect. Could the instructor please explain?

The spider code is as follows:

import datetime

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 1
    }

    def start_requests(self):
        urls = {
            10: 'https://www.baidu.com/s?wd=1',
            20: 'https://www.baidu.com/s?wd=2',
            30: 'https://www.baidu.com/s?wd=3'
        }

        for index, url in urls.items():
            yield scrapy.Request(url,
                                 headers={
                                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
                                 },
                                 callback=self.parse,
                                 priority=index,
                                 meta={'priority': index},
                                 dont_filter=True)

    def parse(self, response, **kwargs):
        self.log(datetime.datetime.now().strftime('%H:%M:%S'))
        self.log(response.request.url)
        title = response.xpath('//title/text()').get()
        self.log(title)

Execution log:

2022-07-26 16:05:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=1> (referer: None)
2022-07-26 16:05:39 [test] DEBUG: 16:05:39
2022-07-26 16:05:39 [test] DEBUG: https://www.baidu.com/s?wd=1
2022-07-26 16:05:39 [test] DEBUG: 1_百度搜索
2022-07-26 16:05:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=2> (referer: None)
2022-07-26 16:05:45 [test] DEBUG: 16:05:45
2022-07-26 16:05:45 [test] DEBUG: https://www.baidu.com/s?wd=2
2022-07-26 16:05:45 [test] DEBUG: 2_百度搜索
2022-07-26 16:05:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=3> (referer: None)
2022-07-26 16:05:52 [test] DEBUG: 16:05:52
2022-07-26 16:05:52 [test] DEBUG: https://www.baidu.com/s?wd=3
2022-07-26 16:05:52 [test] DEBUG: 3_百度搜索
2022-07-26 16:05:52 [scrapy.core.engine] INFO: Closing spider (finished)

The official documentation explains:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

So a higher priority value should execute earlier, yet the request with priority 30 still runs last. I have tested this many times and the result is always the same.


1 Answer

bobby

2022-07-27

priority only takes effect under two preconditions. First, Scrapy's scheduler must be using a priority queue, so that a newly enqueued request with a higher priority is placed ahead of the others already waiting. Second, the higher-priority request must actually be in the queue at the moment the downloader asks for the next request: no matter how high its priority, it can only outrank requests that are still queued. If the downloader is fast enough, your earlier URLs have already been taken by the downloader before the later, higher-priority ones are even enqueued, so they still end up running last.
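The race described above can be sketched with a plain `heapq` min-heap. This is not Scrapy's actual scheduler code, just a minimal model of how a priority queue behaves (Scrapy negates the priority internally so that higher values pop first): priority only decides ordering among requests that are in the queue at pop time.

```python
import heapq


def pop_order(events):
    """Replay a sequence of ('push', priority) / ('pop',) events
    against a max-priority queue and return the pop order."""
    heap, order = [], []
    for event in events:
        if event[0] == 'push':
            # Negate the priority: heapq is a min-heap, so the
            # highest priority becomes the smallest key.
            heapq.heappush(heap, -event[1])
        else:
            order.append(-heapq.heappop(heap))
    return order


# Scenario 1: the downloader pops each request as soon as it is
# pushed (one free slot, requests from start_requests consumed one
# at a time) -- priority never gets a chance to matter.
interleaved = pop_order([('push', 10), ('pop',),
                         ('push', 20), ('pop',),
                         ('push', 30), ('pop',)])
print(interleaved)  # [10, 20, 30] -- plain insertion order

# Scenario 2: all three requests are queued before the first pop --
# now priority decides the order.
batched = pop_order([('push', 10), ('push', 20), ('push', 30),
                     ('pop',), ('pop',), ('pop',)])
print(batched)  # [30, 20, 10] -- highest priority first
```

Scenario 1 matches the log in the question: with `CONCURRENT_REQUESTS = 1` each request is handed to the downloader before the next one is scheduled, so the observed order is simply the order in which `start_requests` yielded them.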

bobby replied to Max_wen:

If you want to verify this, set a breakpoint in the Scrapy source at the point where the scheduler's queue is read, then check in the debugger whether the request being fetched is the highest-priority one. You can also try the approach in https://stackoverflow.com/questions/8768439/how-to-give-delay-between-each-requests-in-scrapy

2022-07-29

Build a Search Engine with Scrapy: a Python distributed crawler course
