302 redirect problem when downloading images

Source: 4-25 Is there a way to parse the title and body content fairly accurately?

慕粉13276915582

2021-12-15

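Hello teacher, I have a question. I am trying to download images from Sina, but the image URL goes through a redirect, and following the redirect returns the full-size image.
When I fetch it with Scrapy, I get the error shown in the log below. I have searched for a long time and still cannot find a solution.
I have also configured the following:


Media redirects in settings.py: MEDIA_ALLOW_REDIRECTS = True
Allowed domains on the spider: allowed_domains = ['weibo.cn', 'wx1.sinaimg.cn']
I do not know which step I am missing.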
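
For reference, the media settings mentioned above would typically sit together in settings.py roughly as follows. This is only a minimal sketch: the original post does not show its pipeline configuration, so the `ITEM_PIPELINES` entry and the `IMAGES_STORE` path are assumptions, and only `MEDIA_ALLOW_REDIRECTS` is taken from the post.

```python
# settings.py (sketch; values marked "assumed" are not from the original post)

# Enable Scrapy's built-in image pipeline (assumed; the post does not show ITEM_PIPELINES).
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored (assumed path).
IMAGES_STORE = "images"

# Item field the pipeline reads image URLs from; "image_urls" is Scrapy's default
# and matches the field the spider fills in.
IMAGES_URLS_FIELD = "image_urls"

# Let media pipelines follow redirects such as the 302 in the log (from the post).
MEDIA_ALLOW_REDIRECTS = True
```

Note that Scrapy's `ImagesPipeline` also needs the Pillow package to be installed.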



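import scrapy
from scrapy.http import HtmlResponse
from scrapy.utils.project import get_project_settings

# WeiboItem is the project's own Item class; its import is not shown in the original post.


class PersonSpider(scrapy.Spider):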
    name = 'weibo_person'
    allowed_domains = ['weibo.cn', 'wx1.sinaimg.cn']
    settings = get_project_settings()
    base_url = f'https://weibo.cn/u/{settings.get("USER_URI")}'
    page = 2

    @staticmethod
    def comment_url(weibo_id):
        weibo_id = weibo_id.replace('M_', '')
        return f"https://weibo.cn/comment/{weibo_id}?ckAll=1"

    def start_requests(self):
        self.base_url = 'https://weibo.cn/comment/L6cjVzyed'
        yield scrapy.Request(url=self.base_url,
                             callback=self.parse_long_weibo,
                             meta={
                                 'base_url': self.base_url
                             })
                             
    def parse_long_weibo(self, response: HtmlResponse):
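        """
        Parse a long original Weibo post.
        :return:
        """
        weibo_item = WeiboItem()
        # '原图' ("original image") is the link text on the page that points to the full-size picture.
        main_pic = response.xpath("//div/a[text() = '原图']/@href")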
        pic_url = main_pic.extract_first()
        weibo_item['image_urls'] = [response.urljoin(pic_url)]
        weibo_item['weibo_id'] = '123'
        weibo_item['weibo'] = '123'
        weibo_item['url'] = '123'
        weibo_item['url_object_id'] = '123'
        yield weibo_item

2021-12-15 23:36:13 [scrapy.core.engine] INFO: Spider opened
2021-12-15 23:36:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-15 23:36:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-15 23:36:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/comment/L6cjVzyed> (referer: None)
2021-12-15 23:36:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://wx1.sinaimg.cn/large/d7562ea8gy1gxetb4izz7j20n01dsaei.jpg> from <GET https://weibo.cn/mblog/oripic?&id=L6cjVzyed&u=d7562ea8gy1gxetb4izz7j20n01dsaei&rl=1>
2021-12-15 23:36:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://wx1.sinaimg.cn/large/d7562ea8gy1gxetb4izz7j20n01dsaei.jpg> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2021-12-15 23:36:40 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://wx1.sinaimg.cn/large/d7562ea8gy1gxetb4izz7j20n01dsaei.jpg> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

1 answer

bobby

2021-12-21

This image can only be fetched when you are logged in. Have you already logged in here?

bobby
replying to
慕粉13276915582
So in other words, the image can be fetched with requests once cookies are set, but not with Scrapy? Have you perhaps not set COOKIES_ENABLED in settings?
2021-12-24
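
If the image really does require a logged-in session, one way to carry that session over is to copy the `Cookie` header from a logged-in browser and turn it into a dict, which is the form `scrapy.Request` accepts for its `cookies` argument. The following is a minimal sketch using only the standard library; the helper name `cookie_header_to_dict` and the sample cookie names and values are hypothetical placeholders, not taken from the thread.

```python
def cookie_header_to_dict(header: str) -> dict:
    """Split a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder header; a real one would be copied from a logged-in browser session.
cookies = cookie_header_to_dict("SUB=abc123; SSOLoginState=1639582573")
print(cookies)
```

The resulting dict could then be passed as `cookies=` on the requests the spider yields, or on the requests returned from an overridden `get_media_requests` in an `ImagesPipeline` subclass, so that the image downloads carry the same session. This assumes `COOKIES_ENABLED` has not been set to `False` in settings.py.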
