Teacher, I get an error when downloading images

Source: 8-1 The contest between crawlers and anti-crawling measures, and the strategies involved

我是一只有宝贝的熊

2019-03-07

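I use Splash to render the pages in the spider's start_requests, and I wrote a custom image pipeline. The pipeline raises an exception, so my image downloads fail. The main problem is the results value I marked below. The error originates in MediaPipeline in the Scrapy source code, but I don't know how to fix it.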

# Imports these snippets rely on
import logging
import os
from io import BytesIO

from PIL import Image
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy_splash import SplashRequest

# Spider method
    def start_requests(self):
        """Override start_requests so that Splash renders the JavaScript."""
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 0.5})

# This is the image pipeline (methods of OssImagePipeline)
    def get_media_requests(self, item, info):
        # This part runs without problems
        try:
            for image_url in item['image']:
                # yield SplashRequest(url=image_url, dont_process_response=True)
                yield Request(url=image_url)
        except Exception as e:
            pass

    def get_images(self, response, request, info):
        # While stepping through with breakpoints, this method is never executed
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(BytesIO(response.body))

        width, height = orig_image.size
        if width < self.min_width or height < self.min_height:
            logging.warning("Image too small (%dx%d < %dx%d)",
                            width, height, self.min_width, self.min_height)

        image, buf = self.convert_image(orig_image)
        yield path, image, buf

    def file_path(self, request, response=None, info=None):
        # The image path produced here is fine
        img_path = super(OssImagePipeline, self).file_path(request, response, info)
        image_name = img_path.rsplit('/', 1)[-1]  # keep only the file name
        self.image_list.append(image_name)
        if self.folder:
            image_name = os.path.join(self.folder, image_name)
        print(image_name)  # no problem here
        return image_name

    def item_completed(self, results, item, info):
        try:
            print(results, 1111111111111111111111111111111111111111111111111111111111)
            print(item, 22222222222222222222222222222222222222222222222222222222222)
            # I need to store the images on OSS, so the path has to be joined onto this base URL
            base_str = "https://xxx.oss-cn-xxx.aliyuncs.com/"

            # results carries the failure: after the image download, for some unknown
            # reason, every entry comes back as False, so image_path ends up empty
            image_path = [base_str + x['path'] for ok, x in results if ok]
            item['image_path'] = image_path
            # print("+" * 20, item['image_path'], "+" * 20)

        except Exception:
            raise DropItem("Item contains no images")
        else:
            return item
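
A side note on the path handling in file_path and item_completed above: os.path.join uses the local operating system's separator, so on Windows the key could contain backslashes, and base_str + path only works while base_str ends with a slash. Below is a minimal sketch of one way to build the URL more defensively. build_oss_url is a hypothetical helper name (not part of Scrapy or any OSS SDK), and the base URL is the placeholder from the post.

```python
import posixpath
from urllib.parse import quote


def build_oss_url(base_url, folder, image_name):
    """Join a base URL, an optional folder and a file name with forward slashes.

    Hypothetical helper for illustration; not part of Scrapy or an OSS SDK.
    """
    # posixpath.join always uses '/', regardless of the operating system.
    key = posixpath.join(folder, image_name) if folder else image_name
    # Percent-encode unsafe characters but keep the '/' separators.
    key = quote(key.lstrip('/'), safe='/')
    # Normalize the slash between the base URL and the key.
    return base_url.rstrip('/') + '/' + key


if __name__ == '__main__':
    print(build_oss_url('https://xxx.oss-cn-xxx.aliyuncs.com/', 'imgs', 'a b.jpg'))
```

This does not address the download failure itself; it only keeps the stored key and the final URL consistent across platforms.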

This is the exception that is reported:
2019-03-07 15:44:25 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading image from <GET http://qw.lixia.gov.cn/picture/0/s_c39d2751a63f4c969116e11dd8d5ffdb.jpg> referred in <None>: 'splash'
2019-03-07 15:44:25 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading image from <GET http://qw.lixia.gov.cn/picture/0/s_29d03c6e89bc413cba694a5e849ec1e3.jpg> referred in <None>: 'splash'
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >), (False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >)]
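
The printed results list above only shows that every entry is (False, Failure). A minimal sketch of how the entries could be separated and the wrapped exceptions summarized inside item_completed follows. split_results is a hypothetical helper name; the only library detail it relies on is that a Twisted Failure keeps the original exception in its value attribute.

```python
def split_results(results):
    """Separate (ok, value) pairs into stored paths and error descriptions."""
    paths, errors = [], []
    for ok, value in results:
        if ok:
            # Successful entries are dicts that include the stored 'path'.
            paths.append(value['path'])
        else:
            # A Twisted Failure keeps the original exception in `.value`;
            # fall back to the object itself for anything else.
            exc = getattr(value, 'value', value)
            errors.append('%s: %s' % (type(exc).__name__, exc))
    return paths, errors
```

In the log shown here the Failure wraps a FileException with an empty message, so the underlying text ('splash') appears only in the WARNING lines from scrapy.pipelines.files. The helper mainly makes it easy to log or count the failures per item instead of ending up with an empty list silently.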


1 answer

我是一只有宝贝的熊

Asker

2019-03-07

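When I used SplashRequest to download the images, I tried printing response.body and found that the code was never executed, whereas with Request it does execute but raises the error. My guess is that overriding start_requests changed the native Request objects that get returned. In this situation, how should I solve it?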

我是一只有宝贝的熊

bobby

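    def get_media_requests(self, item, info):
        try:
            for image_url in item['image']:
                # yield Request(image_url)
                yield SplashRequest(image_url, dont_process_response=True, endpoint='render.jpeg')
        except Exception:
            pass

I overrode this method. My reasoning was that since start_requests uses SplashRequest, the upload failed because the pipeline was still using the native Request. That guess turned out to be correct, but the downloaded images have a problem: each one has a white border and its size is fixed at 1024x768. I don't understand what happens internally, and stepping through with breakpoints did not reveal it.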
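
A possible explanation for the fixed size, offered as an assumption rather than a confirmed diagnosis: Splash's render.jpeg endpoint returns a screenshot of the rendered page, not the original image file, and Splash's default viewport is 1024x768. The sketch below only builds the meta['splash'] dictionary that scrapy-splash reads from a request, so the viewport can be varied to test that assumption. jpeg_screenshot_meta is a hypothetical helper name.

```python
def jpeg_screenshot_meta(viewport='1024x768', wait=0.5, quality=None):
    """Build a request.meta['splash'] dict for Splash's render.jpeg endpoint.

    Hypothetical helper; the keys follow the scrapy-splash 'splash' meta format.
    """
    # 'viewport' is "<width>x<height>"; Splash defaults to 1024x768.
    args = {'wait': wait, 'viewport': viewport}
    if quality is not None:
        # Optional JPEG quality argument accepted by render.jpeg.
        args['quality'] = quality
    return {
        'endpoint': 'render.jpeg',
        'args': args,
        'dont_process_response': True,
    }
```

For example, yielding Request(image_url, meta={'splash': jpeg_screenshot_meta(viewport='800x600')}) from get_media_requests should produce 800x600 files if the assumption holds. The result would still be a screenshot of the page showing the image, so this is a way to check the cause, not a way to obtain the original file.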

2019-03-26

2 replies in total
