Using a bloom filter produces a flood of "Connection closed by server" errors

Source: 13-3 Implementing incremental crawling by modifying scrapy-redis (part 2)

ZimSeraphim

2020-05-04

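When I connect to a local Redis service there are no problems at all, but as soon as I connect to Redis on a remote server I get a flood of "Connection closed by server" errors. Crawling also becomes extremely slow (roughly 1% of the local speed), and a large number of pages are crawled repeatedly. If I stop using the bloom filter and go back to scrapy-redis's original implementation, the problem disappears. So I have two questions for the instructor:

  1. Is there a way to fix this error so that I can keep using the bloom filter while connected to Redis on the server?
  2. What exactly is the advantage of the bloom filter over the original implementation? Is it worth the large cost in memory and efficiency?

Error message: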

2020-05-04 15:44:20 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\twisted\internet\task.py", line 517, in _oneWorkUnit
    result = next(self._iterator)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\scrapy\utils\defer.py", line 74, in
    work = (callable(elem, *args, **named) for elem in iterable)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\scrapy\core\scraper.py", line 193, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\scrapy\core\engine.py", line 216, in crawl
    self.schedule(request, spider)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\scrapy\core\engine.py", line 222, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "C:\Users\airan\Desktop\中间件\spider-distributed\scrapy_APP\scrapy_redis\scheduler.py", line 163, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "C:\Users\airan\Desktop\中间件\spider-distributed\scrapy_APP\scrapy_redis\dupefilter.py", line 107, in request_seen
    self.bf.add(fp)
  File "C:\Users\airan\Desktop\中间件\spider-distributed\scrapy_APP\utils\bloomfilter.py", line 35, in add
    self.redis.setbit(name, hash, 1)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\client.py", line 1777, in setbit
    return self.execute_command('SETBIT', name, offset, value)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\client.py", line 878, in execute_command
    return self.parse_response(conn, command_name, **options)
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\client.py", line 892, in parse_response
    response = connection.read_response()
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\connection.py", line 734, in read_response
    response = self._parser.read_response()
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\connection.py", line 316, in read_response
    response = self._buffer.readline()
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\connection.py", line 248, in readline
    self._read_from_socket()
  File "C:\Users\airan\Anaconda3\envs\scrapy-redis\lib\site-packages\redis\connection.py", line 193, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
2020-05-04 15:44:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost/4> (referer: http://localhost/2)

1 answer

bobby

2020-05-06

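  1. Lost connections: first determine whether this is caused by Redis being deployed on an external server. You can write a Python script that calls Redis in a continuous loop and see whether the connection is stable.

  2. The advantages of the bloom filter over the original implementation were explained in the course: it exists for deduplication at very large scale. If your volume of URLs is not large, don't use a bloom filter.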
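
The stability check suggested in point 1 could look roughly like the sketch below. It is only an illustration: the function name `probe`, its parameters, and the host name in the usage comment are made up for this example, and the function accepts any object with a `ping()` method, so it can be tried without a server.

```python
import time


def probe(client, rounds=1000, pause=0.0):
    """Call client.ping() `rounds` times and summarize failures and latency.

    `client` is anything with a ping() method, for example redis.Redis(...).
    """
    failures = []   # (iteration index, exception class name)
    latencies = []  # seconds per successful round trip
    for i in range(rounds):
        start = time.perf_counter()
        try:
            client.ping()
        except Exception as exc:  # with redis-py this would typically be a ConnectionError
            failures.append((i, type(exc).__name__))
        else:
            latencies.append(time.perf_counter() - start)
        if pause:
            time.sleep(pause)
    return {
        "rounds": rounds,
        "failures": len(failures),
        "first_failures": failures[:5],
        "avg_ms": 1000 * sum(latencies) / len(latencies) if latencies else None,
        "max_ms": 1000 * max(latencies) if latencies else None,
    }


# Usage (hypothetical host; needs the redis package and a reachable server):
#   import redis
#   client = redis.Redis(host="your.server.example", port=6379, socket_timeout=5)
#   print(probe(client, rounds=10000))
```

If this loop also fails or is very slow against the remote server, the problem is probably in the network or the Redis deployment rather than in the bloom filter code.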

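To make point 2 concrete, here is a rough estimate using the standard bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function names are made up for this example. The comparison assumes the original dupefilter stores one 40-character hex fingerprint per request in a Redis set, and it counts only the raw fingerprint bytes, ignoring Redis's per-entry overhead.

```python
import math


def bloom_parameters(n_items, false_positive_rate):
    """Return (bits, hash_count) for a bloom filter holding n_items."""
    m_bits = math.ceil(
        -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def set_payload_bytes(n_items, fingerprint_len=40):
    """Raw bytes of n_items fingerprints kept in a set (assumed 40 chars each)."""
    return n_items * fingerprint_len


if __name__ == "__main__":
    n = 100_000_000  # illustrative number of URLs
    bits, k = bloom_parameters(n, 0.001)
    print(f"bloom filter: {bits / 8 / 2**20:.0f} MiB, {k} hash functions")
    print(f"fingerprint set: at least {set_payload_bytes(n) / 2**20:.0f} MiB")
```

Note the trade-off: every check reads or sets k bits. If each bit is sent as a separate Redis command, that means k network round trips per request, which hardly matters locally but may add up against a remote server.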