Spider terminated early without crawling all the data
Source: 4-17 Writing items data to a JSON file
真鱻
2020-10-19
Hello teacher,
While using my spider to crawl the literature in my field, it stopped after scraping only 1873 papers, yet finish_reason shows "finished". In theory it should have crawled roughly 10,000 papers. The same spider worked fine earlier when crawling smaller numbers of papers. What could be causing it to terminate early? Thank you!
Below are the spider's stats:
2020-10-19 20:47:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4081262,
 'downloader/request_count': 2065,
 'downloader/request_method_count/GET': 2065,
 'downloader/response_bytes': 51026952,
 'downloader/response_count': 2065,
 'downloader/response_status_count/200': 2063,
 'downloader/response_status_count/500': 2,
 'dupefilter/filtered': 27,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 19, 12, 47, 44, 177850),
 'item_scraped_count': 1873,
 'log_count/DEBUG': 3962,
 'log_count/INFO': 13,
 'request_depth_max': 190,
 'response_received_count': 2063,
 'retry/count': 2,
 'retry/reason_count/500 Internal Server Error': 2,
 'scheduler/dequeued': 2066,
 'scheduler/dequeued/memory': 2066,
 'scheduler/enqueued': 2066,
 'scheduler/enqueued/memory': 2066,
 'start_time': datetime.datetime(2020, 10, 19, 12, 42, 34, 202197)}
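
A note on reading these stats: scheduler/enqueued and scheduler/dequeued are both 2066, so the request queue genuinely ran dry and the run ended normally (finish_reason "finished") rather than crashing; meanwhile 27 requests were silently dropped by the dupefilter and 2 hit 500 errors. Below is a minimal diagnostic sketch, assuming Scrapy; the spider name, URL, and callback names are hypothetical placeholders. It makes both failure modes visible in the log:

import scrapy

class DiagnosticSpider(scrapy.Spider):
    name = "diagnostic"
    custom_settings = {
        "RETRY_TIMES": 5,          # default is 2; the stats above show both 500s were retried
        "DUPEFILTER_DEBUG": True,  # log every request the dupefilter drops, not just the first
    }

    def start_requests(self):
        # hypothetical placeholder URL; substitute the real list-page entry point
        yield scrapy.Request(
            "https://example.com/list?page=1",
            callback=self.parse,
            errback=self.on_error,
        )

    def parse(self, response):
        # item extraction omitted; this sketch only demonstrates the diagnostics
        pass

    def on_error(self, failure):
        # any request that exhausts its retries (e.g., repeated 500s) lands here
        self.logger.error("Request failed: %s (%r)", failure.request.url, failure.value)

With DUPEFILTER_DEBUG on, each of the 27 filtered requests would be logged with its URL, which is usually enough to see whether pages you expected to visit were being dropped as duplicates.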
1 Answer
Check whether a loop in your crawl is leaving some of the data unreachable. Are you accessing it through list pages?
2020-11-13
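
To illustrate the suggestion above: in a list-page crawl, the run ends with finish_reason "finished" as soon as no new requests are yielded, and a "next page" link that resolves to an already-seen URL is silently dropped by the dupefilter. A hedged sketch of list-page pagination, assuming Scrapy (the selectors and URLs are hypothetical):

import scrapy

class ListPageSpider(scrapy.Spider):
    name = "listpage"
    start_urls = ["https://example.com/papers?page=1"]

    def parse(self, response):
        # follow each paper's detail page from the current list page
        for href in response.css("a.paper-title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_paper)

        # follow the next list page, logging it so loops or early stops are visible
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            self.logger.info("Next list page: %s", response.urljoin(next_page))
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info("No next link on %s; pagination ends here", response.url)

    def parse_paper(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}

The stats above show request_depth_max of 190; if each list page links roughly 10 papers, 190 pages would yield about 1900 items, consistent with the 1873 scraped. So a pagination cap or a "next" link that loops back around page 190 is a plausible culprit; if the log shows the last list page visited before termination, compare it against the site in a browser.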