Asking teacher bobby for help — it's the callback problem again
Source: 6-17 Implementing the Zhihu spider crawl logic and extracting answers - 1
夜色小闪
2021-08-03
Crawling the Zhihu hot list
Source code:
import re
import json
import datetime
try:
    import urlparse as parse  # Python 2
except ImportError:
    from urllib import parse  # Python 3
import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader
from ArticleSpider.items import ZhihuQuestionItem, ZhihuAnswerItem
from ArticleSpider.utils import zhihu_login_sel
from ArticleSpider.settings import USER, PASSWORD
class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com/hot']
    start_urls = ['https://www.zhihu.com/hot']
    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # Simulating the login here to get the cookies is enough
        # Two ways to crack the slider captcha: 1. OpenCV  2. machine learning
        l = zhihu_login_sel.Login(USER, PASSWORD, 6)
        cookie_dict = l.login()
        for url in self.start_urls:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)

    def parse(self, response, **kwargs):
        post_nodes = response.css('.HotItem .HotItem-content')[:1]
        for post_node in post_nodes:
            post_url = post_node.css('a::attr(href)').extract_first("")
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), headers=self.headers,
                                 callback=self.parse_question)

    def parse_question(self, response):
        # Handle the question page and extract a concrete question item from it
        match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if match_obj:
            question_id = int(match_obj.group(2))
            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            item_loader.add_css("title", "h1.QuestionHeader-title::text")
            item_loader.add_css("content", "")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", ".List-headerText span::text")
            item_loader.add_css("comments_num", ".QuestionHeader-Comment button::text")
            item_loader.add_css("watch_user_num", ".NumberBoard-itemValue::text")
            item_loader.add_css("topics", ".QuestionHeader-topics .Popover::text")
            question_item = item_loader.load_item()
        pass
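As a side note, the question-id regex used in parse_question can be checked on its own, outside the spider (the sample URL below is the one that appears in the log):

```python
import re

# Same pattern as in parse_question: group 2 captures the numeric question id.
pattern = r"(.*zhihu.com/question/(\d+))(/|$).*"

match_obj = re.match(pattern, "https://www.zhihu.com/question/476619729")
if match_obj:
    print(int(match_obj.group(2)))  # 476619729
```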
I set a breakpoint at def parse_question(self, response) and found the spider never calls back into the parse function. The headers were already defined and configured earlier — what is causing this? Log output:
2021-08-03 15:34:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/hot> (referer: None)
2021-08-03 15:34:21 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zhihu.com': <GET https://www.zhihu.com/question/476619729>
2021-08-03 15:34:21 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-03 15:34:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3017,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 64901,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 57.679837,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 3, 7, 34, 21, 366648),
 'httpcompression/response_bytes': 282686,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 293,
 'log_count/INFO': 13,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 8, 3, 7, 33, 23, 686811)}
2021-08-03 15:34:21 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
1 answer
-
Judging by the log, that request was filtered out — the offsite middleware dropped it before it reached your callback. Try passing dont_filter=True when you yield the request.
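The log line "Filtered offsite request" comes from Scrapy's OffsiteMiddleware, which compares each request's hostname against the spider's allowed_domains. A minimal stdlib sketch of roughly what that check does (an approximation for illustration, not Scrapy's actual code):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Rough approximation of Scrapy's offsite check: the request's hostname
    # must equal an allowed domain, or be a subdomain of one.
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# 'www.zhihu.com/hot' contains a path, so no hostname can ever match it:
print(is_offsite("https://www.zhihu.com/question/476619729", ["www.zhihu.com/hot"]))  # True -> filtered
print(is_offsite("https://www.zhihu.com/question/476619729", ["www.zhihu.com"]))      # False -> allowed
```

This is why the spider above filters the question URL: allowed_domains entries are supposed to be bare domains (e.g. 'www.zhihu.com'), not URLs with a path. Setting dont_filter=True also works, since the offsite middleware skips requests flagged that way.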
2021-08-04