Asking teacher bobby for help: still the callback problem

Source: 6-17 Implementing the Zhihu spider crawling logic and extracting answers - 1

夜色小闪

2021-08-03

Crawling the Zhihu hot list
Source code:

import re
import json
import datetime

try:
    import urlparse as parse  # Python 2
except ImportError:
    from urllib import parse  # Python 3

import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader
from ArticleSpider.items import ZhihuQuestionItem, ZhihuAnswerItem
from ArticleSpider.utils import zhihu_login_sel
from ArticleSpider.settings import USER, PASSWORD

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com/hot']
    start_urls = ['https://www.zhihu.com/hot']

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhizhu.com",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # A simulated login here is enough to get the cookies
        # Two options for recognizing the slider captcha: 1. OpenCV  2. a machine-learning approach
        l = zhihu_login_sel.Login(USER, PASSWORD, 6)
        cookie_dict = l.login()
        for url in self.start_urls:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)

    def parse(self, response, **kwargs):
        post_nodes = response.css('.HotItem .HotItem-content')[:1]
        for post_node in post_nodes:
            post_url = post_node.css('a::attr(href)').extract_first("")
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), headers=self.headers,
                                 callback=self.parse_question)

    def parse_question(self, response):
        # Parse the question page and extract a concrete question item from it
        match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if not match_obj:
            return
        question_id = int(match_obj.group(2))
        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
        item_loader.add_css("title", "h1.QuestionHeader-title::text")
        item_loader.add_css("content", "")
        item_loader.add_value("url", response.url)
        item_loader.add_value("zhihu_id", question_id)
        item_loader.add_css("answer_num", ".List-headerText span::text")
        item_loader.add_css("comments_num", ".QuestionHeader-Comment button::text")
        item_loader.add_css("watch_user_num", ".NumberBoard-itemValue::text")
        item_loader.add_css("topics", ".QuestionHeader-topics .Popover::text")

        question_item = item_loader.load_item()
        yield question_item

I set a breakpoint at def parse_question(self, response) and found that the callback never reaches the parsing function, even though the headers were already defined and configured earlier. What is going on here?
2021-08-03 15:34:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/hot> (referer: None)
2021-08-03 15:34:21 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zhihu.com': <GET https://www.zhihu.com/question/476619729>
2021-08-03 15:34:21 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-03 15:34:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3017,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 64901,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 57.679837,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 3, 7, 34, 21, 366648),
 'httpcompression/response_bytes': 282686,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 293,
 'log_count/INFO': 13,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 8, 3, 7, 33, 23, 686811)}
2021-08-03 15:34:21 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0


1 Answer

bobby

2021-08-04

From the log it looks like this request is being filtered out. You can try adding the dont_filter parameter, set to True, when you yield the request.
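
A minimal sketch of that fix applied to the parse method from the question; the only new piece is the dont_filter keyword:

    def parse(self, response, **kwargs):
        post_nodes = response.css('.HotItem .HotItem-content')[:1]
        for post_node in post_nodes:
            post_url = post_node.css('a::attr(href)').extract_first("")
            # dont_filter=True skips the dupe filter and also lets the request
            # pass the offsite middleware, so parse_question gets called
            yield scrapy.Request(url=parse.urljoin(response.url, post_url),
                                 headers=self.headers,
                                 callback=self.parse_question,
                                 dont_filter=True)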

夜色小闪
Thanks, teacher bobby. Adding the dont_filter=True parameter solved it.
2021-08-04
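
For reference, the log line "Filtered offsite request to 'www.zhihu.com'" comes from the offsite middleware rather than the dupe filter; dont_filter=True happens to bypass both. The underlying cause appears to be that allowed_domains holds a URL with a path instead of a bare domain, so every /question/ URL looks offsite. A minimal sketch of that alternative fix, assuming nothing else in the spider changes:

    class ZhihuSpider(scrapy.Spider):
        name = 'zhihu'
        # allowed_domains must contain bare domains; with 'www.zhihu.com/hot'
        # the offsite middleware treats /question/... requests as offsite
        allowed_domains = ['www.zhihu.com']
        start_urls = ['https://www.zhihu.com/hot']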
