Can't get into the parse_question function

Source: 6-14 Extracting the question with the item loader approach - 1

weixin_慕勒4383646

2020-06-07

Bobby,
My parse function code is below. I have tried sending headers, adding a referer to the headers, and passing cookies along, but the spider still never reaches parse_question. The log output is:
2020-06-07 13:18:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com> (referer: None)
2020-06-07 13:18:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zhihu.com': <GET https://www.zhihu.com/question/387800632>
2020-06-07 13:18:03 [scrapy.core.engine] INFO: Closing spider (finished)
Code:

import re
from urllib.parse import urljoin

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['news.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
    }

    def parse(self, response):
        cookie_dict = response.meta.get("cookies", "")
        # collect every link on the page and normalize relative URLs
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [urljoin(response.url, url) for url in all_urls]
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            # only follow question pages, e.g. https://www.zhihu.com/question/387800632
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                question_id = match_obj.group(2)

                self.headers["referer"] = url

                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question, cookies=cookie_dict)
                # a = requests.get(request_url, headers=self.headers, cookies=cookie_dict).text
                # a.encode("utf-8")

    def parse_question(self, response):
        pass


1 Answer

bobby

2020-06-09

//img.mukewang.com/szimg/5edefcc1097eaea608860535.jpg
Add dont_filter=True to the parameters here (see the screenshot) and then try again.
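For reference, a minimal sketch of the yield line from the code above with that suggestion applied. dont_filter=True also tells Scrapy's OffsiteMiddleware to let the request through, which is why it clears the "Filtered offsite request" message shown in the log:

                # the yield from parse() with the suggested dont_filter=True added;
                # requests marked dont_filter=True are not dropped by the offsite filter
                yield scrapy.Request(request_url,
                                     headers=self.headers,
                                     callback=self.parse_question,
                                     cookies=cookie_dict,
                                     dont_filter=True)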

bobby replied to 风暴洋:
OK.
2020-09-10
