Can't get into the parse_question function

Source: 6-14 Extracting the question with the item loader approach - 1

weixin_慕勒4383646

2020-06-07

Bobby,
My parse function code is below. I have tried sending headers, adding a referer to the headers, and passing cookies along, but the spider still never reaches parse_question. The log output is:
2020-06-07 13:18:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com> (referer: None)
2020-06-07 13:18:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zhihu.com': <GET https://www.zhihu.com/question/387800632>
2020-06-07 13:18:03 [scrapy.core.engine] INFO: Closing spider (finished)
Code:

import re
from urllib.parse import urljoin

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['news.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
    }

    def parse(self, response):
        cookie_dict = response.meta.get("cookies", "")
        # collect every link on the page and normalize relative URLs
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [urljoin(response.url, url) for url in all_urls]
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            # only follow question pages, e.g. https://www.zhihu.com/question/387800632
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                question_id = match_obj.group(2)

                self.headers["referer"] = url

                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question, cookies=cookie_dict)
                # a = requests.get(request_url, headers=self.headers, cookies=cookie_dict).text
                # a.encode("utf-8")

    def parse_question(self, response):
        pass


1 Answer

bobby

2020-06-09

//img.mukewang.com/szimg/5edefcc1097eaea608860535.jpg
Add dont_filter=True to the parameters here (see the screenshot) and then try again.
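For reference, a minimal sketch of the yield line from the code above with that suggestion applied. dont_filter=True also tells Scrapy's OffsiteMiddleware to let the request through, which is why it clears the "Filtered offsite request" message shown in the log:

                # the yield from parse() with the suggested dont_filter=True added;
                # requests marked dont_filter=True are not dropped by the offsite filter
                yield scrapy.Request(request_url,
                                     headers=self.headers,
                                     callback=self.parse_question,
                                     cookies=cookie_dict,
                                     dont_filter=True)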

bobby replied to 风暴洋:
OK.
2020-09-10
