Cannot enter the parse_question function
Source: 6-14 Extracting question with item loader - 1
weixin_慕勒4383646
2020-06-07
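Teacher Boby:
My parse function is shown below. I have tried sending headers, adding a referer to the headers, and passing cookies, but the spider still never enters parse_question. The log shows: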
2020-06-07 13:18:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com> (referer: None)
2020-06-07 13:18:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zhihu.com': <GET https://www.zhihu.com/question/387800632>
2020-06-07 13:18:03 [scrapy.core.engine] INFO: Closing spider (finished)
Code:
class JobboleSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['news.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
    }

    def parse(self, response):
        cookie_dict = response.meta.get("cookies","")
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [urljoin(response.url,url) for url in all_urls]
        all_urls = filter(lambda x:True if x.startswith("https") else False,all_urls)
        for url in all_urls:
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*",url)
            if match_obj:
                request_url = match_obj.group(1)
                question_id = match_obj.group(2)
                self.headers["referer"] = url
                yield scrapy.Request(request_url,headers=self.headers,callback=self.parse_question,cookies=cookie_dict)
                # a = requests.get(request_url,headers = self.headers,cookies = cookie_dict).text
                # a.encode("utf-8")

    def parse_question(self,response):
        pass
1 answer
Add dont_filter=True to the arguments of this scrapy.Request call and try again.
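Some background on why this helps: the "Filtered offsite request" line in the log comes from Scrapy's offsite middleware, which drops requests whose host is not covered by allowed_domains. The spider allows only news.zhihu.com, while the question links point to www.zhihu.com. The snippet below is a rough stdlib-only approximation of that host check, not Scrapy's own code; the function name host_allowed is made up for illustration.

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Rough approximation of Scrapy's offsite check (hypothetical helper)."""
    # A host passes if it equals an allowed domain or is a subdomain of one.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ""
    return re.search(pattern, host) is not None


question_url = "https://www.zhihu.com/question/387800632"

# The spider's current setting: only news.zhihu.com and its subdomains pass.
print(host_allowed(question_url, ["news.zhihu.com"]))  # False
# Widening the setting to the parent domain lets www.zhihu.com through.
print(host_allowed(question_url, ["zhihu.com"]))  # True
```

So there appear to be two ways out: pass dont_filter=True to scrapy.Request as suggested above, or change allowed_domains to ['zhihu.com']. Note that dont_filter=True also turns off Scrapy's duplicate-request filter for that request, so fixing allowed_domains is probably the cleaner long-term choice.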
2020-09-10