Teacher, when I run the program it never enters the parse method. Why?
Source: 6-13 Zhihu analysis and data table design - 2
三肥牛元气
2018-11-02
# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36"
    }

    def parse(self, response):
        # Collect every link on the page and resolve it against the page URL
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        # Keep only https links
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            print(url)
            match_obj = re.match(r"(.*zhihu\.com/question/(\d+))(/|$).*", url)
            if match_obj:
                # Question page: download it and hand it to the extraction callback
                # (parse_question is not shown in this post)
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
            else:
                # Not a question page: keep following the link
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)
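The link handling inside parse can be checked on its own, without Scrapy, which helps rule it out as the cause. The sketch below is only an illustration: the helper name `classify` and the sample hrefs are made up for this example, and the pattern is the one from the spider with the dot escaped.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the spider: group(1) is the question URL, group(2) the question id
QUESTION_RE = re.compile(r"(.*zhihu\.com/question/(\d+))(/|$).*")


def classify(base_url, hrefs):
    """Hypothetical helper: split hrefs into question pages and other https links."""
    questions, others = [], []
    for href in hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if not url.startswith("https"):
            continue  # same filter as the spider: https links only
        match_obj = QUESTION_RE.match(url)
        if match_obj:
            questions.append((match_obj.group(1), match_obj.group(2)))
        else:
            others.append(url)
    return questions, others


# Made-up sample hrefs, not taken from a real Zhihu page
sample = ["/question/12345/answer/678", "/explore", "http://example.com/page"]
questions, others = classify("https://www.zhihu.com/", sample)
print(questions)
print(others)
```

If this behaves as expected, the filtering and the regex are fine, and the problem is more likely in the request that is supposed to trigger parse in the first place.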
1 answer
bobby
2018-11-03
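Check the output in the console and see whether the request for this URL came back with a non-200 status code. That would be what keeps it from entering the parse method.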