Hello teacher, the debugger never reaches parse_detail. How can I fix this?
Source: 4-10 Writing the spider to complete the crawl - 2
朗月清风
2021-09-16
from urllib import parse

import scrapy
from scrapy import Request


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['news.cnblogs.com/']
    start_urls = ['http://news.cnblogs.com/']
    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # Simulate a login to obtain cookies; a plain selenium driver gets detected by the site.
        import undetected_chromedriver.v2 as uc
        browser = uc.Chrome()
        browser.get('https://account.cnblogs.com/signin')
        input("Press Enter to continue:")
        cookies = browser.get_cookies()
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)

    def parse(self, response):
        post_nodes = response.css('#news_list .news_block')
        for post_node in post_nodes:
            image_url = post_node.xpath('//div[@class="entry_summary"]//a/img/@src').extract_first('')
            post_url = post_node.xpath("//h2/a/@href").extract_first('')
            yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url}, callback=self.parse_detail)

    def parse_detail(self, response):
        pass
Error output (I noticed the 302 redirects):
Press Enter to continue:
2021-09-16 15:36:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:60723/session/15b4957016393e58d217442c2a6506fa/cookie {}
2021-09-16 15:36:05 [urllib3.connectionpool] DEBUG: http://localhost:60723 "GET /session/15b4957016393e58d217442c2a6506fa/cookie HTTP/1.1" 200 2077
2021-09-16 15:36:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-09-16 15:36:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.cnblogs.com/robots.txt> from <GET http://news.cnblogs.com/robots.txt>
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://account.cnblogs.com:443/signin?ReturnUrl=https%3A%2F%2Fnews.cnblogs.com%2Frobots.txt> from <GET https://news.cnblogs.com/robots.txt>
2021-09-16 15:36:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://account.cnblogs.com:443/signin?ReturnUrl=https%3A%2F%2Fnews.cnblogs.com%2Frobots.txt> (referer: None)
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.cnblogs.com/> from <GET http://news.cnblogs.com/>
2021-09-16 15:36:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.cnblogs.com/> (referer: None)
2021-09-16 15:36:06 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'news.cnblogs.com': <GET https://news.cnblogs.com/n/702398/>
2021-09-16 15:36:06 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-16 15:36:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3883,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 18940,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 3,
 'elapsed_time_seconds': 91.526641,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 16, 7, 36, 6, 349190),
 'httpcompression/response_bytes': 83003,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 24,
 'log_count/INFO': 12,
 'memusage/max': 60391424,
 'memusage/startup': 52572160,
 'offsite/domains': 1,
 'offsite/filtered': 30,
 'request_depth_max': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2021, 9, 16, 7, 34, 34, 822549)}
2021-09-16 15:36:06 [scrapy.core.engine] INFO: Spider closed (finished)
2021-09-16 15:36:06 [uc] DEBUG: closing webdriver
2021-09-16 15:36:06 [uc] DEBUG: killing browser
2021-09-16 15:36:06 [uc] DEBUG: removing profile : /var/folders/9b/mkqz4d4d63n9_l2rr78d7fdc0000gn/T/tmpu4au6z3z
Process finished with exit code 0
1 answer
朗月清风
Asker
2021-09-16
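After seeing the answer from fellow student szuxxy, I managed to solve this problem.
I noticed this line in the log: DEBUG: Filtered offsite request to 'news.cnblogs.com':
I looked it up, and adding dont_filter=True to the request fixed it. That is: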
yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url},
              callback=self.parse_detail, dont_filter=True)
2021-09-17
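A likely reason the offsite filter fires here is the trailing slash in allowed_domains = ['news.cnblogs.com/']. The sketch below is only an approximation of how Scrapy's offsite filter compares a request's host against allowed_domains; the helper names build_host_regex and is_allowed are made up for illustration and are not part of Scrapy. It assumes each entry is escaped and matched against the bare hostname, so a slash in the entry can never match.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Approximate the host pattern an offsite filter builds from allowed_domains."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    # Match the domain itself or any subdomain of it, anchored at both ends.
    return re.compile(rf"^(.*\.)?({escaped})$")


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname matches one of the allowed domains."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


detail_url = "https://news.cnblogs.com/n/702398/"

# Value from the question: the trailing slash becomes part of the pattern,
# but a hostname never contains a slash, so nothing matches.
print(is_allowed(detail_url, ["news.cnblogs.com/"]))  # False

# Bare domain name with no scheme, path, or slash.
print(is_allowed(detail_url, ["news.cnblogs.com"]))   # True
```

If that assumption holds, changing the setting to allowed_domains = ['news.cnblogs.com'] should let the detail requests through without dont_filter=True. Note that dont_filter=True also turns off duplicate-request filtering for that request, so fixing the domain entry is probably the narrower change.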