Hello teacher, I can't get the debugger to reach parse_detail. How should I fix this?

Source: 4-10 Writing the spider to complete the crawl - 2

朗月清风

2021-09-16

from urllib import parse

import scrapy
from scrapy import Request


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['news.cnblogs.com/']
    start_urls = ['http://news.cnblogs.com/']

    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # Log in manually to obtain the cookies; a plain selenium-driven
        # browser gets detected by the site, so use undetected_chromedriver.
        import undetected_chromedriver.v2 as uc
        browser = uc.Chrome()
        browser.get('https://account.cnblogs.com/signin')
        input("Press Enter to continue: ")
        cookies = browser.get_cookies()
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']

        for url in self.start_urls:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)

    def parse(self, response):
        post_nodes = response.css('#news_list .news_block')
        for post_node in post_nodes:
            image_url = post_node.xpath('//div[@class="entry_summary"]//a/img/@src').extract_first('')
            post_url = post_node.xpath("//h2/a/@href").extract_first('')
            yield Request(url=parse.urljoin(response.url, post_url),
                          meta={"front_image_url": image_url},
                          callback=self.parse_detail)

    def parse_detail(self, response):
        pass

Error log (I can see the 302s):

Press Enter to continue:
2021-09-16 15:36:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:60723/session/15b4957016393e58d217442c2a6506fa/cookie {}
2021-09-16 15:36:05 [urllib3.connectionpool] DEBUG: http://localhost:60723 "GET /session/15b4957016393e58d217442c2a6506fa/cookie HTTP/1.1" 200 2077
2021-09-16 15:36:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-09-16 15:36:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.cnblogs.com/robots.txt> from <GET http://news.cnblogs.com/robots.txt>
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://account.cnblogs.com:443/signin?ReturnUrl=https%3A%2F%2Fnews.cnblogs.com%2Frobots.txt> from <GET https://news.cnblogs.com/robots.txt>
2021-09-16 15:36:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://account.cnblogs.com:443/signin?ReturnUrl=https%3A%2F%2Fnews.cnblogs.com%2Frobots.txt> (referer: None)
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-09-16 15:36:05 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2021-09-16 15:36:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.cnblogs.com/> from <GET http://news.cnblogs.com/>
2021-09-16 15:36:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.cnblogs.com/> (referer: None)
2021-09-16 15:36:06 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'news.cnblogs.com': <GET https://news.cnblogs.com/n/702398/>
2021-09-16 15:36:06 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-16 15:36:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3883,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 18940,
'downloader/response_count': 5,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 3,
'elapsed_time_seconds': 91.526641,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 9, 16, 7, 36, 6, 349190),
'httpcompression/response_bytes': 83003,
'httpcompression/response_count': 2,
'log_count/DEBUG': 24,
'log_count/INFO': 12,
'memusage/max': 60391424,
'memusage/startup': 52572160,
'offsite/domains': 1,
'offsite/filtered': 30,
'request_depth_max': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 9, 16, 7, 34, 34, 822549)}
2021-09-16 15:36:06 [scrapy.core.engine] INFO: Spider closed (finished)
2021-09-16 15:36:06 [uc] DEBUG: closing webdriver
2021-09-16 15:36:06 [uc] DEBUG: killing browser
2021-09-16 15:36:06 [uc] DEBUG: removing profile : /var/folders/9b/mkqz4d4d63n9_l2rr78d7fdc0000gn/T/tmpu4au6z3z

Process finished with exit code 0

1 Answer

朗月清风

Original poster

2021-09-16

I saw fellow student szuxxy's answer, and it solved this problem.

I noticed: DEBUG: Filtered offsite request to 'news.cnblogs.com':
After looking it up, adding dont_filter=True fixed it, i.e.:
  yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url},
                callback=self.parse_detail, dont_filter=True)
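(For reference: dont_filter=True works here because it bypasses the OffsiteMiddleware check entirely. Judging from the log, the filtering is likely triggered by the trailing slash in allowed_domains = ['news.cnblogs.com/']: Scrapy matches the request's hostname against the allowed domains, and the hostname 'news.cnblogs.com' never matches 'news.cnblogs.com/'. A simplified sketch of that hostname check, not Scrapy's actual code:)

```python
import re
from urllib.parse import urlparse

def url_is_allowed(url, allowed_domains):
    # Simplified version of the check Scrapy's OffsiteMiddleware
    # performs: the request host must equal an allowed domain or
    # be a subdomain of one.
    host = urlparse(url).hostname or ''
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None

detail_url = 'https://news.cnblogs.com/n/702398/'
print(url_is_allowed(detail_url, ['news.cnblogs.com/']))  # False -> filtered as offsite
print(url_is_allowed(detail_url, ['news.cnblogs.com']))   # True  -> request goes through
```

(So removing the trailing slash from allowed_domains would also fix this, without turning the duplicate/offsite filtering off for the detail requests.)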


bobby
OK.
2021-09-17

From the course: Scrapy打造搜索引擎 (Building a Search Engine with Scrapy, a Python distributed crawler course)
