When debugging, the breakpoint set inside the parse_job() function is never hit; the crawl just runs to completion and exits.
Source: 7-4 Using Rule and LinkExtractor
慕用5281994
2018-05-19
The console output ends with the stats below (before this there were many page lines such as 2018-05-19 21:49:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=113.99.220.141> from <GET https://www.lagou.com/zhaopin/CTO/>, including some for jobs pages):
2018-05-19 21:49:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 131331,
'downloader/request_count': 329,
'downloader/request_method_count/GET': 329,
'downloader/response_bytes': 189515,
'downloader/response_count': 329,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 325,
'dupefilter/filtered': 323,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 19, 13, 49, 44, 206308),
'log_count/DEBUG': 331,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 4,
'scheduler/dequeued': 327,
'scheduler/dequeued/memory': 327,
'scheduler/enqueued': 327,
'scheduler/enqueued/memory': 327,
'start_time': datetime.datetime(2018, 5, 19, 13, 49, 28, 666419)}
2018-05-19 21:49:44 [scrapy.core.engine] INFO: Spider closed (finished)
2 Answers
-
You are getting 302 redirects. You can log in first with Selenium, then crawl again with the resulting cookies:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import time
import pickle
import datetime
import sys
import io


class LagouSpider(CrawlSpider):
    name = 'lagou_sel'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Referer": 'https://www.lagou.com',
        "HOST": "www.lagou.com"
    }
    # Keep cookies enabled so the login cookies seeded below are re-sent on follow-up requests.
    custom_settings = {
        "COOKIES_ENABLED": True
    }
    rules = (
        Rule(LinkExtractor(allow=r'gongsi/j/\d+.html'), follow=True),
        Rule(LinkExtractor(allow=r'zhaopin/.*'), follow=True),
        # parse_job, the callback of this Rule, is defined in the questioner's own spider and omitted here.
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    def parse_item(self, response):
        pass

    def start_requests(self):
        from selenium import webdriver
        # Work around the Windows console encoding when printing Chinese text.
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
        chrome_opt = webdriver.ChromeOptions()
        # Disable image loading to speed up the login page.
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_opt.add_experimental_option("prefs", prefs)
        browser = webdriver.Chrome(executable_path="E:/tmp/chromedriver.exe", chrome_options=chrome_opt)
        browser.get("https://passport.lagou.com/login/login.html?service=https%3a%2f%2fwww.lagou.com%2f")
        browser.find_elements_by_css_selector(".input.input_white")[0].send_keys("xxx")  # username
        browser.find_elements_by_css_selector(".input.input_white")[1].send_keys("xx")   # password
        # browser.find_element_by_xpath("/html/body/section/div[1]/div[2]/form/div[2]/input").send_keys(password)
        browser.find_element_by_css_selector(".btn.btn_green.btn_active.btn_block.btn_lg").click()
        # Wait for the login (and any verification step) to finish.
        time.sleep(10)
        Cookies = browser.get_cookies()
        cookie_dict = {}
        for cookie in Cookies:
            # Persist each cookie to disk and collect name/value pairs for Scrapy.
            f = open('H:/慕课网课程/python爬虫/课程源码最终版/ArticleSpider/cookies123' + cookie['name'] + '.lagou', 'wb')
            pickle.dump(cookie, f)
            f.close()
            cookie_dict[cookie['name']] = cookie['value']
        browser.close()
        # Hand the logged-in cookies to the first Scrapy request.
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
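The answer above uses the Selenium 3 API. On Selenium 4 and newer, the find_element(s)_by_* helpers and the executable_path/chrome_options keyword arguments have been removed, so the login step needs the By-based calls instead. A minimal sketch of just that part, reusing the selectors and the chromedriver path from the code above (treat both as placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

chrome_opt = webdriver.ChromeOptions()
# Same "skip images" preference as in the answer above.
chrome_opt.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
browser = webdriver.Chrome(service=Service("E:/tmp/chromedriver.exe"), options=chrome_opt)
browser.get("https://passport.lagou.com/login/login.html?service=https%3a%2f%2fwww.lagou.com%2f")
inputs = browser.find_elements(By.CSS_SELECTOR, ".input.input_white")
inputs[0].send_keys("xxx")  # username placeholder
inputs[1].send_keys("xx")   # password placeholder
browser.find_element(By.CSS_SELECTOR, ".btn.btn_green.btn_active.btn_block.btn_lg").click()

Note also that "COOKIES_ENABLED": True in custom_settings matters here: the cookies handed to that single seeded Request are kept by Scrapy's cookie middleware and re-sent on the requests generated by the rules, which is what keeps the rest of the crawl logged in.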
2018-08-15
-
qq_AGGRESSIVE_0
2018-05-20
I ran into this too. You can add a print() inside parse_job() to verify whether the method is actually being entered.
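A minimal sketch of that check, assuming parse_job is the Rule callback defined in your own spider:

def parse_job(self, response):
    # Temporary debugging aid: if this line never prints, the callback is
    # not being reached at all (e.g. every request is 302-redirected to the
    # login page before a jobs/ URL is ever downloaded).
    print("parse_job reached:", response.url)
    # ... the actual item parsing goes here ...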