Zhihu requests return 400
Source: 6-14 Extracting question with the Item Loader approach - 1
了不起的水獭
2018-07-14
Teacher, my requests to Zhihu always come back with a 400. I've gone over the code but can't find the problem. Could you help take a look?
# -*- coding: utf-8 -*-
import scrapy

try:
    import urlparse as parse
except ImportError:
    from urllib import parse


class ZhihuSelSpider(scrapy.Spider):
    name = 'zhihu_sel'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/']
    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User_Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }

    def parse(self, response):
        """
        Extract all URLs from the HTML page and follow them for further crawling.
        If a crawled URL matches the pattern /question/xxx, download and parse it directly.
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        for url in all_urls:
            pass

    def start_requests(self):
        # return [scrapy.Request('https://www.zhihu.com/signin', headers=self.headers, callback=self.login)]
        # def login(self, response):
        from selenium import webdriver
        browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")
        browser.get("https://www.zhihu.com/signin")
        browser.find_element_by_css_selector('.Login-content input[name="username"]').send_keys("127281031@qq.com")
        browser.find_element_by_css_selector(".Input-wrapper input[name='password']").send_keys("9544123")
        browser.find_element_by_css_selector("button.SignFlow-submitButton").click()
        import time
        time.sleep(10)
        Cookies = browser.get_cookies()
        print(Cookies)
        cookie_dict = {}
        import pickle
        for cookie in Cookies:
            # write each cookie to a file
            with open(r'C:\Mycode\ArticleSpider\cookies\zhihu' + cookie['name'] + '.zhihu', 'wb') as f:
                pickle.dump(cookie, f)
            cookie_dict[cookie['name']] = cookie['value']
        browser.close()
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, headers=self.headers, cookies=cookie_dict)]

The 400 error:
2018-07-14 20:58:40 [scrapy.core.engine] INFO: Spider opened
2018-07-14 20:58:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 20:58:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-14 20:58:45 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com)
2018-07-14 20:58:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed
2018-07-14 20:58:45 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-14 20:58:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 393,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 517,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 14, 12, 58, 45, 921529),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/400': 1,
'log_count/DEBUG': 32,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 7, 14, 12, 58, 40, 498900)}
2018-07-14 20:58:45 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
1 Answer
-
了不起的水獭
(Original poster)
2018-07-14
Teacher, I found it. I wrote the User-Agent header as User_Agent.
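For anyone hitting the same 400, the fix is just the hyphen in the standard header name; a minimal sketch of the corrected headers dict (the underscore check is an illustrative extra, not part of the original code):

```python
# HTTP header names use hyphens; "User_Agent" is not recognized as a
# User-Agent header, so Zhihu rejects the request with a 400.
headers = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/67.0.3396.99 Safari/537.36",
}

# A quick sanity check that catches the underscore typo before sending:
bad = [name for name in headers if "_" in name]
assert not bad, "suspicious header names: %s" % bad
```

Passing this dict as `headers=self.headers` in `scrapy.Request` should then send a proper User-Agent.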