Zhihu request returns 400

Source: 6-14 Extracting questions with the item loader approach - 1

了不起的水獭

2018-07-14

Teacher, I keep getting a 400 response when requesting Zhihu. I've gone over the code but can't find the problem. Could you take a look?

# -*- coding: utf-8 -*-
import scrapy
try:
    import urlparse as parse  # Python 2
except ImportError:
    from urllib import parse  # Python 3

class ZhihuSelSpider(scrapy.Spider):
    name = 'zhihu_sel'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/']

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User_Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }

    def parse(self, response):
        """
        提取html页面中的所有url,并跟踪这些url进行进一步爬取
        如果爬取的url中的格式为/question/xxx 就下载后直接进行解析
        """
        all_urls = response.css("a::attr(herf").extract()
        all_urls = [parse.urljoin(response.url,url) for url in all_urls]
        for url in all_urls:
            pass

    def start_requests(self):
        # return [scrapy.Request('https://www.zhihu.com/signin', headers=self.headers, callback=self.login)]

    # def login(self, response):
        # Log in through selenium first, then hand the session cookies to Scrapy.
        from selenium import webdriver
        browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")
        browser.get("https://www.zhihu.com/signin")

        browser.find_element_by_css_selector('.Login-content input[name="username"]').send_keys("127281031@qq.com")
        browser.find_element_by_css_selector(".Input-wrapper input[name='password']").send_keys("9544123")

        browser.find_element_by_css_selector("button.SignFlow-submitButton").click()
        import time
        time.sleep(10)  # wait for the post-login redirect to finish
        Cookies = browser.get_cookies()
        print(Cookies)
        cookie_dict = {}
        import pickle
        for cookie in Cookies:
            # Write each cookie to its own pickle file
            with open(r'C:\Mycode\ArticleSpider\cookies\zhihu' + cookie['name'] + '.zhihu', 'wb') as f:
                pickle.dump(cookie, f)
            cookie_dict[cookie['name']] = cookie['value']
        browser.close()
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, headers=self.headers, cookies=cookie_dict)]
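
As an aside: since each cookie is pickled to its own file above, a later run could skip the selenium login entirely by reading those files back. A minimal sketch, assuming the same directory layout as above (load_zhihu_cookies is a hypothetical helper, not part of the posted code):

import os
import pickle

def load_zhihu_cookies(cookie_dir=r'C:\Mycode\ArticleSpider\cookies'):
    # Rebuild the name -> value dict that scrapy.Request(cookies=...) expects
    # from the per-cookie pickle files written during the selenium login.
    cookie_dict = {}
    for filename in os.listdir(cookie_dir):
        if filename.endswith('.zhihu'):
            with open(os.path.join(cookie_dir, filename), 'rb') as f:
                cookie = pickle.load(f)
            cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict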

The 400 error log:

2018-07-14 20:58:40 [scrapy.core.engine] INFO: Spider opened
2018-07-14 20:58:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 20:58:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-14 20:58:45 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com)
2018-07-14 20:58:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed
2018-07-14 20:58:45 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-14 20:58:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 393,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 517,
 'downloader/response_count': 1,
 'downloader/response_status_count/400': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 14, 12, 58, 45, 921529),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/400': 1,
 'log_count/DEBUG': 32,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 14, 12, 58, 40, 498900)}
2018-07-14 20:58:45 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0



1 Answer

了不起的水獭 (original poster)

2018-07-14

Teacher, I figured it out: I had written User-Agent as User_Agent.

bobby
OK.
2018-07-16
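
For reference, the whole fix is the spelling of the header key. Scrapy sends dict keys verbatim as header names, so "User_Agent" went out as a literal User_Agent header, Zhihu never received a real User-Agent, and it evidently rejects such requests with a 400. The corrected dict, with the same values as in the question:

headers = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    # Hyphen, not underscore: header names are sent exactly as written here
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}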
