4-8 Cannot log in to cnblogs

Source: 4-8. Simulated login to cnblogs (new content)

沧海红心

2022-06-08

Problem:
After the code starts running, nothing happens for a while.

Code:

import scrapy


class CnblognewsSpider(scrapy.Spider):
    name = 'cnblogNews'
    allowed_domains = ['news.cnblogs.com']
    start_urls = ['https://news.cnblogs.com/n/recommend']
    # Settings that apply to this spider only
    custom_settings = {
        # Let later requests reuse the cookies from earlier ones
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
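        # undetected_chromedriver is an open-source project.
        # Log in through the browser at the entry point to obtain the cookies;
        # a browser driven by plain selenium is detected by some sites, e.g. Zhihu and Lagou.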
        import undetected_chromedriver.v2 as uc
        browser = uc.Chrome()
        browser.get("https://account.cnblogs.com/signin")
        input("Press Enter to continue")
        cookie_dict = {}
        cookies = browser.get_cookies()
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']

        for url in self.start_urls:
            # Hand the cookies over to scrapy
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33 '
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers,
                                 dont_filter=True)

    def parse(self, response):
        # parse must be a method of the spider class, not nested in start_requests
        title_list = response.xpath('//div[@id="news_list"]//h2/a/text()').extract()
        for title in title_list:
            print(title)
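
The cookie hand-off in start_requests can be pulled out into a small helper, which makes it easy to check without opening a browser. This is only a minimal sketch: the name cookies_to_dict and the sample cookie values are invented for illustration. The input shape (a list of dicts with "name" and "value" keys) is the one the spider above reads from browser.get_cookies().

```python
def cookies_to_dict(cookies):
    """Collapse browser-style cookie records into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Invented sample data in the same shape the spider iterates over.
sample = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "theme", "value": "dark", "domain": ".example.com"},
]
cookie_dict = cookies_to_dict(sample)
print(cookie_dict)
```

The resulting dict can be passed as cookies= to scrapy.Request, which is what the spider already does.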

Output

C:\PythonWorkSpace\TprSpider\venv\Scripts\python.exe "C:\WorkSoft\PyCharm 2022.1\plugins\python\helpers\pydev\pydevd.py" --multiprocess --qt-support=auto --client 127.0.0.1 --port 11022 --file C:/PythonWorkSpace/TprSpider/NewsSpider/main.py
Connected to pydev debugger (build 221.5080.212)
2022-06-08 15:53:04 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: NewsSpider)
2022-06-08 15:53:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Windows-10-10.0.22000-SP0
2022-06-08 15:53:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'NewsSpider',
 'NEWSPIDER_MODULE': 'NewsSpider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['NewsSpider.spiders']}
2022-06-08 15:53:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-08 15:53:04 [scrapy.extensions.telnet] INFO: Telnet Password: 866b42828b89aeab
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-08 15:53:04 [scrapy.core.engine] INFO: Spider opened
2022-06-08 15:53:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-08 15:53:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-08 15:53:04 [undetected_chromedriver.patcher] DEBUG: getting release number from /LATEST_RELEASE
2022-06-08 15:53:05 [undetected_chromedriver.patcher] DEBUG: downloading from https://chromedriver.storage.googleapis.com/102.0.5005.61/chromedriver_win32.zip
2022-06-08 15:53:06 [undetected_chromedriver.patcher] DEBUG: unzipping C:\Users\tpr\AppData\Local\Temp\tmp4gexy7ur
2022-06-08 15:53:07 [undetected_chromedriver.patcher] INFO: patching driver executable C:\Users\tpr\appdata\roaming\undetected_chromedriver\5f0a4bec7b1bf2c7_chromedriver.exe
2022-06-08 15:53:07 [uc] DEBUG: created a temporary folder in which the user-data (profile) will be stored during this
session, and added it to chrome startup arguments: --user-data-dir=C:\Users\tpr\AppData\Local\Temp\tmpauuoniqd
2022-06-08 15:53:07 [uc] DEBUG: did not find a bad exit_type flag 
2022-06-08 15:53:08 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: NewsSpider)
2022-06-08 15:53:08 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Windows-10-10.0.22000-SP0
2022-06-08 15:53:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'NewsSpider',
 'NEWSPIDER_MODULE': 'NewsSpider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['NewsSpider.spiders']}
2022-06-08 15:53:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-08 15:53:08 [scrapy.extensions.telnet] INFO: Telnet Password: 4c8c0060fdc68eb0
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-08 15:53:08 [scrapy.core.engine] INFO: Spider opened
2022-06-08 15:53:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-08 15:53:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-06-08 15:53:09 [undetected_chromedriver.patcher] DEBUG: getting release number from /LATEST_RELEASE
2022-06-08 15:53:09 [undetected_chromedriver.patcher] DEBUG: downloading from https://chromedriver.storage.googleapis.com/102.0.5005.61/chromedriver_win32.zip
2022-06-08 15:53:10 [undetected_chromedriver.patcher] DEBUG: unzipping C:\Users\tpr\AppData\Local\Temp\tmp18w53m4h
2022-06-08 15:53:10 [undetected_chromedriver.patcher] INFO: patching driver executable C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe
2022-06-08 15:53:11 [uc] DEBUG: created a temporary folder in which the user-data (profile) will be stored during this
session, and added it to chrome startup arguments: --user-data-dir=C:\Users\tpr\AppData\Local\Temp\tmpsc_0r93x
2022-06-08 15:53:11 [uc] DEBUG: did not find a bad exit_type flag 
2022-06-08 15:53:11 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\scrapy\core\engine.py", line 150, in _next_request
    request = next(self.slot.start_requests)
  File "C:\PythonWorkSpace\TprSpider\NewsSpider\NewsSpider\spiders\cnblogNews.py", line 18, in start_requests
    browser = uc.Chrome()
  File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\undetected_chromedriver\__init__.py", line 388, in __init__
    self.browser_pid = start_detached(
  File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\undetected_chromedriver\dprocess.py", line 35, in start_detached
    ).start()
  File "C:\WorkSoft\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\WorkSoft\Python310\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\WorkSoft\Python310\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\WorkSoft\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\WorkSoft\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\WorkSoft\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
2022-06-08 15:53:11 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-08 15:53:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 1.859053,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 6, 8, 7, 53, 11, 44179),
 'log_count/DEBUG': 6,
 'log_count/ERROR': 1,
 'log_count/INFO': 11,
 'start_time': datetime.datetime(2022, 6, 8, 7, 53, 9, 185126)}
2022-06-08 15:53:11 [scrapy.core.engine] INFO: Spider closed (finished)
2022-06-08 15:53:11 [uc] DEBUG: closing webdriver
2022-06-08 15:53:11 [uc] DEBUG: killing browser
2022-06-08 15:53:11 [uc] DEBUG: successfully removed C:\Users\tpr\AppData\Local\Temp\tmpsc_0r93x
2022-06-08 15:53:11 [undetected_chromedriver.patcher] DEBUG: successfully unlinked C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe

Troubleshooting so far:
The file C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe does not exist at this path; the file name found there does not match.

[Image]

PyCharm console screenshot:
[Image]

Update

Screenshot of the code in main:
[Image]

2022-06-20
I changed main as shown below and it still does not work.
[Image]

2 answers

bobby

2022-06-10

On Windows, the launch code has to go under main, that is, inside the if __name__ == '__main__': block.  //img.mukewang.com/szimg/62a2a0b509dac3b909010403.jpg
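
Why the guard matters can be seen in a small simulation. On Windows, multiprocessing starts child processes with the "spawn" method, and the child re-runs the main script under the module name `__mp_main__` instead of `__main__`. The sketch below imitates that re-run with runpy rather than creating real processes. The names GUARDED, UNGUARDED, run_as and launch are illustrative only, and launch() merely stands in for the real start-up call in main.py.

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

# Hypothetical main.py bodies. launch() stands in for the real start-up call,
# e.g. scrapy.cmdline.execute(["scrapy", "crawl", "cnblogNews"]).
GUARDED = textwrap.dedent("""
    launched = []

    def launch():
        launched.append("crawl started")

    if __name__ == "__main__":
        launch()
""")

UNGUARDED = textwrap.dedent("""
    launched = []

    def launch():
        launched.append("crawl started")

    launch()
""")


def run_as(source, run_name):
    """Run source as a script under the given module name; return what it launched."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "main.py"
        path.write_text(source, encoding="utf-8")
        return runpy.run_path(str(path), run_name=run_name)["launched"]


# Launched directly: the guard is true and the crawl starts once.
print(run_as(GUARDED, "__main__"))
# Re-run by a spawned child: the guard is false, so nothing starts again.
print(run_as(GUARDED, "__mp_main__"))
# Without the guard, the child would start the crawl a second time.
print(run_as(UNGUARDED, "__mp_main__"))
```

This matches the log in the question, where the Scrapy start-up banner appears twice before the RuntimeError: the second banner comes from the re-run of an unguarded main script.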

沧海红心
The code ran fine before I added the Chrome simulated login, and it scraped data successfully. I have added a screenshot of the code in main to the question.
2022-06-12

bobby

2022-06-15

You need to write the code in main exactly the way I showed.

bobby
replying to
沧海红心
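If you follow the approach shown here, the code starts up without problems. It is probably a new problem that came up during the crawl, right?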
2022-06-21