4-8 Unable to log in to cnblogs
Source: 4-8 cnblogs simulated login (newly added content)
沧海红心
2022-06-08
Problem:
After the code starts running, nothing happens for quite a while.
Code:
import scrapy


class CnblognewsSpider(scrapy.Spider):
    name = 'cnblogNews'
    allowed_domains = ['news.cnblogs.com']
    start_urls = ['https://news.cnblogs.com/n/recommend']

    # Settings that apply only to this spider
    custom_settings = {
        # Let later requests reuse the cookies obtained by earlier ones
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # undetected_chromedriver is an open-source project.
        # Log in manually at this entry point to obtain cookies; a browser
        # driven by plain Selenium gets detected by some sites (e.g. Zhihu, Lagou).
        import undetected_chromedriver.v2 as uc
        browser = uc.Chrome()
        browser.get("https://account.cnblogs.com/signin")
        input("Press Enter to continue after logging in")
        cookie_dict = {}
        cookies = browser.get_cookies()
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            # Hand the cookies over to Scrapy
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers,
                                 dont_filter=True)

    def parse(self, response):
        title_list = response.xpath('//div[@id="news_list"]//h2/a/text()').extract()
        for title in title_list:
            print(title)
Run output:
C:\PythonWorkSpace\TprSpider\venv\Scripts\python.exe "C:\WorkSoft\PyCharm 2022.1\plugins\python\helpers\pydev\pydevd.py" --multiprocess --qt-support=auto --client 127.0.0.1 --port 11022 --file C:/PythonWorkSpace/TprSpider/NewsSpider/main.py
Connected to pydev debugger (build 221.5080.212)
2022-06-08 15:53:04 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: NewsSpider)
2022-06-08 15:53:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Windows-10-10.0.22000-SP0
2022-06-08 15:53:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'NewsSpider',
'NEWSPIDER_MODULE': 'NewsSpider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['NewsSpider.spiders']}
2022-06-08 15:53:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-08 15:53:04 [scrapy.extensions.telnet] INFO: Telnet Password: 866b42828b89aeab
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-08 15:53:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-08 15:53:04 [scrapy.core.engine] INFO: Spider opened
2022-06-08 15:53:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-08 15:53:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-08 15:53:04 [undetected_chromedriver.patcher] DEBUG: getting release number from /LATEST_RELEASE
2022-06-08 15:53:05 [undetected_chromedriver.patcher] DEBUG: downloading from https://chromedriver.storage.googleapis.com/102.0.5005.61/chromedriver_win32.zip
2022-06-08 15:53:06 [undetected_chromedriver.patcher] DEBUG: unzipping C:\Users\tpr\AppData\Local\Temp\tmp4gexy7ur
2022-06-08 15:53:07 [undetected_chromedriver.patcher] INFO: patching driver executable C:\Users\tpr\appdata\roaming\undetected_chromedriver\5f0a4bec7b1bf2c7_chromedriver.exe
2022-06-08 15:53:07 [uc] DEBUG: created a temporary folder in which the user-data (profile) will be stored during this
session, and added it to chrome startup arguments: --user-data-dir=C:\Users\tpr\AppData\Local\Temp\tmpauuoniqd
2022-06-08 15:53:07 [uc] DEBUG: did not find a bad exit_type flag
2022-06-08 15:53:08 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: NewsSpider)
2022-06-08 15:53:08 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Windows-10-10.0.22000-SP0
2022-06-08 15:53:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'NewsSpider',
'NEWSPIDER_MODULE': 'NewsSpider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['NewsSpider.spiders']}
2022-06-08 15:53:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-08 15:53:08 [scrapy.extensions.telnet] INFO: Telnet Password: 4c8c0060fdc68eb0
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-08 15:53:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-08 15:53:08 [scrapy.core.engine] INFO: Spider opened
2022-06-08 15:53:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-08 15:53:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-06-08 15:53:09 [undetected_chromedriver.patcher] DEBUG: getting release number from /LATEST_RELEASE
2022-06-08 15:53:09 [undetected_chromedriver.patcher] DEBUG: downloading from https://chromedriver.storage.googleapis.com/102.0.5005.61/chromedriver_win32.zip
2022-06-08 15:53:10 [undetected_chromedriver.patcher] DEBUG: unzipping C:\Users\tpr\AppData\Local\Temp\tmp18w53m4h
2022-06-08 15:53:10 [undetected_chromedriver.patcher] INFO: patching driver executable C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe
2022-06-08 15:53:11 [uc] DEBUG: created a temporary folder in which the user-data (profile) will be stored during this
session, and added it to chrome startup arguments: --user-data-dir=C:\Users\tpr\AppData\Local\Temp\tmpsc_0r93x
2022-06-08 15:53:11 [uc] DEBUG: did not find a bad exit_type flag
2022-06-08 15:53:11 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\scrapy\core\engine.py", line 150, in _next_request
request = next(self.slot.start_requests)
File "C:\PythonWorkSpace\TprSpider\NewsSpider\NewsSpider\spiders\cnblogNews.py", line 18, in start_requests
browser = uc.Chrome()
File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\undetected_chromedriver\__init__.py", line 388, in __init__
self.browser_pid = start_detached(
File "C:\PythonWorkSpace\TprSpider\venv\lib\site-packages\undetected_chromedriver\dprocess.py", line 35, in start_detached
).start()
File "C:\WorkSoft\Python310\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\WorkSoft\Python310\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\WorkSoft\Python310\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\WorkSoft\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\WorkSoft\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\WorkSoft\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
2022-06-08 15:53:11 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-08 15:53:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 1.859053,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 6, 8, 7, 53, 11, 44179),
'log_count/DEBUG': 6,
'log_count/ERROR': 1,
'log_count/INFO': 11,
'start_time': datetime.datetime(2022, 6, 8, 7, 53, 9, 185126)}
2022-06-08 15:53:11 [scrapy.core.engine] INFO: Spider closed (finished)
2022-06-08 15:53:11 [uc] DEBUG: closing webdriver
2022-06-08 15:53:11 [uc] DEBUG: killing browser
2022-06-08 15:53:11 [uc] DEBUG: successfully removed C:\Users\tpr\AppData\Local\Temp\tmpsc_0r93x
2022-06-08 15:53:11 [undetected_chromedriver.patcher] DEBUG: successfully unlinked C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe
Troubleshooting so far:
The file C:\Users\tpr\appdata\roaming\undetected_chromedriver\260866384a4809aa_chromedriver.exe does not exist at that path; the file name does not match.

[Screenshot: PyCharm console]
Additional update:
[Screenshot: code in the main entry script]
2022-06-20 update:
I modified the main script as shown, but it still doesn't work.
2 Answers
On Windows, the code that starts the crawl needs to be placed under the if __name__ == '__main__': guard in the main script; a minimal sketch follows below.
2022-06-12
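For reference, here is a minimal sketch of such a guarded entry point, assuming the spider is launched with scrapy.cmdline.execute from a main.py in the Scrapy project root; the spider name cnblogNews comes from the code above, but the rest is illustrative and is not necessarily the course's original main.py:

# main.py -- minimal sketch of an entry point guarded for Windows (illustrative).
# On Windows, multiprocessing uses "spawn", which re-imports this module in the
# child process that undetected_chromedriver starts, so anything that kicks off
# the crawl must live under the __main__ guard.
import os
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    # make the Scrapy project importable when run/debugged from PyCharm
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    # equivalent to running "scrapy crawl cnblogNews" from the project root
    execute(["scrapy", "crawl", "cnblogNews"])

Without the guard, the spawned child re-imports the main script and starts the crawl a second time (which is most likely why the Scrapy startup banner appears twice in the log above), and Python then raises the RuntimeError shown in the traceback.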
bobby
2022-06-15
You need to write the code in main exactly following the code I gave.