Submitting a function and arguments to the thread pool: the recursive handling of the next page only runs for a few pages and then stops
Source: 8-10 Refactoring the crawler with a ThreadPoolExecutor thread pool

仙女座舜
2020-03-10
# crawl CSDN with a thread pool
"""
fetch
parse
store
"""
import re
import ast
from urllib import parse
from datetime import datetime

import requests
from scrapy import Selector

from csdn_spider.models import *

domain = "https://bbs.csdn.net"


def parse_list(url):
    print('*' * 200)
    print("parsing list page: {}".format(url))
    res_text = requests.get(url).text
    sel = Selector(text=res_text)
    all_trs = sel.xpath("//table[@class='forums_tab_table']/tbody/tr")
    for tr in all_trs:
        topic_title = tr.xpath(".//td[3]/a/text()").extract()[0]
        print(topic_title)
    next_page = sel.xpath("//a[@class='pageliststy next_page']/@href").extract()
    if next_page:
        next_url = parse.urljoin(domain, next_page[1])
        task2 = executor.submit(parse_list, next_url)
        thread_list.append(task2)
        # parse_list(next_url)


if __name__ == "__main__":
    from concurrent.futures import ThreadPoolExecutor, as_completed

    executor = ThreadPoolExecutor(max_workers=5)
    last_urls = ['https://bbs.csdn.net/forums/ios']
    # crawl every topic title under this url, plus the titles on all of its next pages
    thread_list = []
    for url in last_urls:
        task1 = executor.submit(parse_list, url)
        thread_list.append(task1)
        # parse_list(url)
    # 'https://bbs.csdn.net/forums/ios' has a hundred next pages.
    # With executor.submit(parse_list, url), only a few pages get crawled,
    # whether or not the block below is commented back in.
    # Without executor.submit, calling parse_list(url) recursively crawls everything.
    # for future in as_completed(thread_list):
    #     data = future.result()
    #     print("-" * 60, data)
With single-threaded recursion there is no problem: every page gets crawled. With the thread pool the program raises no errors either, it just crawls a few pages and then exits.
When I set breakpoints and stepped through it, I couldn't find the problem either… that's the most baffling part.
2 Answers
-
bobby
2020-03-13
import re
import ast
from urllib import parse
from datetime import datetime

import requests
from scrapy import Selector

domain = "https://bbs.csdn.net"


def parse_list(url):
    print('*' * 200)
    print("parsing list page: {}".format(url))
    res_text = requests.get(url).text
    sel = Selector(text=res_text)
    all_trs = sel.xpath("//table[@class='forums_tab_table']/tbody/tr")
    for tr in all_trs:
        topic_title = tr.xpath(".//td[3]/a/text()").extract()[0]
        # print(topic_title)
    next_page = sel.xpath("//a[@class='pageliststy next_page']/@href").extract()
    if next_page:
        next_url = parse.urljoin(domain, next_page[1])
        print(next_url)
        task2 = executor.submit(parse_list, next_url)
        thread_list.append(task2)


if __name__ == "__main__":
    stop = False
    from concurrent.futures import ThreadPoolExecutor, as_completed

    executor = ThreadPoolExecutor(max_workers=5)
    last_urls = ['https://bbs.csdn.net/forums/ios']
    # crawl every topic title under this url, plus the titles on all of its next pages
    thread_list = []
    print(len(last_urls))
    for url in last_urls:
        task1 = executor.submit(parse_list, url)
        thread_list.append(task1)
        parse_list(url)
    # 'https://bbs.csdn.net/forums/ios' has a hundred next pages.
    # Once the main thread exits, tasks that are already running still finish,
    # but the tasks queued behind them never get executed.
    # as_completed only waits for the futures it was given and will not pick up
    # tasks submitted after that, so a global stop flag is used instead to decide
    # when the main thread may exit and the thread pool may be shut down.
    import time
    while not stop:
        time.sleep(1)
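As posted, nothing ever sets stop to True, so the final while loop would wait forever. Below is a minimal, self-contained sketch of the intended pattern with the network calls replaced by a simulated ten-page chain (the page numbers, the futures list and the 0.1 s poll interval are made up for illustration, not part of the course code): a worker flips the flag once it reaches a page with no next page, and only then does the main thread shut the pool down.

import time
from concurrent.futures import ThreadPoolExecutor

stop = False      # read by the main thread, written by whichever worker reaches the last page
futures = []      # keeps a reference to every submitted task

def parse_list(page):
    global stop
    print("parsing page", page)
    if page < 10:                                       # pretend pages 1..10 exist
        futures.append(executor.submit(parse_list, page + 1))
    else:
        stop = True                                     # last page reached: release the main thread

if __name__ == "__main__":
    executor = ThreadPoolExecutor(max_workers=5)
    futures.append(executor.submit(parse_list, 1))
    while not stop:           # keep the main thread alive so queued tasks keep running
        time.sleep(0.1)
    executor.shutdown()       # waits for any task still running, then frees the workers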
-
bobby
2020-03-11
The as_completed block must not be commented out; otherwise the main thread exits early and that easily causes the child threads to be cut off.
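For reference, this is roughly what the commented-out block does (a standalone illustration with dummy tasks, not the course code): as_completed blocks the main thread until every future passed to it has completed, which keeps the process alive while those tasks run. It only covers the futures it was given, though, so tasks that workers submit later are still not waited on, which is what the stop flag in the 2020-03-13 answer works around.

from concurrent.futures import ThreadPoolExecutor, as_completed

executor = ThreadPoolExecutor(max_workers=5)
# five dummy tasks standing in for the initially submitted parse_list calls
thread_list = [executor.submit(pow, 2, n) for n in range(5)]

# blocks until each of these five futures is done -- but only these five
for future in as_completed(thread_list):
    data = future.result()
    print("-" * 60, data)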