Using Mogumiao proxy IPs to request the AJAX data endpoint of Taobao search result pages, the program hangs (pauses)
Source: 8-8 Implementing an IP proxy pool with Scrapy - 3
马小勒
2018-09-13
Sometimes the program hangs after writing 8,000 rows to the database, sometimes after only 1,000. The console log at the point where it hangs is shown below; there are no other error messages.
......
......
sales: 677
sales: 671
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
sales: 668
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
2018-09-13 18:13:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220> (failed 1 times): User timeout caused connection failure: Getting https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220 took longer than 10.0 seconds..
get ip from ip api
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): piping.mogumiao.com:80
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: http://piping.mogumiao.com:80 "GET /proxy/api/get_ip_al?appKey=b828f9952ec847fca9c12d48833c93ba&count=1&expiryDate=0&format=1&newLine=2 HTTP/1.1" 200 57
-------r: {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
2018-09-13 18:13:09 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 49.87.117.74:38156
The start_requests method is as follows; the for loop covers 184 start_urls in total:
def start_requests(self):
    # Pass in request headers and cookies to simulate a real user's request.
    # Without headers (e.g. Chrome simulating a Zhihu login), a 400 error is
    # returned and parse is never entered.
    for word in self.search_key_words:
        start_url = self.goods['start_urls']
        yield scrapy.Request(
            start_url.format(self.start_data, word),
            headers=self.headers,
            meta={'next_data': 0, 'counts': self.counts, 'word': word}
        )
        time.sleep(1)
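For context, the goods['start_urls'] template and the attributes passed to format() are not included in the question; judging from the request URLs in the log above, they presumably look roughly like the sketch below (all names and values here are assumptions reconstructed from the log, not the actual code):

# Hypothetical reconstruction of the attributes used by start_requests,
# based on the request URLs visible in the log; not the questioner's actual code.
goods = {
    'start_urls': (
        'https://s.taobao.com/search?data-key=s&data-value={0}'
        '&ajax=true&q={1}&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44'
    ),
}
start_data = 0                                  # initial value of the data-value paging parameter
search_key_words = ['洗颜专科', 'jmsolution']    # keywords that appear (URL-encoded) in the log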
settings.py is configured as follows:
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = True  # (the program does not actually use cookies)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_DEBUG = True
RETRY_ENABLED = True
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 10
custom_setting.py is configured as follows:
'DOWNLOADER_MIDDLEWARES': {
    'SpiderProjects.middlewares.RandomUserAgentMiddlware': 490,
    'SpiderProjects.middlewares.RandomProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 500,
},
'RANDOM_UA_TYPE': 'random',
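RandomUserAgentMiddlware and the RANDOM_UA_TYPE setting it reads are referenced above but not shown in the question; a common implementation of this pattern uses the fake-useragent library, roughly as follows (a sketch under that assumption, not necessarily the actual middleware):

from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    # Attach a random User-Agent to every outgoing request.
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # RANDOM_UA_TYPE selects which attribute of UserAgent is read,
        # e.g. 'random', 'chrome', 'firefox'.
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', getattr(self.ua, self.ua_type))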
The middleware is as follows:
class RandomProxyMiddleware(object):
    # Dynamically set a proxy IP for each request.
    def process_request(self, request, spider):
        get_ip = GetIP()
        print('get ip from ip api')
        ip = get_ip.get_random_ip()
        request.meta["proxy"] = ip
        # request.meta["proxy"] = 'HTTP://114.229.139.176:35573'

    def process_exception(self, request, exception, spider):
        # When an exception (timeout) occurs, switch to a new proxy and retry.
        print("\nException occurred, retrying with a new proxy....\n")
        get_ip = GetIP()
        print('get ip from ip api')
        ip = get_ip.get_random_ip()
        request.meta['proxy'] = ip
        return request
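The GetIP helper called by this middleware is not shown either; from the API call and the "-------r:" response printed in the log above, it presumably does something like the following (a sketch: the use of requests, the API_URL constant, and the placeholder appKey are assumptions; only the endpoint, parameters, response format, and get_random_ip name come from the question):

import requests


class GetIP(object):
    # Fetch one proxy from the Mogumiao API and format it for request.meta['proxy'].
    API_URL = ('http://piping.mogumiao.com/proxy/api/get_ip_al'
               '?appKey=<your-appKey>&count=1&expiryDate=0&format=1&newLine=2')

    def get_random_ip(self):
        r = requests.get(self.API_URL, timeout=10)
        print('-------r:', r.text)
        # Response format seen in the log:
        # {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
        proxy = r.json()['msg'][0]
        return 'HTTP://{0}:{1}'.format(proxy['ip'], proxy['port'])

The returned string would match the 'HTTP://ip:port' format of the commented-out proxy line in process_request above.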
4 answers
马小勒
(Asker)
2018-09-18
Teacher, could you please take a look~
马小勒
(Asker)
2018-09-17
Those 9 requests were indeed never sent; the log did not print those 9 start_urls, while all the others were printed.
I've captured the final statistics from the log, which show the program did run to completion. Please see if you can spot the problem~
2018-09-14 17:08:38 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-14 17:08:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 47,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 6,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 41,
'downloader/request_bytes': 1021699,
'downloader/request_count': 1033,
'downloader/request_method_count/GET': 1033,
'downloader/response_bytes': 17872592,
'downloader/response_count': 986,
'downloader/response_status_count/200': 983,
'downloader/response_status_count/504': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 9, 14, 9, 8, 38, 319306),
'item_scraped_count': 18874,
'log_count/DEBUG': 19908,
'log_count/INFO': 1099,
'memusage/max': 65937408,
'memusage/startup': 57499648,
'request_depth_max': 33,
'response_received_count': 983,
'retry/count': 50,
'retry/reason_count/504 Gateway Time-out': 3,
'retry/reason_count/scrapy.core.downloader.handlers.http11.TunnelError': 6,
'retry/reason_count/twisted.internet.error.TimeoutError': 41,
'scheduler/dequeued': 1033,
'scheduler/dequeued/memory': 1033,
'scheduler/enqueued': 1033,
'scheduler/enqueued/memory': 1033,
'start_time': datetime.datetime(2018, 9, 14, 7, 21, 53, 790511)}
2018-09-14 17:08:38 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
马小勒
(Asker)
2018-09-13
Additional notes:
1. When the start_urls are cut down to 10, there is no problem; in that case 2,500 rows are written to MySQL;
2. Redis is not used.
马小勒
(Asker)
2018-09-13
My company needs to build a crawler system for e-commerce platforms, and this is where we're stuck right now. Thank you, teacher~