When using Mogu proxy IPs to request Taobao's search-results AJAX data API, the program gets stuck (hangs)

Source: 8-8 Implementing an IP proxy pool in Scrapy - 3

马小勒

2018-09-13

Sometimes the program hangs after writing 8,000 rows to the database, and sometimes after only 1,000. Below is the console log at the point where it hangs; apart from this there are no other error messages???

......
......
sales: 677
sales: 671
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
sales: 668
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
2018-09-13 18:13:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220> (failed 1 times): User timeout caused connection failure: Getting https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220 took longer than 10.0 seconds..
get ip from ip api
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): piping.mogumiao.com:80
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: http://piping.mogumiao.com:80 "GET /proxy/api/get_ip_al?appKey=b828f9952ec847fca9c12d48833c93ba&count=1&expiryDate=0&format=1&newLine=2 HTTP/1.1" 200 57
-------r: {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
2018-09-13 18:13:09 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 49.87.117.74:38156

The start_requests method is as follows; the for loop iterates over 184 start_urls in total:

    def start_requests(self):
        # Pass in request headers / cookies to simulate a real user's request.
        # Without headers, the Chrome-simulated Zhihu login returns a 400 error
        # and parse is never reached.
        for word in self.search_key_words:
            start_url = self.goods['start_urls']

            yield scrapy.Request(
                start_url.format(self.start_data, word),
                headers=self.headers,
                meta={'next_data': 0, 'counts': self.counts, 'word': word}
            )
            time.sleep(1)
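
For reference, here is a minimal sketch of the spider attributes that this method relies on. The attribute names come from the snippet above and the two example keywords come from the q= parameters visible in the log; everything else (class name, URL template, values) is an assumption for illustration only:

    import time
    import scrapy


    class TaobaoSearchSpider(scrapy.Spider):            # hypothetical class name
        name = 'taobao_search'
        # Assumed shapes of the attributes used in start_requests:
        search_key_words = ['洗颜专科', 'jmsolution']     # 184 keywords in the real spider
        goods = {
            # Simplified URL template; the real one matches the logged requests.
            'start_urls': 'https://s.taobao.com/search?data-value={0}&q={1}&ajax=true&sort=sale-desc'
        }
        start_data = 0                                   # offset for the first page
        counts = 44                                      # assumed items-per-page value
        headers = {'User-Agent': 'Mozilla/5.0'}          # placeholder request headers

        # ... the start_requests method shown above goes here ...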

settings.py is configured as follows:

DOWNLOAD_DELAY = 3
COOKIES_ENABLED = True  # although the program does not actually use cookies
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_DEBUG = True
RETRY_ENABLED = True
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 10
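
For what it's worth, DOWNLOAD_TIMEOUT = 10 is what produces the "took longer than 10.0 seconds" retries in the log above, and RETRY_TIMES = 5 allows five further attempts per URL. A rough worst-case estimate (an approximation only; it ignores the extra DOWNLOAD_DELAY / AutoThrottle waits between attempts):

    DOWNLOAD_TIMEOUT = 10                  # seconds allowed per download attempt
    RETRY_TIMES = 5                        # retries on top of the first attempt
    worst_case = DOWNLOAD_TIMEOUT * (RETRY_TIMES + 1)
    print(worst_case)                      # 60 seconds before a URL is finally abandoned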

custom_setting.py is configured as follows:

    'DOWNLOADER_MIDDLEWARES': {
        'SpiderProjects.middlewares.RandomUserAgentMiddlware': 490,
        'SpiderProjects.middlewares.RandomProxyMiddleware': 400,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 500,
    },
    'RANDOM_UA_TYPE': 'random',
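
A dict fragment like this is normally assigned to a spider's custom_settings attribute. Continuing the hypothetical spider sketched earlier, it would presumably sit on the class like this (for downloader middlewares, lower numbers run their process_request earlier):

    import scrapy


    class TaobaoSearchSpider(scrapy.Spider):            # hypothetical class name, as above
        name = 'taobao_search'
        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {
                'SpiderProjects.middlewares.RandomProxyMiddleware': 400,
                'SpiderProjects.middlewares.RandomUserAgentMiddlware': 490,
                'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 500,
            },
            'RANDOM_UA_TYPE': 'random',
        }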

The middleware is as follows:

    class RandomProxyMiddleware(object):
        # Dynamically set an IP proxy for every request.
        # GetIP is the project's own helper class (its import is omitted in the question).
        def process_request(self, request, spider):
            get_ip = GetIP()
            print('get ip from ip api')
            ip = get_ip.get_random_ip()
            request.meta["proxy"] = ip
            # request.meta["proxy"] = 'HTTP://114.229.139.176:35573'

        def process_exception(self, request, exception, spider):
            # On an exception (e.g. a timeout), switch to a new proxy and retry.
            print("\nException occurred, retrying with a new proxy....\n")
            get_ip = GetIP()
            print('get ip from ip api')
            ip = get_ip.get_random_ip()
            request.meta['proxy'] = ip
            return request
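
The GetIP helper itself is not shown in the question. Below is a minimal sketch of what it presumably does, reconstructed from the Mogu API request and JSON response visible in the log above; the internal structure is an assumption, and the appKey is a placeholder:

    import requests

    class GetIP(object):
        # Assumed implementation: fetch one proxy from the Mogu API and return it
        # in the "http://ip:port" form that request.meta['proxy'] expects.
        API_URL = ('http://piping.mogumiao.com/proxy/api/get_ip_al'
                   '?appKey=YOUR_APP_KEY&count=1&expiryDate=0&format=1&newLine=2')

        def get_random_ip(self):
            r = requests.get(self.API_URL, timeout=5)
            data = r.json()
            # Example response from the log:
            # {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
            proxy = data['msg'][0]
            return 'http://{0}:{1}'.format(proxy['ip'], proxy['port'])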

4 Answers

马小勒

Original poster

2018-09-18

Teacher, please take a look~


马小勒

Original poster

2018-09-17

Yes, those 9 requests were never sent: the log does not show those 9 start_urls, while all the others were printed.

I've pasted the final stats from the log below, which show that the program did run to completion. Please see if you can spot the problem~

2018-09-14 17:08:38 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-14 17:08:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 47,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 6,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 41,
 'downloader/request_bytes': 1021699,
 'downloader/request_count': 1033,
 'downloader/request_method_count/GET': 1033,
 'downloader/response_bytes': 17872592,
 'downloader/response_count': 986,
 'downloader/response_status_count/200': 983,
 'downloader/response_status_count/504': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 9, 14, 9, 8, 38, 319306),
 'item_scraped_count': 18874,
 'log_count/DEBUG': 19908,
 'log_count/INFO': 1099,
 'memusage/max': 65937408,
 'memusage/startup': 57499648,
 'request_depth_max': 33,
 'response_received_count': 983,
 'retry/count': 50,
 'retry/reason_count/504 Gateway Time-out': 3,
 'retry/reason_count/scrapy.core.downloader.handlers.http11.TunnelError': 6,
 'retry/reason_count/twisted.internet.error.TimeoutError': 41,
 'scheduler/dequeued': 1033,
 'scheduler/dequeued/memory': 1033,
 'scheduler/enqueued': 1033,
 'scheduler/enqueued/memory': 1033,
 'start_time': datetime.datetime(2018, 9, 14, 7, 21, 53, 790511)}
2018-09-14 17:08:38 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0



马小勒

Original poster

2018-09-13

Additional information:

1. When the start_urls are cut down to 10, there is no problem; at that point 2,500 rows are written to MySQL;

2. Redis is not being used.

bobby replied to 马小勒 (2018-09-26): Find me through the QQ group and send me a QQ message, and I'll take a look.
(5 replies in total)

马小勒

Original poster

2018-09-13

Our company needs to build a crawler system targeting e-commerce platforms, and we're stuck here right now. Thanks for your help, teacher~

