On lagou's /jobs/allCity.html, the tags for Beijing, Shanghai, Guangzhou, Shenzhen, and Hangzhou have no -zhaopin in their links. Is this anti-scraping?
Source: 2-1 Analyzing the job site structure and parsing the site's city list

慕无忌4207111
2019-10-14
<ul class="city_list">
<li >
<a href="https://www.lagou.com/shenzhen/">深圳</a>
<input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=深圳#filterBox"/>
</li>
<li >
<a href="https://www.lagou.com/shanghai/">上海</a>
<input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=上海#filterBox"/>
</li>
<li >
<a href=" https://www.lagou.com/suzhou-zhaopin/">苏州</a>
<input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=苏州#filterBox"/>
</li>
<li >
<a href=" https://www.lagou.com/shenyang-zhaopin/">沈阳</a>
<input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=沈阳#filterBox"/>
It feels like these cities were singled out, but a small side path was left open. Is there any way to crawl them?
After all, these cities account for the bulk of the listings...
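One workaround, sketched below under the assumption that the markup shown above is representative: each li also carries a hidden input with class "dn" whose value attribute holds the full job-list URL, and that value is present even for cities whose a href lacks the -zhaopin suffix. So instead of following the hrefs, pull the input values with xpath:

```python
from lxml import etree

# Sample markup copied from the question; in practice this would be the
# response body of https://www.lagou.com/jobs/allCity.html
html = '''
<ul class="city_list">
  <li>
    <a href="https://www.lagou.com/shenzhen/">深圳</a>
    <input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=深圳#filterBox"/>
  </li>
  <li>
    <a href="https://www.lagou.com/suzhou-zhaopin/">苏州</a>
    <input class="dn" value="https://www.lagou.com/jobs/list_webrtc?&px=default&city=苏州#filterBox"/>
  </li>
</ul>
'''

tree = etree.HTML(html)
# City names come from the visible <a> text ...
names = tree.xpath('//ul[@class="city_list"]/li/a/text()')
# ... while the hidden <input class="dn"> carries a usable job-list URL
# even when the <a> href has no -zhaopin suffix.
urls = tree.xpath('//ul[@class="city_list"]/li/input[@class="dn"]/@value')
for name, url in zip(names, urls):
    print(name, url)
```

This only shows how to extract the fallback URLs; whether lagou serves those URLs to a scraper still depends on its usual cookie/header checks.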
2 Answers
two10
2019-11-05
Also, I'm on a Linux system, so my request headers differ from the instructor's. Use your own headers.
two10
2019-11-05
# This is easy to grab with xpath
import requests
from lxml import etree


class HandleLagou(object):
    def __init__(self):
        # The session keeps the cookie information across requests
        self.lagou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
        }
        self.city_list = ''

    def handle_request(self):
        # Fetch the city-list page
        headers = self.header
        city_url = 'https://www.lagou.com/jobs/allCity.html'
        response = self.lagou_session.get(city_url, headers=headers)
        if response.status_code == 200:
            return response.content.decode('utf-8')
        return None

    def handle_city(self, html):
        # Parse the city names out of the page
        etree_html = etree.HTML(html)
        city_search = etree_html.xpath('//ul[@class="city_list"]/li/a/text()')
        self.city_list = city_search
        print(self.city_list)

    def main(self):
        html = self.handle_request()
        self.handle_city(html)


# Entry point
if __name__ == '__main__':
    lagou = HandleLagou()
    lagou.main()
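A further sketch, not an official lagou API: the hidden input values in the question all follow one URL pattern, so a job-list URL for any city can be rebuilt from it. The keyword "webrtc" and the pattern itself are taken from the quoted markup above; the city name is percent-encoded with the standard library since the raw attribute holds it unencoded.

```python
from urllib.parse import quote

def job_list_url(city, keyword='webrtc'):
    # Pattern observed in the hidden <input class="dn"> values above;
    # both the keyword and the city name are percent-encoded (UTF-8).
    return ('https://www.lagou.com/jobs/list_%s?&px=default&city=%s#filterBox'
            % (quote(keyword), quote(city)))

print(job_list_url('深圳'))
```

This way the five big cities can be addressed directly without relying on their a hrefs at all, assuming the pattern stays stable.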