Garbled text when scraping Toutiao news data; could bobby please take a look?

Source: 14-3 Basic usage of pymysql

慕粉3332094

2020-12-13

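Hello bobby, I have a question. When I scrape Toutiao news data (for example https://www.toutiao.com/a6905208661061321229/), the response text comes back garbled. The page source declares its encoding as "UTF-8".
My main code is as follows:
res = requests.get(url=url, headers=headers)  # I have already filled in all the header fields.
res.encoding = "UTF-8"  # or: res.encoding = res.apparent_encoding
print(res.text)  # garbled output
As a result, when I later try to locate data with XPath, et_html.xpath('…') always returns an empty list.
For example:
et_html = etree.HTML(res.text)
news_title = et_html.xpath('XPath for the news title')
This problem has bothered me for a long time. I tried many methods found online and none of them worked, so I would be very grateful for your help.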


3 answers

bobby

2020-12-16

# -*- coding: utf-8 -*-
import requests

from lxml import etree

headers={
"authority": "www.toutiao.com",
"method": "GET",
"path": "/a6899266557403251204/",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
# "accept-encoding": "gzip, deflate, br",  # commented out here; the asker's script sends this header
"accept-language": "zh-CN,zh;q=0.9",
"cache-control": "max-age=0",
"cookie": "tt_webid=6899435744961496589; ttcid=e44467c01f3a48319fd161243c958a1216; s_v_web_id=verify_khyx4t33_E1RbqRat_i7xg_4pvl_BHea_fRGn91EnwpuK; MONITOR_WEB_ID=e3f6502a-7ac3-47db-9f9a-d8aa79c19f97; __ac_nonce=05fbfb93700546bd65761; __ac_signature=_02B4Z6wo00f01mGnw9QAAIBD3oaubePhO6JhosdAAMfiv0dtBhULWTJ4MbGzqemNgV4Jiq50qPcgeTRczagJNBlV9iS0cWroNRUvd5ujbrEMjCJ1eq3GfNFHKnnFlp9U2d3q0Cy8b5u59uo5eb; tt_scid=-FAH69zmFL0eKR4nDbYZW4FVQzOtFDDcLgsOiEytRRiDPukn-5IRMCKJ2NowgNLrb14f",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
"referer": "https://www.toutiao.com/a6899266557403251204/"
}

url='https://www.toutiao.com/a6899266557403251204/'
res=requests.get(url=url,headers=headers)
# print(res.content.decode("utf-8"))
print(res.status_code)
# res.encoding=res.apparent_encoding
# res.encoding="UTF-8"
print(res.text)
et_html=etree.HTML(res.text)

title=et_html.xpath('//*[@id="root"]/div/div[2]/div[1]/div[2]/h1/text()')
print("".join(title))
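
The only functional differences between this script and the asker's are that `accept-encoding` is commented out and `res.encoding` is left untouched. A likely explanation, which the thread itself does not confirm, is that advertising `br` lets the server reply with a Brotli-compressed body, and requests can only decompress Brotli when a Brotli package is installed; otherwise `res.text` is the compressed bytes decoded as text. The sketch below illustrates that effect using only the standard library. It uses gzip as a stand-in for Brotli, and `HTML` and `drop_encoding` are hypothetical names made up for the illustration.

```python
import gzip

# Hypothetical page body, repeated so that compression clearly kicks in.
HTML = (
    "<html><body><h1>头条 news title</h1>"
    + "<p>正文 body text</p>" * 20
    + "</body></html>"
)


def drop_encoding(accept_encoding, unwanted="br"):
    """Return an Accept-Encoding value with one content coding removed."""
    codings = [c.strip() for c in accept_encoding.split(",")]
    kept = [
        c for c in codings
        if c and c.split(";")[0].strip().lower() != unwanted
    ]
    return ", ".join(kept)


# Stand-in for a compressed response body. gzip is used only because
# Brotli is not part of the standard library (assumption of this sketch).
compressed = gzip.compress(HTML.encode("utf-8"))

# Decoding still-compressed bytes as UTF-8 produces garbled text.
garbled = compressed.decode("utf-8", errors="replace")

# After decompression, the same decode gives back the original markup.
readable = gzip.decompress(compressed).decode("utf-8")

print("title tag found in garbled text: ", "<h1>" in garbled)
print("title tag found in readable text:", "<h1>" in readable)
print(drop_encoding("gzip, deflate, br"))
```

If the header has to be sent, the other option is to install a Brotli package so that requests can decode `br` responses itself.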


bobby
replied to
慕粉3332094
OK.
2020-12-18
2 replies in total

慕粉3332094

Asker

2020-12-14

# -*- coding: utf-8 -*-
import requests

from lxml import etree

headers={
"authority": "www.toutiao.com",
"method": "GET",
"path": "/a6899266557403251204/",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cache-control": "max-age=0",
"cookie": "tt_webid=6899435744961496589; ttcid=e44467c01f3a48319fd161243c958a1216; s_v_web_id=verify_khyx4t33_E1RbqRat_i7xg_4pvl_BHea_fRGn91EnwpuK; MONITOR_WEB_ID=e3f6502a-7ac3-47db-9f9a-d8aa79c19f97; __ac_nonce=05fbfb93700546bd65761; __ac_signature=_02B4Z6wo00f01mGnw9QAAIBD3oaubePhO6JhosdAAMfiv0dtBhULWTJ4MbGzqemNgV4Jiq50qPcgeTRczagJNBlV9iS0cWroNRUvd5ujbrEMjCJ1eq3GfNFHKnnFlp9U2d3q0Cy8b5u59uo5eb; tt_scid=-FAH69zmFL0eKR4nDbYZW4FVQzOtFDDcLgsOiEytRRiDPukn-5IRMCKJ2NowgNLrb14f",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
"referer": "https://www.toutiao.com/a6899266557403251204/"
}

url='https://www.toutiao.com/a6899266557403251204/'
res=requests.get(url=url,headers=headers)
print(res.status_code)
res.encoding=res.apparent_encoding
# res.encoding="UTF-8"
print(res.text)
et_html=etree.HTML(res.text)

title=et_html.xpath('//*[@id="root"]/div/div[2]/div[1]/div[2]/h1/text()')
print("".join(title))

-----------------------------------------------------
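Note: from my analysis, the header values are the same for different URLs except for the trailing part of the cookie. bobby, if you run this code you may need to update the cookie accordingly.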
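
Since only a few header values change from one article to the next, one way to avoid editing the dictionary by hand is to keep a shared template and fill in the per-URL fields in a small helper. This is only a sketch: `BASE_HEADERS`, `build_headers`, and the placeholder cookie value are hypothetical, and the real values are the ones in the script above.

```python
# Hypothetical shared template; shortened here for illustration.
BASE_HEADERS = {
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "user-agent": "Mozilla/5.0",
}


def build_headers(url, cookie, base=BASE_HEADERS):
    """Return a fresh headers dict for one article URL."""
    headers = dict(base)  # copy, so the shared template is never mutated
    headers["referer"] = url
    headers["cookie"] = cookie
    return headers


article_url = "https://www.toutiao.com/a6899266557403251204/"
headers = build_headers(article_url, "tt_webid=PLACEHOLDER")
print(headers["referer"])
```

The resulting dictionary can be passed to `requests.get(url=article_url, headers=headers)` exactly as in the script above.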



bobby

2020-12-13

Please post your complete code and I will try running it locally.

