Garbled text when scraping Toutiao news data; bobby, could you please take a look?

Source: 14-3 Basic usage of pymysql

慕粉3332094

2020-12-13

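Hello bobby, I have a question. When I scrape Toutiao news pages (for example https://www.toutiao.com/a6905208661061321229/), the response text comes back garbled, even though the page source declares its encoding as "UTF-8".
My main code is as follows:
res = requests.get(url=url, headers=headers)  # headers already contains the full set of request headers
res.encoding = "UTF-8"  # I also tried: res.encoding = res.apparent_encoding
print(res.text)  # prints garbled text
As a result, when I later try to locate data with XPath, et_html.xpath('…') always returns an empty list.
For example:
et_html = etree.HTML(res.text)
news_title = et_html.xpath('XPath of the news title')
This problem has bothered me for a long time. I have tried many approaches found online and none of them worked, so I would be very grateful if you could help.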


3 answers

bobby

2020-12-16

Accepted answer

# -*- coding: utf-8 -*-
import requests

from lxml import etree

headers={
"authority": "www.toutiao.com",
"method": "GET",
"path": "/a6899266557403251204/",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
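# Left commented out: advertising "br" can make the server reply with Brotli,
# which requests likely returns undecoded when no Brotli package is installed.
# "accept-encoding": "gzip, deflate, br",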
"accept-language": "zh-CN,zh;q=0.9",
"cache-control": "max-age=0",
"cookie": "tt_webid=6899435744961496589; ttcid=e44467c01f3a48319fd161243c958a1216; s_v_web_id=verify_khyx4t33_E1RbqRat_i7xg_4pvl_BHea_fRGn91EnwpuK; MONITOR_WEB_ID=e3f6502a-7ac3-47db-9f9a-d8aa79c19f97; __ac_nonce=05fbfb93700546bd65761; __ac_signature=_02B4Z6wo00f01mGnw9QAAIBD3oaubePhO6JhosdAAMfiv0dtBhULWTJ4MbGzqemNgV4Jiq50qPcgeTRczagJNBlV9iS0cWroNRUvd5ujbrEMjCJ1eq3GfNFHKnnFlp9U2d3q0Cy8b5u59uo5eb; tt_scid=-FAH69zmFL0eKR4nDbYZW4FVQzOtFDDcLgsOiEytRRiDPukn-5IRMCKJ2NowgNLrb14f",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
"referer": "https://www.toutiao.com/a6899266557403251204/"
}

url='https://www.toutiao.com/a6899266557403251204/'
res=requests.get(url=url,headers=headers)
# print(res.content.decode("utf-8"))
print(res.status_code)
# res.encoding=res.apparent_encoding
# res.encoding="UTF-8"
print(res.text)
et_html=etree.HTML(res.text)

title=et_html.xpath('//*[@id="root"]/div/div[2]/div[1]/div[2]/h1/text()')
print("".join(title))
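
The only functional difference from the asker's script is that the "accept-encoding" header is commented out and res.encoding is left alone. A likely explanation, assuming no Brotli decoder is installed alongside requests: the server answers "br" with a Brotli-compressed body, requests hands it back still compressed, and decoding compressed bytes as UTF-8 can only give garbage, whatever res.encoding is set to. The sketch below reproduces that effect offline. It uses the standard library's gzip as a stand-in for Brotli (an assumption, so that it runs without extra packages), and the sample HTML string is made up.

```python
import gzip

# Hypothetical page content standing in for the real Toutiao response.
html = '<html><head><meta charset="UTF-8"></head><body><h1>头条新闻标题</h1></body></html>'

# Bytes as they travel on the wire when the server compresses the body
# (gzip is used here only as a stand-in for Brotli).
compressed = gzip.compress(html.encode("utf-8"))

# Decoding the still-compressed bytes as UTF-8 yields mojibake;
# choosing a different text encoding cannot repair this.
garbled = compressed.decode("utf-8", errors="replace")

# Decompressing first, then decoding, recovers the original markup.
restored = gzip.decompress(compressed).decode("utf-8")

print(garbled == html)   # False
print(restored == html)  # True
```

With the header removed, the server falls back to an encoding that requests can decode on its own, so res.text is readable and the XPath query has real markup to match against.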

bobby

慕粉3332094

OK.

2020-12-18

2 replies in total

慕粉3332094

Asker

2020-12-14

# -*- coding: utf-8 -*-
import requests

from lxml import etree

headers={
"authority": "www.toutiao.com",
"method": "GET",
"path": "/a6899266557403251204/",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cache-control": "max-age=0",
"cookie": "tt_webid=6899435744961496589; ttcid=e44467c01f3a48319fd161243c958a1216; s_v_web_id=verify_khyx4t33_E1RbqRat_i7xg_4pvl_BHea_fRGn91EnwpuK; MONITOR_WEB_ID=e3f6502a-7ac3-47db-9f9a-d8aa79c19f97; __ac_nonce=05fbfb93700546bd65761; __ac_signature=_02B4Z6wo00f01mGnw9QAAIBD3oaubePhO6JhosdAAMfiv0dtBhULWTJ4MbGzqemNgV4Jiq50qPcgeTRczagJNBlV9iS0cWroNRUvd5ujbrEMjCJ1eq3GfNFHKnnFlp9U2d3q0Cy8b5u59uo5eb; tt_scid=-FAH69zmFL0eKR4nDbYZW4FVQzOtFDDcLgsOiEytRRiDPukn-5IRMCKJ2NowgNLrb14f",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
"referer": "https://www.toutiao.com/a6899266557403251204/"
}

url='https://www.toutiao.com/a6899266557403251204/'
res=requests.get(url=url,headers=headers)
print(res.status_code)
res.encoding=res.apparent_encoding
# res.encoding="UTF-8"
print(res.text)
et_html=etree.HTML(res.text)

title=et_html.xpath('//*[@id="root"]/div/div[2]/div[1]/div[2]/h1/text()')
print("".join(title))
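
The script above advertises "br" in "accept-encoding". If the server then replies with Brotli and no Brotli decoder is installed, requests most likely returns the body still compressed, which would explain both the garbled res.text and the empty XPath result. One possible way to avoid this, sketched below, is to advertise "br" only when a decoder can actually be imported. The helper name build_accept_encoding is hypothetical, and the module names brotli and brotlicffi are assumed to be the packages that recent urllib3 versions look for.

```python
import importlib.util


def build_accept_encoding() -> str:
    """Return an Accept-Encoding value limited to encodings decodable locally."""
    # gzip and deflate are decoded via zlib, which ships with Python.
    encodings = ["gzip", "deflate"]
    # Advertise Brotli only if a decoder package is importable (assumed module names).
    if any(importlib.util.find_spec(name) for name in ("brotli", "brotlicffi")):
        encodings.append("br")
    return ", ".join(encodings)


# Hypothetical usage: put the value into the headers dict passed to requests.get.
headers = {"accept-encoding": build_accept_encoding()}
print(headers["accept-encoding"])
```

Dropping the header entirely also works, since requests then sends its own default value based on what it can decode.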