Question about crawling comments

Source: 6-18 Implementing the Zhihu spider crawl logic and extracting answers - 2

EnzoLiu

2018-09-25

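Regarding the limit and offset parameters of the API endpoint used to fetch comments: I see that bobby hardcoded them to 20 and 0.

If there are a lot of comments, wouldn't fetching them this way be a problem?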

start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={1}&offset={2}"

...

yield scrapy.Request(url=self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)

1 answer

bobby

2018-09-26

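The logic here only needs to request the first page explicitly, and that first request fetches 20 items. What the next page is, and how many items it fetches, does not have to be worked out by hand, because Zhihu's response already returns the URL of the next page.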
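
The idea of following the returned next-page URL can be sketched roughly as below. This is a minimal sketch, not the course's code: the field names `paging`, `is_end` and `next` are assumptions about the shape of the JSON response, and `next_page_url`, `iter_pages` and `fetch` are hypothetical helpers introduced only for illustration.

```python
import json
from typing import Callable, Iterator, Optional


def next_page_url(body: str) -> Optional[str]:
    """Return the URL of the next page, or None when there is no next page.

    Assumes the response looks like
    {"data": [...], "paging": {"is_end": false, "next": "<url>"}}.
    These field names are an assumption, not confirmed by the thread.
    """
    paging = json.loads(body).get("paging", {})
    # Treat a missing "is_end" flag as the last page so the loop always stops.
    if paging.get("is_end", True):
        return None
    return paging.get("next")


def iter_pages(first_url: str, fetch: Callable[[str], str]) -> Iterator[str]:
    """Yield every page body, starting at first_url and following next links.

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body as text.
    """
    url: Optional[str] = first_url
    while url:
        body = fetch(url)
        yield body
        url = next_page_url(body)
```

In a Scrapy callback such as parse_answer, the same check would probably amount to yielding another scrapy.Request for the value of next_page_url(response.text) whenever it is not None, with the same headers and callback as the first request. The hardcoded 20 and 0 then only determine where the crawl starts.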

EnzoLiu
Thank you very much!
2018-09-26