How do I handle the newly added pinned posts on the CSDN forum?

Source: 14-11 Fetching and parsing the list page - 2

JackyBreak

2020-02-14

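Hello instructor, CSDN has added three pinned posts to every board. My approach is to skip these three pinned posts by having the list start from the sixth tr. Debugging confirms that the list no longer contains them, but extracting the topic id further down raises an error.

The error says the id I extracted is "J2EE", which is the href of the first a tag in the topmost pinned post.

But my tr list no longer contains the pinned posts. What is causing this?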

My code is as follows:

def parse_list(url):
    topic_chart = Topic()
    res_text = requests.get(url).text
    sel = Selector(text=res_text)
    all_trs = sel.xpath("//table[@class='forums_tab_table']//tr")[5:]
    print(all_trs[0].extract())
    for tr in all_trs:
        if tr:
            if tr.xpath("//td[1]//span/text()").extract():
                status = tr.xpath("//td[1]//span/text()").extract()[0]
                topic_chart.status = status
            if tr.xpath("//td[2]//em/text()").extract():
                score = tr.xpath("//td[2]//em/text()").extract()[0]
                topic_chart.score = int(score)
            if tr.xpath("//td[3]/a/@href").extract():
                # try:
                url = tr.xpath("//td[3]/a/@href").extract()[0]
                topic_url = parse.urljoin(domain, url)
                topic_chart.id = int(topic_url.split("/")[-1])
                # except:
                #     topic_url = parse.urljoin(domain, tr.xpath("//td[3]//a[2]/@href").extract()[0])
                #     topic_chart.id = int(topic_url.split("/")[-1])
            if tr.xpath("//td[3]//a/text()").extract():
                topic_title = tr.xpath("//td[3]//a/text()").extract()[0]
                topic_chart.title = topic_title
            if tr.xpath("//td[4]//a/@href").extract():
                author_url = parse.urljoin(domain, tr.xpath("//td[4]//a/@href").extract()[0])
                author_id = author_url.split("/")[-1]
                topic_chart.author = author_id
            if tr.xpath("//td[4]//em/text()").extract():
                create_time = datetime.strptime(tr.xpath("//td[4]//em/text()").extract()[0], "%Y-%m-%d %H:%M")
                topic_chart.create_time = create_time
            if tr.xpath("//td[5]//span/text()").extract():
                answer_info = tr.xpath("//td[5]//span/text()").extract()[0]
                answer_nums = answer_info.split("/")[0]
                click_nums = answer_info.split("/")[1]
                topic_chart.answer_nums = int(answer_nums)
                topic_chart.click_nums = int(click_nums)
            if tr.xpath("//td[6]//em/text()").extract():
                last_reply_time = tr.xpath("//td[6]//em/text()").extract()[0]
                last_time = datetime.strptime(last_reply_time, "%Y-%m-%d %H:%M")
                topic_chart.last_answer_time = last_time

        topic_chart.save()


1 answer