XPath extracts the wrong img_url and post_url

Source: 4-9 Writing the spider to complete the crawl - 1

GoGo闯1

2019-11-05

Code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['news.cnblogs.com']
    start_urls = ['http://news.cnblogs.com/']

    def parse(self, response):
        
        # extract_first returns the first element of the list; if the list is empty, it returns the default value
        #url = response.xpath('//*[@id="entry_647068"]/div[2]/h2/a/@href').extract_first("")

        post_notes = response.xpath('//*[@id="news_list"]/div[@class="news_block"]')
        for post_note in post_notes:
            print ("="*60)
            
            img_url = post_note.xpath('//div[@class="entry_summary"]/a/img/@src').extract_first("")
            post_url = post_note.xpath('//h2[@class="news_entry"]/a/@href').extract_first("")

            print (post_note)
            print (img_url)
            print (post_url)

Output:

============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647122'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647121'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647120'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647119'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647118'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647117'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647116'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647115'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647114'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647113'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647112'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647111'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647110'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647109'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647107'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647108'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647106'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647105'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647104'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647103'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647102'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647101'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647100'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647099'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647098'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647096'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647097'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647095'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647094'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/
============================================================
<Selector xpath='//*[@id="news_list"]/div[@class="news_block"]' data='<div class="news_block" id="entry_647093'>
//images0.cnblogs.com/news_topic/小米.gif
/n/647122/

I can't figure out why: post_note is different on every iteration of the for loop, yet the extracted img_url and post_url values are always the same.

1 answer

bobby

2019-11-06

That is indeed a bit odd. Could you try the CSS selectors and see whether the problem still occurs?
