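Today I wanted to scrape some news from nytimes to read, but when I crawl this address I can't get the date of each entry: https://www.nytimes.com/section/politics
I added scrapy-playwright and the result is the same. Could any scraping experts point me in the right direction?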
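For context, enabling scrapy-playwright usually comes down to settings along these lines. This is a sketch of the setup described in the scrapy-playwright README, not necessarily an exact copy of my settings.py, so treat it as an assumption about the configuration rather than a confirmed part of the problem.

```python
# settings.py (sketch): route http/https downloads through Playwright.
# Assumes scrapy-playwright is installed and browsers were set up
# with `playwright install`.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Without these, meta={"playwright": True} on a request has no effect and the page is fetched by the regular downloader.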
Here is the spider code:
import scrapy

from my_spider.items import MySpiderItem


class Mypider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["nytimes.com"]
    start_urls = ["https://www.nytimes.com/section/politics"]

    def start_requests(self):
        for url in self.start_urls:
            # Load dynamic content through Playwright
            # (a regular scrapy.Request with meta={"playwright": True})
            # GET request
            yield scrapy.Request(url, meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response):
        # Each .css-18yolpw element is treated as one article entry
        for article in response.css('.css-18yolpw'):
            item = MySpiderItem()
            item["title"] = article.css('div:nth-child(1) > article:nth-child(1) > a:nth-child(2) > h3:nth-child(1)::text').get()
            # t: the same title via XPath, kept for debugging
            t = article.xpath('div/article/a/h3/text()').get()
            item["date"] = article.css('div:nth-child(1) > div:nth-child(2) > span:nth-child(1)::text').get()
            # d: the same date via XPath, kept for debugging
            d = article.xpath('div/div/span/text()').get()
            item["url"] = response.urljoin(article.css('div:nth-child(1) > article:nth-child(1) > a:nth-child(2)::attr(href)').get())
            item["claim"] = article.css('div:nth-child(1) > article:nth-child(1) > p:nth-child(3)::text').get()
            item["rating"] = "True"
            item["site"] = "NYTimes"
            item["tag"] = "NYTimes"
            yield item
The value of d is always "\u00a0" (a non-breaking space).
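One thing that might be worth trying, sketched under an assumption I have not verified against the live page: the first text node in that span is only a non-breaking space, and the real date sits in a later text node or in a nested element. The helper name first_meaningful_text is hypothetical, and the broader selector in the usage note is a guess as well.

```python
def first_meaningful_text(texts):
    """Return the first string that is not just whitespace, else None.

    `texts` is expected to be a list of text nodes, for example the
    result of `.getall()` on a Scrapy selector. Non-breaking spaces
    (\u00a0) are treated as ordinary whitespace.
    """
    for text in texts:
        # Turn NBSPs into plain spaces, then trim both ends.
        cleaned = text.replace("\u00a0", " ").strip()
        if cleaned:
            return cleaned
    return None
```

In parse this could be used as item["date"] = first_meaningful_text(article.css("span ::text").getall()), which looks at every text node under the span elements instead of only the first one. If it still returns None, the date is probably not in the HTML that Scrapy receives at all.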