Scraping weather data with Python Scrapy and exporting it to a CSV file

2022-08-06 18:00:55

Scraping the weather for xxx

Target URL: https://tianqi.2345.com/today-60038.htm

Installation

pip install scrapy

I am using Scrapy 2.5

Creating the Scrapy project

Enter the following command on the command line

scrapy startproject name

name is the project name
For example, scrapy startproject spider_weather
Then change into the newly created project directory and enter

scrapy genspider spider_name domain

For example, scrapy genspider changshu tianqi.2345.com

Inspecting the folder

- spider_weather
   - spiders
       - __init__.py
       - changshu.py
   - __init__.py
   - items.py
   - middlewares.py
   - pipelines.py
   - settings.py
- scrapy.cfg

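File descriptions

Name	Purpose
scrapy.cfg	Project configuration, mainly the basic settings used by the Scrapy command-line tool. (The settings that actually control the crawl live in settings.py.)
items.py	Defines the data storage templates used to structure the scraped data, similar to a Django Model
pipelines.py	Data processing behaviour, typically persisting the structured data
settings.py	Configuration file, e.g. crawl depth, concurrency, download delay
spiders	Spider directory, where you create spider files and write the crawling rules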

Start scraping

1. In the spiders folder, write the scraping logic in the spider file you created; in this example that is spiders/changshu.py

The code is as follows

import scrapy

class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
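        # date, weather state, temperature, wind level
        # parse the data with XPath; the syntax is simple, so it is worth a quick look if you have not used it before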
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # yield one record per entry
        for date, state, temp, wind in zip(dates,states,temps,winds):
            yield {
                'date' : date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
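
The zip call above pairs the four lists by position, so if one XPath expression matches fewer nodes than the others, zip silently stops at the shortest list and the remaining rows may be misaligned. A minimal sketch of a guard follows, assuming a hypothetical helper named rows_from_columns (it is not part of Scrapy or of the original spider) and placeholder values instead of real scraped data:

```python
def rows_from_columns(**columns):
    """Turn equally long column lists into a list of row dicts.

    Hypothetical helper: raises instead of silently truncating
    when the column lists differ in length.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Placeholder values, not real scraped data.
rows = rows_from_columns(
    date=["d1", "d2"],
    state=["s1", "s2"],
    temp=["t1", "t2"],
    wind=["w1", "w2"],
)
print(rows[0])
```

Inside parse, the loop could then become yield from rows_from_columns(date=dates, state=states, temp=temps, wind=winds), which fails loudly when the page layout changes.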

2. Configure settings.py

Change the UA

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

Change the robots.txt setting (False tells Scrapy not to check the site's robots.txt before crawling)

ROBOTSTXT_OBEY = False

The complete file is as follows:

# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
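
The file above leaves the throttling options commented out. As a sketch only, assuming you want to slow the crawl down, lines like the following could be enabled in settings.py. The setting names are the ones that appear in the generated file; the values are illustrative assumptions, not recommendations from the original article:

```python
# settings.py fragment (illustrative values, adjust to your needs)

# Wait between requests to the same website, in seconds.
DOWNLOAD_DELAY = 3

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound under high latency
```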

3. Then enter the following command on the command line

scrapy crawl changshu -o weather.csv

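Note: run this from inside the spider_weather project directory
scrapy crawl spider_name -o weather.csv (the exported file); spider_name is the spider's name attribute, changshu here, not the file name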

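4. The result: weather.csv is written to the directory where the command was run, with one row per yielded item and the columns date, state, temp and wind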
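
To check the export without opening a spreadsheet, the file can be read back with Python's standard csv module. A minimal sketch, assuming the columns are the four keys yielded by the spider; the helper name load_weather and the sample row are made up so that the snippet runs on its own:

```python
import csv
import os
import tempfile


def load_weather(path):
    """Read an exported CSV and return its rows as dicts."""
    # utf-8-sig also accepts files that start with a BOM.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


# Stand-in for weather.csv with placeholder values, not real scraped data.
sample = "date,state,temp,wind\nd1,s1,t1,w1\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weather.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(sample)
    rows = load_weather(path)

print(rows)
```

For the real export, call load_weather("weather.csv") from the directory where the crawl was run.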

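Supplement: field issues when exporting CSV with Scrapy

When exporting with scrapy -o in CSV format, I found that the fields in the output file follow neither the order defined in items.py nor the order in which the spider writes them, which makes the exported data awkward to read. In addition, the items in the exported CSV file are separated by blank lines. The following mainly describes how to address these problems.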
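
The blank rows can also be removed after the export. A minimal post-processing sketch using only the standard library follows; the function name drop_blank_rows is hypothetical, and this is a workaround applied to the finished file rather than a change to Scrapy's exporter:

```python
import csv
import io


def drop_blank_rows(src, dst):
    """Copy CSV rows from src to dst, skipping completely empty rows."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if any(cell.strip() for cell in row):
            writer.writerow(row)


# Placeholder content with blank lines between records.
raw = io.StringIO("date,temp\r\n\r\nd1,t1\r\n\r\nd2,t2\r\n", newline="")
cleaned = io.StringIO(newline="")
drop_blank_rows(raw, cleaned)
print(cleaned.getvalue())
```

With real files, open both the source and the destination with newline="" (as the csv module expects) and pass the two file objects to drop_blank_rows.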

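1. Field order problem:

Create a file named csv_item_exporter.py inside the project's spiders directory with the following content (the file name can be changed, but the location is fixed, because settings.py refers to it by module path)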

from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Read the delimiter and the field order from the project's settings.py
        settings = get_project_settings()
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)

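2) Add the following to settings.py

# Register the custom exporter for the csv format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# Order of the fields in the CSV output (use your own item's field names)
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info',
]
# Field delimiter
CSV_DELIMITER = ','

Once this is configured, running scrapy crawl spider -o spider.csv exports the fields in the specified order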
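
Scrapy also has a built-in FEED_EXPORT_FIELDS setting that fixes the column order without a custom exporter. A sketch for the weather spider in this article, assuming the four keys yielded by parse are the columns you want:

```python
# settings.py fragment: export the CSV columns in this order.
FEED_EXPORT_FIELDS = ["date", "state", "temp", "wind"]
```

With this line in settings.py, the FEED_EXPORTERS and FIELDS_TO_EXPORT entries are only needed if you also want the custom delimiter handling.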