Target URL: https://tianqi.2345.com/today-60038.htm
Install Scrapy (the version used here is Scrapy 2.5):

```
pip install scrapy
```

Create a project at the command line with scrapy startproject name, where name is the project name, for example:

```
scrapy startproject spider_weather
```

Then generate a spider with scrapy genspider spider_name domain, for example:

```
scrapy genspider changshu tianqi.2345.com
```
檢視資料夾
- spider_weather
- spider
- __init__.py
- changshu.py
- __init__.py
- items.py
- middlewares.py
- pipelines.py
- settings.py
- scrapy.cfg
Name | Purpose |
---|---|
scrapy.cfg | Project configuration; it mainly provides base settings for the Scrapy command-line tool (the real crawler settings live in settings.py) |
items.py | Data-storage templates that structure the scraped data, similar to Django's Model |
pipelines.py | Item-processing behavior, e.g. persisting the structured data |
settings.py | Configuration file: recursion depth, concurrency, download delay, etc. |
spiders | Spider directory: the spider files and their crawling rules are created and written here |
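The spider in step 1 below yields plain dicts, so filling in items.py is optional for this tutorial. As a minimal sketch, an Item matching the four fields scraped here could look like this (the WeatherItem name is illustrative, not part of the generated template):

```python
import scrapy


class WeatherItem(scrapy.Item):
    # One field per value scraped from the seven-day forecast.
    date = scrapy.Field()
    state = scrapy.Field()
    temp = scrapy.Field()
    wind = scrapy.Field()
```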
1. Write the scraping logic in the spider file you created under the spiders folder, in this case spiders/changshu.py. The code is shown below:
```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
        # Date, weather condition, temperature, wind level.
        # Parsed with XPath; the syntax is simple and worth a quick look
        # if you have not used it before.
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # Yield one record per day.
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date': date,
                'state': state,
                'temp': temp,
                'wind': wind,
            }
```
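Before running the full crawl, the XPath expressions can be checked interactively with Scrapy's shell (a quick sanity check, assuming the page is reachable and its markup still matches; run it from the project directory so the USER_AGENT setting from step 2 is applied):

```
scrapy shell https://tianqi.2345.com/today-60038.htm
>>> response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
```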
2. Configure settings.py.

Set the user agent:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
```

Disable robots.txt compliance:

```python
ROBOTSTXT_OBEY = False
```

The whole file then looks like this:
```python
# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#     'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
3. Then run the following at the command line:

```
scrapy crawl changshu -o weather.csv
```

Note: this must be run from inside the spider_weather project directory. The general form is scrapy crawl spider_name -o output_file, where -o exports the scraped items to the given file.
4. The result is a weather.csv file with one row per forecast day, holding the date, state, temp, and wind fields.
When exporting with scrapy -o in csv format, the fields in the output file are ordered neither as in items.py nor as written in the spider file, which makes the exported data awkward to read; in addition, blank lines appear between items in the exported csv file. The rest of this article describes how to solve both problems.
1) Create a new file csv_item_exporter.py in the project's spiders directory with the content below (the file name can be changed, but the directory must match the module path registered in settings.py in step 2):
```python
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Read the delimiter and the field order from the project settings.
        settings = get_project_settings()
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)
```
2) Add the following to settings.py:
```python
# Register the custom exporter for csv output
# (replace project_name with your project package, here spider_weather)
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}

# Fix the order of the csv output fields
# (for this project: ['date', 'state', 'temp', 'wind'])
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info',
]

# Set the delimiter
CSV_DELIMITER = ','
```
With that configured, running scrapy crawl spider -o spider.csv produces the fields in the specified order.
At this point you may still find blank lines in the csv file. On Windows this happens because the text stream wrapping the output translates the csv writer's line endings a second time, leaving an empty row between items.
Solution:
Find the CsvItemExporter class in Scrapy's exporters.py (around line 215) and add newline="" to the io.TextIOWrapper(...) call that wraps the output file.
Alternatively, subclass CsvItemExporter and do the rewrapping yourself, as sketched below.
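A minimal sketch of that subclass approach, assuming a Scrapy version whose CsvItemExporter does not already pass newline='' when it wraps the feed file (recent versions, including 2.5, already include this fix, making the rewrap a no-op). The NoBlankLineCsvItemExporter name is illustrative, and extra csv.writer options such as a custom delimiter are omitted for brevity:

```python
import csv
import io

from scrapy.exporters import CsvItemExporter


class NoBlankLineCsvItemExporter(CsvItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        # Detach the parent's text wrapper so it cannot close the file,
        # then re-wrap the raw feed file with newline='' (which stops the
        # second newline translation) and rebuild the csv writer on it.
        raw = self.stream.detach()
        self.stream = io.TextIOWrapper(
            raw,
            line_buffering=False,
            write_through=True,
            encoding=self.encoding,
            newline='',
        )
        self.csv_writer = csv.writer(self.stream)
```

It would be registered through FEED_EXPORTERS in settings.py, just like the exporter above.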
This concludes this article on scraping weather data with Python Scrapy and exporting it to a csv file. For more on scraping and csv export with Scrapy, search it145.com's earlier articles, and please keep supporting it145.com!