
The scrapy-redis distributed crawler: a record of real-world pitfalls

2022-08-05 14:00:10

I. Installing Redis

The installation was on a CentOS system, and a server at that, so there were quite a few difficulties along the way.

1. Install the dependencies first

First, check whether a C compiler environment is present. Why do you need it? Because it is a prerequisite for building and installing Redis, the way water is a prerequisite for a fish.

rpm -q gcc

If a version number is printed, gcc is already installed; otherwise run the yum command below to install it.

2. Build Redis

Download the Redis version you want (the 3.0.6 below is just a version number, substitute whichever release you need), then extract and build it:

yum install gcc-c++
cd /usr/local/redis
wget http://download.redis.io/releases/redis-3.0.6.tar.gz
tar zxvf redis-3.0.6.tar.gz
cd redis-3.0.6
make && make install

What, you ask what to do if there is no redis folder? Create it with mkdir!

Be sure to cd into the directory before downloading and building, so the sources stay in one place; after make install, redis-server and redis-cli are on the system PATH and can be started from anywhere:

redis-server   # start the server
redis-cli      # open the command-line client

Does starting the server leave your terminal hanging?

If it does, that is not how it should be!! You have only just installed Redis: the server starts, but it runs in the foreground, you cannot exit, and you never get your prompt back. Don't panic, you simply haven't configured it yet; take it slowly.

Remember the redis folder you just created? Go in there, find redis.conf, and edit that configuration file.

Find these three settings and change them.

  • First, comment out the bind line. If you leave it in, only the local machine can connect, and you presumably did not install Redis just to talk to yourself. With bind commented out, any IP can reach the database. Will that hurt security? You are on a rented server, so the provider's security-group rules still control which addresses can actually reach the port (look up "security group" if the term is new to you).
  • Turn off protected mode (the protected-mode setting) so that remote clients can read and write the database.
  • Turn on background (daemon) mode (daemonize yes) so the server runs in the background instead of trapping your terminal the way it did above.

Save and exit, then restart Redis. With that, Redis is configured. You could also set a password, but I was lazy and didn't bother.
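
To confirm that remote access really works after the restart, here is a minimal sketch using the redis Python client (it assumes pip install redis, the default port 6379, and a placeholder server address, so adjust to your own setup):

import redis

# connect to the freshly configured server; host and port are placeholders
r = redis.Redis(host='your.server.ip', port=6379, db=0)
r.set('ping', 'pong')   # write a test key
print(r.get('ping'))    # b'pong' means remote reads and writes both work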

At this point the database is set up successfully.

II. Problems that came up with the Scrapy framework

1. AttributeError: 'TaocheSpider' object has no attribute 'make_requests_from_url'. Cause:

Newer versions of Scrapy have dropped this method, but it has not disappeared everywhere: scrapy-redis still calls make_requests_from_url when it turns URLs pulled from Redis into requests, so the two now conflict.

Fix

Re-implement the method yourself in the spider class that the error names, i.e. add the following to it:

def make_requests_from_url(self, url):
    return scrapy.Request(url, dont_filter=True)
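
For context, a sketch of where that override lives; the class matches the spider shown later in this article, and defining the method on it is what makes the AttributeError go away:

import scrapy
from scrapy_redis.spiders import RedisCrawlSpider

class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    redis_key = 'taoche'

    # scrapy-redis looks this method up on the spider when it builds requests
    # from urls popped off the redis queue
    def make_requests_from_url(self, url):
        return scrapy.Request(url, dont_filter=True)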

2. ValueError: unsupported format character ':' (0x3a) at index 9. Problem:

I had turned on the redis pipeline so the data would be stored in Redis, but every write failed with this error.

Cause:

I had overridden the item-key setting in settings.py, but got the format string wrong: '%(spider)' is missing its trailing 's', so when the pipeline performs the %-substitution Python reads the ':' at index 9 as the format character and raises the error above. For a long time I was convinced the bug was in the scrapy-redis source.

# item key setting (my broken version)
REDIS_ITEMS_KEY = '%(spider):items'

The scrapy-redis source that consumes this setting is:

return self.key % {"spider": spider.name}
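
The fix is simply to give the placeholder its conversion type, i.e. '%(spider)s'; this matches the default key pattern scrapy-redis itself uses:

# item key setting (corrected): note the trailing "s" after the placeholder
REDIS_ITEMS_KEY = '%(spider)s:items'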

What a pit. Because of this one error I nearly rewrote an entire scrapy framework…

Note! If you are convinced your main code is completely fine, then the problem must be in the configuration file: wrong capitalization, a mistyped setting name, and so on.

III. The working Scrapy source code

1. The items.py file

import scrapy
class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    lazyimg = scrapy.Field()
    title = scrapy.Field()
    resisted_data = scrapy.Field()
    mileage = scrapy.Field()
    city = scrapy.Field()
    price = scrapy.Field()
    sail_price = scrapy.Field()

2. The settings.py file

# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

# Obey robots.txt rules
# LOG_LEVEL = "WARNING"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

LOG_LEVEL = 'WARNING'
LOG_FILE = './log.log'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Item pipeline: supplied ready-made by the scrapy-redis component
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

# Redis connection settings
REDIS_HOST = '' # address of the redis server; here it is the loopback address of the virtual machine
REDIS_PORT = # the port that VirtualBox forwards to redis

# Dedup container class: request fingerprints are stored in a redis set, which makes deduplication persistent
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Use the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# Scheduler persistence: whether to keep the request queue and the fingerprint set in redis when the crawl ends; set to True to persist them
SCHEDULER_PERSIST = True

# Maximum idle time, to keep the spider from shutting down in the middle of a distributed crawl.
# This only takes effect when the queue class is SpiderQueue or SpiderStack;
# it also makes the spider block for a while right after startup (when the queue is still empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
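
With these settings, RedisPipeline pushes every scraped item onto a redis list named after the spider. A minimal sketch of reading them back out, assuming the default '%(spider)s:items' key pattern and the default JSON serialization (connection details are placeholders):

import json
import redis

r = redis.Redis(host='your.server.ip', port=6379)  # placeholder host and port
for raw in r.lrange('taoche:items', 0, -1):        # items live under '<spider name>:items'
    print(json.loads(raw))                         # each entry is one JSON-encoded item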

3. The taoche.py file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from ..items import MyspiderItem
import logging
log = logging.getLogger(__name__)
class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    # allowed_domains = ['taoche.com'] # no domain restriction
    # start_urls = ['http://taoche.com/'] # the start url should come from redis (the shared scheduler)

    redis_key = 'taoche' # fetch the entries stored in redis (the shared scheduler) under the key 'taoche': taoche:[]
    rules = (
        # LinkExtractor: a link extractor that pulls urls out of the page according to the regex rule
        # callback: requests are sent for the extracted urls, and the response objects are handed to the named method
        # follow: whether the pages fetched from those urls are run through the rules again to extract more urls
        Rule(LinkExtractor(allow=r'/?page=\d+?'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        print("開始解析資料")
        car_list = response.xpath('//div[@id="container_base"]/ul/li')
        for car in car_list:

            lazyimg = car.xpath('./div[1]/div/a/img/@src').extract_first()
            title = car.xpath('./div[2]/a/span/text()').extract_first()
            resisted_data = car.xpath('./div[2]/p/i[1]/text()').extract_first()
            mileage = car.xpath('./div[2]/p/i[2]/text()').extract_first()
            city = car.xpath('./div[2]/p/i[3]/text()').extract_first()
            city = city.replace('\n', '')
            city = city.strip()
            price = car.xpath('./div[2]/div[1]/i[1]/text()').extract_first()
            sail_price = car.xpath('./div[2]/div[1]/i[2]/text()').extract_first()

            item = MyspiderItem()
            item['lazyimg'] = lazyimg
            item['title'] = title
            item['resisted_data'] = resisted_data
            item['mileage'] = mileage
            item['city'] = city
            item['price'] = price
            item['sail_price'] = sail_price
            log.warning(item)
            # scrapy.Request(url=function,dont_filter=True)
            yield item
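
One step the listing assumes but never shows: a RedisCrawlSpider sits idle until a start url appears in redis under its redis_key. A minimal sketch of seeding it, with placeholder connection details and the url that is commented out in the spider above:

import redis

r = redis.Redis(host='your.server.ip', port=6379)  # placeholder host and port
r.lpush('taoche', 'http://taoche.com/')            # push the start url onto the 'taoche' key

Once the key is seeded, every machine running scrapy crawl taoche pulls requests from the same redis queue, which is what makes the crawl distributed.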

4. The remaining files

  • The middlewares were not used, so none were written
  • The item pipeline is the one from scrapy_redis, so there is no need to write one either

Summary

That is all for this article on the pitfalls of the scrapy-redis distributed crawler. For more on scrapy-redis pitfalls, search it145.com's earlier articles or keep browsing the related articles below. Please keep supporting it145.com!

