How to Build Your Own IP Proxy Pool with Python, All in One Article

2022-04-14 19:01:18

Development environment

Python 3.8

PyCharm

Modules used

requests >>> pip install requests

parsel >>> pip install parsel

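How to install third-party Python modules

Press Win + R, type cmd and click OK, then type the install command pip install <module name> (for example pip install requests) and press Enter

In PyCharm, click Terminal and type the install command there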
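
To confirm that the install worked before running the scraper, a quick check from Python can help. The sketch below is only an illustration: missing_modules is a hypothetical helper name, and it relies on the standard library's importlib.util.find_spec to see whether each top-level module can be located.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # requests and parsel are the two third-party modules this article uses
    missing = missing_modules(["requests", "parsel"])
    if missing:
        print("Still need: pip install " + " ".join(missing))
    else:
        print("All modules are installed")
```

If a name is reported as missing, run the pip install command for it again in the same interpreter that PyCharm is configured to use.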

How to configure the Python interpreter in PyCharm

Choose File >>> Settings >>> Project >>> Python Interpreter

Click the gear icon and choose Add

Add the path of your Python installation

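How to install plugins in PyCharm

Choose File >>> Settings >>> Plugins

Click Marketplace and type the name of the plugin you want, for example translation for a translation plugin or Chinese for the Chinese language pack

Select the plugin and click Install

After a successful install PyCharm offers to restart; click OK, and the plugin takes effect after the restart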

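Proxy IP structure

# ip and port are placeholder strings, e.g. "110.189.152.86" and "40698"
proxies_dict = {
    "http": "http://" + ip + ":" + port,
    "https": "http://" + ip + ":" + port,
}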
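
The structure above can be wrapped in a small function so the same dict is built the same way everywhere. This is a minimal sketch, not part of the original code: build_proxies is a hypothetical name, and the address in the usage line is the example address that appears at the end of this article.

```python
def build_proxies(ip, port):
    """Build a requests-style proxies dict from an IP and a port."""
    proxy = f"{ip}:{port}"  # port may be an int or a str
    # Both schemes point at the same HTTP proxy, as in the structure above
    return {
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    }


if __name__ == "__main__":
    print(build_proxies("110.189.152.86", 40698))
```

The returned dict is in the form that the proxies argument of requests.get expects.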

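Approach

1. Data source analysis

Find out where the data we want comes from

2. Implementation steps

Send the request: send a request to the target URL

Get the data: read the response data returned by the server (the page source)

Parse the data: extract the content we want

Save the data: when scraping music or videos this would mean local files, CSV, a database… Here it means the IP check: test whether each proxy IP works and keep the ones that do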
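
The parse step can be tried offline before sending any request. The sketch below is an illustration under stated assumptions: SAMPLE_HTML is a made-up fragment modeled on the data-title attributes that the regular expressions in this article's code look for, and extract_proxies is a hypothetical helper name. The real page may be laid out differently.

```python
import re

# Hypothetical fragment, modeled on the data-title attributes used in the regex
SAMPLE_HTML = """
<table><tbody>
<tr><td data-title="IP">110.189.152.86</td><td data-title="PORT">40698</td></tr>
<tr><td data-title="IP">10.0.0.1</td><td data-title="PORT">8080</td></tr>
</tbody></table>
"""


def extract_proxies(html):
    """Return 'ip:port' strings found in the HTML text."""
    ip_list = re.findall(r'<td data-title="IP">(.*?)</td>', html, re.S)
    port_list = re.findall(r'<td data-title="PORT">(.*?)</td>', html, re.S)
    # zip pairs the first IP with the first port, and so on
    return [ip.strip() + ":" + port.strip() for ip, port in zip(ip_list, port_list)]


if __name__ == "__main__":
    print(extract_proxies(SAMPLE_HTML))
```

Testing the extraction on a fixed string first makes it easier to tell a parsing mistake apart from a failed request.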

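from: names the module to import from
import: names what to import
Together: from which module, import which name
from xxx import *  # import every name the module exports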
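
As a small illustration of the two import forms, here is a sketch that uses the standard library's math module (chosen only as an example; it is not used elsewhere in this article).

```python
import math            # import the module; names are used as math.sqrt
from math import sqrt  # import one name; it is used as plain sqrt

print(math.sqrt(16))   # 4.0
print(sqrt(16))        # 4.0

# from math import * would bring in every public name (sqrt, pi, floor, ...),
# which is convenient but makes it harder to see where a name came from.
```

The code in this article uses the plain import form, so calls are written as requests.get and parsel.Selector.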

Code

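# Import the HTTP request module
import requests  # third-party HTTP request module: pip install requests
# Import the regular expression module
import re  # built-in module
# Import the HTML parsing module
import parsel  # third-party parsing module: pip install parsel (the selector library used by the Scrapy framework)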


lis = []    # every proxy scraped from the pages
lis_1 = []  # proxies that passed the check

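# 1. Send a request to the target URL https://www.kuaidaili.com/free/
for page in range(11, 21):  # pages 11 through 20
    url = f'https://www.kuaidaili.com/free/inha/{page}/'  # URL of the page to request
    """
    headers: request headers, used to make the Python script look like a browser
    (this example does not set any)
    """
    # Call requests.get on the URL and keep the returned data in the response variable
    response = requests.get(url)
    # <Response [200]>  the call returns a Response object; status code 200 means the request succeeded
    # 2. Get the data: response.text is the response body as text (the page source)
    # print(response.text)
    # 3. Parse the data and extract the content we want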
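    """
    Ways to parse the data:
        regex: extracts content directly from the string
    The next two need the downloaded HTML string to be converted first:
        xpath: extracts content by tag node
        css selectors: extract content by tag attributes

        Use whichever is most convenient, or whichever you prefer
    """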
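    # Extracting the data with regular expressions
    """
    # re.findall() is a function of the re module
    # When in doubt, use .*? which matches any characters (except the newline character, unless re.S is passed)
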
    ip_list = re.findall('<td data-title="IP">(.*?)</td>', response.text, re.S)
    port_list = re.findall('<td data-title="PORT">(.*?)</td>', response.text, re.S)
    print(ip_list)
    print(port_list)
    """
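    # CSS selectors:
    """
    # CSS selectors need the downloaded HTML string (response.text) to be converted first
    # Not familiar with CSS or XPath? Selectors like the two below can be copied from the browser's developer tools
    # #list > table > tbody > tr > td:nth-child(1)
    # //*[@id="list"]/table/tbody/tr/td[1]
    selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object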
    ip_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
    port_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()
    print(ip_list)
    print(port_list)
    """
    # Extracting the data with XPath
    selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
    ip_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[1]/text()').getall()
    port_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[2]/text()').getall()
    # print(ip_list)
    # print(port_list)
    for ip, port in zip(ip_list, port_list):
        # print(ip, port)
        proxy = ip + ':' + port
        proxies_dict = {
            "http": "http://" + proxy,
            "https": "http://" + proxy,
        }
        # print(proxies_dict)
        lis.append(proxies_dict)
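        # 4. Check the proxy: request the page through it with a 1-second timeout
        try:
            # a separate name keeps the page response above from being overwritten
            check = requests.get(url=url, proxies=proxies_dict, timeout=1)
            if check.status_code == 200:
                print('Current proxy IP: ', proxies_dict, 'is usable')
                lis_1.append(proxies_dict)
        except requests.exceptions.RequestException:
            # covers timeouts, refused connections and other request errors
            print('Current proxy IP: ', proxies_dict, 'request failed or timed out, check failed')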



print('Number of proxy IPs scraped: ', len(lis))
print('Number of usable proxy IPs: ', len(lis_1))
print('Usable proxy IPs: ', lis_1)

dit = {
    'http': 'http://110.189.152.86:40698',
    'https': 'http://110.189.152.86:40698'
}
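
The script above prints the usable proxies but does not write them anywhere, so the pool is lost when the program exits. One possible way to keep it is sketched below, as an assumption rather than part of the original code: save_pool, load_pool and pick_proxy are hypothetical helper names, proxy_pool.json is a hypothetical file name, and JSON is just one convenient format for a list of dicts.

```python
import json
import random


def save_pool(pool, path):
    """Write the list of proxy dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pool, f, indent=2)


def load_pool(path):
    """Read the list of proxy dicts back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pick_proxy(pool):
    """Return one random proxy dict, or None if the pool is empty."""
    return random.choice(pool) if pool else None


if __name__ == "__main__":
    # Stand-in for lis_1; the address is the example dict from the article
    pool = [{
        "http": "http://110.189.152.86:40698",
        "https": "http://110.189.152.86:40698",
    }]
    save_pool(pool, "proxy_pool.json")  # hypothetical file name
    print(pick_proxy(load_pool("proxy_pool.json")))
```

A dict picked this way can be passed as the proxies argument of requests.get, in the same way the check step does. Free proxies tend to stop working quickly, so it is worth re-checking a loaded pool before relying on it.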