A Detailed Beginner's Tutorial on Python urllib

2022-11-17 14:00:35

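I. Introduction

urllib is Python's built-in HTTP request library. It is part of the standard library, so it can be used without installing anything. It contains four modules:

`request`: the request module, which provides basic `HTTP` request handling.
`parse`: the utility module, which provides many functions for working with a `url`: splitting, parsing, joining, and so on.
`error`: the exception module. When a request fails, you can catch these exceptions so that the program does not terminate unexpectedly.
`robotparser`: parses a site's `robots.txt` file to decide which parts of the site may be crawled. It is used less often.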
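
The two helper modules can be tried without any network access. The following is a minimal sketch, not a complete crawler: the example.com URLs and the robots.txt rules are made up for illustration, and a real program would usually load the rules from the site with `set_url()` and `read()` instead of `parse()`.

```python
from urllib.parse import urlsplit, urljoin, parse_qs
from urllib.robotparser import RobotFileParser

# Split a URL into its components (hypothetical example URL)
parts = urlsplit('https://example.com/s?wd=123#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Parse the query string into a dict whose values are lists
query = parse_qs(parts.query)
print(query)

# Join a base URL with a relative reference
joined = urljoin('https://example.com/docs/index.html', 'api.html')
print(joined)

# Feed robots.txt rules to the parser as a list of lines (made-up rules)
rules = [
    'User-agent: *',
    'Disallow: /private/',
]
robots = RobotFileParser()
robots.parse(rules)

allowed_public = robots.can_fetch('*', 'https://example.com/index.html')
allowed_private = robots.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```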

II. The request module

1. urlopen: opens the given URL; calling read() on the result returns the page's HTML source.

# Use urllib
import urllib.request

# 1. Define a URL
url = 'http://www.baidu.com'

# 2. Send a request to the server, as a browser would
response = urllib.request.urlopen(url)

# 3. Get the page source from the response (note: read() returns bytes, so the printed value is wrapped in b'xxx')
content = response.read()

# 4. Print the bytes in content
print(content)
# Output: b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

# 5. Convert the bytes to a string. This needs the page's encoding (for example <meta http-equiv="Content-Type" content="text/html;charset=utf-8">); the value after charset= is the encoding, here utf-8
content = content.decode('utf-8')

# 6. Print the string in content
print(content)
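
Hard-coding 'utf-8' only works when the page really uses that encoding. As a minimal sketch, assuming the server declares a charset in its Content-Type header, the encoding can be read with `headers.get_content_charset()`, with UTF-8 as a fallback. The helper name `decode_body` is hypothetical, and a `data:` URL stands in for a real site so that the example runs offline.

```python
import urllib.request
from urllib.parse import quote


def decode_body(response, default='utf-8'):
    """Decode the response body using the charset from Content-Type, if any."""
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset)


# A data: URL stands in for a real page so the sketch runs without a network
html = '<html><body>你好, urllib</body></html>'
url = 'data:text/html;charset=utf-8,' + quote(html)

with urllib.request.urlopen(url) as response:
    declared = response.headers.get_content_charset()
    text = decode_body(response)

print(declared)
print(text)
```

If the header carries no charset, the `<meta>` tag mentioned above is the next place to look.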

2. response: the response object, of type HTTPResponse

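# Use urllib
import urllib.request

# 1. Define a URL
url = 'http://www.baidu.com'

# 2. Send a request to the server, as a browser would
response = urllib.request.urlopen(url)

# response is an http.client.HTTPResponse
print(type(response))

# read() with no argument reads the whole body as bytes
content = response.read()
print(content)

# read(n) reads at most n bytes
# Note: the body is a stream that can be consumed only once, so after the read() above
# this call and the two below return empty data; try each one on a fresh response
content = response.read(50)
print(content)

# Read one line
content = response.readline()
print(content)

# Read all remaining lines
content = response.readlines()
print(content)

# Get the status code
print(response.getcode())

# Get the URL that was actually fetched
print(response.geturl())

# Get the headers
print(response.getheaders())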
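
Because the body is a stream, every byte can be read only once. The sketch below reads in pieces and shows that a final `read()` returns empty bytes. It uses a `data:` URL as a stand-in so that it runs offline; for such URLs `urlopen` returns a file-like wrapper rather than `http.client.HTTPResponse`, but `read`, `readline`, `readlines`, and `geturl` behave the same way.

```python
import urllib.request

# Stand-in for a real page: three lines of text in a data: URL (%0A is a newline)
url = 'data:text/plain;charset=utf-8,line1%0Aline2%0Aline3'

with urllib.request.urlopen(url) as response:
    first_bytes = response.read(3)      # at most 3 bytes
    rest_of_line = response.readline()  # continues from the current position
    remaining = response.readlines()    # every line that is still unread
    leftover = response.read()          # nothing left: the stream is consumed
    fetched_url = response.geturl()

print(first_bytes, rest_of_line, remaining, leftover)
```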

3. Request: a custom request object

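# Use urllib
import urllib.request

# Parts of a URL
# https://www.baidu.com/s?wd=123

# Scheme       Host               Port    Path   Query   Fragment
# http/https   www.baidu.com      80/443  s      wd      #

# Default ports of some common services
# http                            80
# https                           443
# mysql                           3306
# oracle                          1521
# redis                           6379
# mongodb                         27017

# 1. Define an https URL
url = 'https://www.baidu.com'

# 2. Send a request to the server, as a browser would
response = urllib.request.urlopen(url)

# 3. Get the content as a string
content = response.read().decode('utf-8')

# 4. The data returned this way turns out to be incomplete. This is one form of anti-scraping: the server received too little identifying information, for example because request headers are missing.
print(content)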

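# Fix:

# Define the headers
headers = {
  # User-Agent: the most basic anti-scraping check
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# 1. Define an https URL
url = 'https://www.baidu.com'

# 2. Create a Request object; urlopen() itself has no parameter for headers.
# Detail: why use keyword arguments here when other calls do not? Because of the constructor's parameter order, Request(url, data=None, headers={} ...): the second positional parameter is data, so headers has to be passed by keyword.
request = urllib.request.Request(url=url, headers=headers)

# 3. Send the request to the server, as a browser would
response = urllib.request.urlopen(request)

# 4. Get the content as a string
content = response.read().decode('utf-8')

# 5. Print it
print(content)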
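
To see why the first attempt is easy for a server to flag, it helps to look at what urllib sends by default. The following offline sketch inspects the default opener headers and a `Request` object without sending anything. The shortened User-Agent string is a placeholder.

```python
import urllib.request

# Without a custom header, urllib identifies itself as "Python-urllib/<version>"
default_headers = urllib.request.build_opener().addheaders
print(default_headers)

# Placeholder browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (placeholder)'}

# headers must be passed by keyword: the second positional parameter is data
request = urllib.request.Request('https://www.baidu.com', headers=headers)

# Request stores header names capitalized, so look it up as 'User-agent'
user_agent = request.get_header('User-agent')
print(user_agent)
print(request.get_method())   # GET, because no data was supplied
print(request.full_url)

# Headers can also be added after construction
request.add_header('Accept-Language', 'en')
print(request.header_items())
```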

4. urlretrieve: downloading (for example images, videos, page source, ...)

# Use urllib
import urllib.request

# Download a web page
url = 'http://www.baidu.com'

# Argument 1: the page URL; argument 2: the file name (or path plus name, e.g. ./test/baidu.html or baidu.html; without a path the file is saved in the current directory)
urllib.request.urlretrieve(url, 'baidu.html')


# Download an image
url = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fpic1.win4000.com%2Fwallpaper%2F8%2F55402f62682e3.jpg&refer=http%3A%2F%2Fpic1.win4000.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1670904201&t=2dc001fbd959432efe8b8ee0792589ba'

# Argument 1: the image URL; argument 2: the file name (or path plus name, e.g. ./test/dzm.jpg or dzm.jpg; without a path the file is saved in the current directory)
urllib.request.urlretrieve(url, 'dzm.jpg')
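
`urlretrieve` takes a URL string rather than a `Request`, so it does not directly support per-request headers, and the Python documentation lists it under the legacy interface. One possible alternative, sketched here with a hypothetical `download` helper, streams the response into a file with `shutil.copyfileobj`. A `data:` URL and a temporary directory are used so that the example runs offline.

```python
import os
import shutil
import tempfile
import urllib.request


def download(url, dest, headers=None):
    """Stream the body of url into the file dest and return dest."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest


# Stand-in for a real download URL so the sketch runs offline
url = 'data:text/plain;charset=utf-8,hello%20urllib'

with tempfile.TemporaryDirectory() as folder:
    path = download(url, os.path.join(folder, 'hello.txt'),
                    headers={'User-Agent': 'Mozilla/5.0 (placeholder)'})
    with open(path, 'rb') as f:
        saved = f.read()

print(saved)
```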

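III. The parse module

1. quote: percent-encoding a (GET) parameter

quote percent-encodes a parameter value (as UTF-8 by default). Each value has to be converted on its own and then concatenated, which becomes tedious when there are several parameters.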

# Use urllib
import urllib.request
import urllib.parse

# Define the headers
headers = {
  # User-Agent: the most basic anti-scraping check
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# 1. Define an https URL

# Writing the Chinese text directly raises an error (UnicodeEncodeError: the 'ascii' codec cannot encode it)
# url = 'https://www.baidu.com/s?wd=卡爾特斯CSDN'

# So `卡爾特斯CSDN` has to be percent-encoded, like this:
# url = 'https://www.baidu.com/s?wd=%E5%8D%A1%E7%88%BE%E7%89%B9%E6%96%AFCSDN'

# Prepare the base URL (do not pass the whole URL to quote) (GET)
url = 'https://www.baidu.com/s?wd='

# Convert the value with urllib.parse.quote()
wd = urllib.parse.quote('卡爾特斯CSDN')
# print(wd) # %E5%8D%A1%E7%88%BE%E7%89%B9%E6%96%AFCSDN

# Concatenate
url = url + wd

# 2. Create a Request object; urlopen() itself has no parameter for headers.
# Detail: why use keyword arguments here when other calls do not? Because of the constructor's parameter order, Request(url, data=None, headers={} ...): the second positional parameter is data, so headers has to be passed by keyword.
request = urllib.request.Request(url=url, headers=headers)

# 3. Send the request to the server, as a browser would
response = urllib.request.urlopen(request)

# 4. Get the content as a string
content = response.read().decode('utf-8')

# 5. Print it
print(content)
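
A few offline calls make the behaviour of `quote` concrete: which characters it encodes, why quoting a whole URL breaks it, and how `unquote` reverses the encoding. The short sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote

# Reserved characters and spaces in a value are percent-encoded
encoded = quote('a b&c')
print(encoded)            # a%20b%26c

# quote_plus uses '+' for spaces, the convention for form data
encoded_plus = quote_plus('a b&c')
print(encoded_plus)       # a+b%26c

# Quoting an entire URL also encodes ':', '?' and '=', which breaks it
broken = quote('https://x/?a=1')
print(broken)             # https%3A//x/%3Fa%3D1

# Non-ASCII text is encoded as UTF-8 bytes; unquote restores it
wd = quote('卡爾特斯CSDN')
restored = unquote(wd)
print(wd, restored)
```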

2. urlencode: percent-encoding (GET) parameters

urlencode percent-encodes several parameters in one call.

# Use urllib
import urllib.request
import urllib.parse

# Define the headers
headers = {
  # User-Agent: the most basic anti-scraping check
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# 1. Define an https URL

# Writing the Chinese text directly raises an error (UnicodeEncodeError: the 'ascii' codec cannot encode it)
# url = 'https://www.baidu.com/s?wd=卡爾特斯CSDN&sex=男'

# So `卡爾特斯CSDN` and `男` have to be percent-encoded, like this:
# url = 'https://www.baidu.com/s?wd=%E5%8D%A1%E7%88%BE%E7%89%B9%E6%96%AFCSDN&sex=%E7%94%B7'

# Prepare the base URL (only the parameters are encoded, not the whole URL) (GET)
url = 'https://www.baidu.com/s?'

# Parameters
params = {
  'wd': '卡爾特斯CSDN',
  'sex': '男'
}

# Convert with urllib.parse.urlencode() (multiple parameters)
# The result is named query rather than str, so the built-in str is not shadowed
query = urllib.parse.urlencode(params)
# print(query) # wd=%E5%8D%A1%E7%88%BE%E7%89%B9%E6%96%AFCSDN&sex=%E7%94%B7

# Convert with urllib.parse.quote() (single parameter)
# wd = urllib.parse.quote('卡爾特斯CSDN')
# print(wd) # %E5%8D%A1%E7%88%BE%E7%89%B9%E6%96%AFCSDN

# Concatenate
url = url + query

# 2. Create a Request object; urlopen() itself has no parameter for headers.
# Detail: why use keyword arguments here when other calls do not? Because of the constructor's parameter order, Request(url, data=None, headers={} ...): the second positional parameter is data, so headers has to be passed by keyword.
request = urllib.request.Request(url=url, headers=headers)

# 3. Send the request to the server, as a browser would
response = urllib.request.urlopen(request)

# 4. Get the content as a string
content = response.read().decode('utf-8')

# 5. Print it
print(content)
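
The encoding step can be checked on its own, without sending a request. This small sketch uses made-up parameter values; it shows the string that `urlencode` produces, how `parse_qs` reverses it, and how the `doseq=True` option handles a list value.

```python
from urllib.parse import urlencode, parse_qs

params = {'wd': 'hello world', 'sex': '男'}

# urlencode joins key=value pairs with '&' and encodes spaces as '+'
query = urlencode(params)
print(query)              # wd=hello+world&sex=%E7%94%B7

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse_qs(query)
print(decoded)

# doseq=True expands a list value into repeated keys
multi = urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)              # tag=a&tag=b

# The query string is appended after the '?' of the base URL
url = 'https://www.baidu.com/s?' + query
print(url)
```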

3. urlencode: encoding (POST) parameters

# Use urllib
import urllib.request
import urllib.parse
# Use json
import json

# Define the headers
headers = {
  # User-Agent: the most basic anti-scraping check
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Request URL (POST)
url = 'https://fanyi.baidu.com/sug'

# Parameters
params = {
  'kw': '名稱'
}

# For a POST request the parameters are not appended to the URL; they go in the data argument of the request object

# Convert with urllib.parse.urlencode() (multiple parameters)
# query = urllib.parse.urlencode(params)
# Using the resulting string directly raises: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
# request = urllib.request.Request(url=url, data=query, headers=headers)

# The string fails because POST data must be bytes, so encode it with an explicit encoding
data = urllib.parse.urlencode(params).encode('utf-8')
# Build the request object
request = urllib.request.Request(url=url, data=data, headers=headers)

# Send the request to the server, as a browser would
response = urllib.request.urlopen(request)

# Get the content as a string
content = response.read().decode('utf-8')

# Parse the JSON string into a Python object
obj = json.loads(content)

# Print the parsed object
print(obj)
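
How the request is built can be checked without contacting the server. In this sketch a hypothetical helper, `build_form_request`, creates the POST request and the following lines inspect it; supplying `data` is what switches the method from GET to POST.

```python
import urllib.request
from urllib.parse import urlencode, parse_qs


def build_form_request(url, fields, headers=None):
    """Return a POST Request whose body is the form-encoded fields as bytes."""
    body = urlencode(fields).encode('utf-8')
    return urllib.request.Request(url, data=body, headers=headers or {})


request = build_form_request('https://fanyi.baidu.com/sug', {'kw': '名稱'})

print(request.get_method())     # POST, because data is present
print(type(request.data))       # <class 'bytes'>

# The body decodes back to the original field
fields = parse_qs(request.data.decode('utf-8'))
print(fields)

# The same URL without data would be sent as GET
plain = urllib.request.Request('https://fanyi.baidu.com/sug')
print(plain.get_method())       # GET
```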

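IV. The error module (URLError and HTTPError)

1. HTTPError is a subclass of URLError.

2. The classes are imported as urllib.error.URLError and urllib.error.HTTPError.

3. A request sent with urllib may fail, and the failure can be caught with try-except. There are two kinds of exception, URLError and HTTPError. Because HTTPError is the subclass, its except clause has to come first.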

# Use urllib
import urllib.request
import urllib.parse
import urllib.error
# Use json
import json

# Define the headers
headers = {
  # User-Agent: the most basic anti-scraping check
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Request URL (POST)
url = 'https://fanyi.baidu.com/sug'

# Parameters
params = {
  'kw': '名稱'
}

# POST data must be bytes: urlencode the parameters, then encode them with an explicit encoding
data = urllib.parse.urlencode(params).encode('utf-8')
# Build the request object
request = urllib.request.Request(url=url, data=data, headers=headers)

try:
    # Send the request to the server, as a browser would
    response = urllib.request.urlopen(request)

    # Get the content as a string
    content = response.read().decode('utf-8')

    # Parse the JSON string into a Python object and print it
    obj = json.loads(content)
    print(obj)
except urllib.error.HTTPError as e:
    # The server answered with an error status (e.g. 404 or 500)
    print('HTTPError:', e.code, e.reason)
except urllib.error.URLError as e:
    # No valid response was received (e.g. DNS failure, connection refused)
    print('URLError:', e.reason)
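
Both exception types can be exercised without a network. In this sketch a missing local file triggers a `URLError`, and an `HTTPError` is constructed by hand purely for demonstration (normally `urlopen` raises it). The helper `describe` and the example.com URL are hypothetical.

```python
import urllib.request
import urllib.error
from email.message import Message


def describe(exc):
    """Return a short description; HTTPError is checked first (it is the subclass)."""
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return f'other: {exc}'


# A file: URL that does not exist fails before any HTTP exchange takes place
try:
    urllib.request.urlopen('file:///no/such/file/for-this-demo.txt')
    url_result = 'no error'
except urllib.error.URLError as exc:
    url_result = describe(exc)

# Hand-built HTTPError, for demonstration only
fake = urllib.error.HTTPError('https://example.com/missing', 404,
                              'Not Found', Message(), None)
http_result = describe(fake)

is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)

print(url_result)
print(http_result)
print(is_subclass)
```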

V. Handler (IP proxy)
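
A handler-based setup with an IP proxy could look like the following minimal sketch. The proxy address 127.0.0.1:8080 and the User-Agent string are placeholders, not working values. `ProxyHandler` takes a mapping from scheme to proxy URL, `build_opener` wraps it in an opener, and `opener.open()` then takes the place of `urlopen()`. Nothing is sent here, so the block runs offline.

```python
import urllib.request

# Placeholder proxy address; replace it with a proxy you are allowed to use
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

# 1. Create the handler, 2. build an opener from it
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# Headers can be set once on the opener instead of on every Request
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (placeholder)')]

# 3. Requests would then go through the proxy:
#    response = opener.open('http://www.baidu.com')
#    content = response.read().decode('utf-8')

uses_proxy = any(isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers)
print(handler.proxies, uses_proxy)
```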

VI. Using XPath
