Python Skill Tree Series: The Python urllib Module

2022-05-22 19:00:17

1. What Is the Python urllib Module

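urllib is a package in the Python standard library for fetching URL resources over the network, and it is one of the first modules to learn when getting started with web scraping.

That said, most scraping engineers now start with the third-party requests library instead.

In Python 3, urllib is made up of the following modules.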

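urllib.request: the request module, used to open and read URLs;
urllib.error: the exception module, containing the exceptions raised by urllib.request;
urllib.parse: URL parsing, used in scrapers to take URLs apart and build them;
urllib.robotparser: parses robots.txt files to determine which parts of a target site may be crawled and which may not, although it is rarely used in practice.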
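Since urllib.robotparser gets so little attention, here is a minimal sketch of how it can be used. The robots.txt rules and the example.net URLs below are made up for illustration, and the rules are fed to parse() directly so the sketch runs without network access; against a real site you would normally call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no HTTP request is made here.
parser.parse(ROBOTS_TXT.splitlines())

home_allowed = parser.can_fetch("*", "https://www.example.net/index.html")
private_allowed = parser.can_fetch("*", "https://www.example.net/private/data.html")

print(home_allowed)     # True
print(private_allowed)  # False
```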

2. Usage

A First Example

Open a test site and print the content of the server's response.

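from urllib.request import urlopen

# The with-block closes the connection once the body has been read
with urlopen('https://www.example.net') as html:
    page = html.read()  # the whole response body, as bytes
print(page)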

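The code above uses the urllib.request module, which defines the functions and classes for opening URLs, along with support for authentication, redirects, cookies, and so on.

The urlopen() function used in the code opens a URL. Its signature is as follows: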

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None)

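The parameters are described below:

url: the address to request, either a string or a Request object;
data: additional data to send to the server, None by default; when it is provided, an HTTP request is sent as POST;
timeout: the timeout in seconds for blocking operations such as the connection attempt;
cafile and capath: cafile is a file of CA certificates and capath is a directory of CA certificate files, used to verify HTTPS connections; both have been deprecated since Python 3.6 and were removed in Python 3.13 in favor of context;
context: an ssl.SSLContext instance that specifies the SSL settings.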
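To make the timeout and context parameters concrete, here is a small sketch. The helper name fetch_bytes and the 10 second default are assumptions for illustration, not part of urllib. A data: URL is used so the sketch runs offline; an https:// URL would be passed in exactly the same way.

```python
import ssl
from urllib.request import urlopen


def fetch_bytes(url, timeout=10.0):
    """Return the response body of `url` as bytes (hypothetical helper)."""
    # A default context verifies certificates against the system CA store.
    context = ssl.create_default_context()
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


# %20 is an encoded space inside the data: URL payload.
body = fetch_bytes("data:text/plain;charset=utf-8,hello%20urllib")
print(body)  # b'hello urllib'
```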

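Calling read() on the returned object reads the entire page as bytes.

The other reading methods work just like their file counterparts: readline() and readlines().

You can also call the object's getcode() method to get the HTTP status code.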
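As a quick illustration of the file-like behavior, the sketch below reads a response line by line. The three-line data: URL is an assumption chosen so the example runs without a network connection; an HTTP response is read the same way.

```python
from urllib.request import urlopen

# Three lines of text packed into a data: URL (%0A is an encoded newline).
URL = "data:text/plain;charset=utf-8,first%0Asecond%0Athird"

with urlopen(URL) as response:
    first_line = response.readline()  # up to and including the first newline
    remaining = response.readlines()  # list of the lines that are left

print(first_line)  # b'first\n'
print(remaining)   # [b'second\n', b'third']
```

As with read(), the results are bytes, so call decode() on them when you need text.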

print(html.getcode())  # returns 200
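Note that getcode() is only reached when urlopen() actually returns a response. For error statuses such as 404 or 500, urlopen() raises urllib.error.HTTPError, and for failures where no response arrives at all it raises urllib.error.URLError. A minimal sketch of handling both follows; the helper name describe_fetch and the made-up nosuchscheme:// URL are assumptions for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_fetch(url, timeout=5.0):
    """Return a short description of what happened when opening `url`."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return f"ok: {response.getcode()}"
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return f"http error: {err.code}"
    except URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme...
        return f"url error: {err.reason}"


# An unsupported scheme triggers URLError without touching the network.
outcome = describe_fetch("nosuchscheme://www.example.net/")
print(outcome)
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the URLError branch would swallow it.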

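More Methods on the Object Returned by urlopen()

For HTTP and HTTPS URLs, urlopen() returns an HTTPResponse object (http.client.HTTPResponse). Besides the read() and getcode() methods mentioned above, it offers the following.

getheaders(): returns the response headers as a list of (name, value) tuples;
getheader(name): returns the value of the named response header;
msg: with urlopen(), the reason phrase sent by the server, such as OK;
version: the HTTP protocol version used by the server, 10 for HTTP/1.0 and 11 for HTTP/1.1;
status: the HTTP status code, the same value that getcode() returns.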
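The sketch below exercises these attributes against a throwaway local server, so that it does not depend on any external site. The HelloHandler class and its response are made up for illustration. The request goes through an opener built with an empty ProxyHandler, which is simply a way to keep the loopback request away from any system proxy; opener.open() returns the same kind of object as urlopen().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# An empty ProxyHandler means "use no proxies at all".
opener = build_opener(ProxyHandler({}))
try:
    url = f"http://127.0.0.1:{server.server_port}/"
    with opener.open(url, timeout=5) as response:
        status = response.status
        reason = response.msg
        http_version = response.version  # 10 for HTTP/1.0, 11 for HTTP/1.1
        content_type = response.getheader("Content-Type")
        header_names = [name for name, _ in response.getheaders()]
        body = response.read()
finally:
    server.shutdown()
    server.server_close()

print(status, reason)  # 200 OK
print(content_type)    # text/plain; charset=utf-8
print(header_names)
print(body)            # b'hello'
```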

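The urllib.request.Request Class

Request is an abstraction of a URL request. Using it lets you configure more aspects of the request. Its constructor is as follows: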

def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None)

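Its parameters are described below:

url: the address to request, required;
data: the request body, which must be bytes; form fields can be encoded with urllib.parse.urlencode() and then converted to bytes;
headers: a dictionary of request headers;
origin_req_host: the host of the original request, as an IP address or a domain name;
method: the HTTP request method, such as GET or POST.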

The test code is as follows:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) your UA'
}
# Named form_data so that the built-in dict is not shadowed
form_data = {
    'name': 'xiangpica'
}
# urlencode() returns a str; the request body must be bytes
data = parse.urlencode(form_data).encode('utf-8')
# Create the Request object
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Add another request header
req.add_header('Host', 'httpbin.org')
# Send the request and print the decoded response body
with request.urlopen(req) as response:
    print(response.read().decode('utf-8'))
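A Request object can also be inspected before anything is sent, which is handy for checking what urlopen() is about to do. The sketch below sends nothing over the network; the page field and the demo-agent/0.1 User-Agent value are made up for illustration.

```python
from urllib import parse, request

URL = "http://httpbin.org/post"

# urlencode() returns a str, so it is encoded to bytes before use as a body.
form_bytes = parse.urlencode({"name": "xiangpica", "page": 2}).encode("utf-8")

plain = request.Request(URL)
with_body = request.Request(
    URL, data=form_bytes, headers={"User-Agent": "demo-agent/0.1"}
)
explicit = request.Request(URL, data=form_bytes, method="PUT")

print(form_bytes)              # b'name=xiangpica&page=2'
print(plain.get_method())      # GET
print(with_body.get_method())  # POST
print(explicit.get_method())   # PUT
# Header names are stored capitalized, e.g. 'User-agent'.
print(with_body.get_header("User-agent"))  # demo-agent/0.1
```

When method is left out, the method is inferred from data: GET without a body, POST with one.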

urllib.parse

This module is mainly used to parse URLs. The prototype of its urlparse() function is as follows:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

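The parameters are as follows:

urlstring: the URL to parse;
scheme: the default scheme, used only when the URL itself does not specify one; schemes the module supports include file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet...;
allow_fragments: whether to recognize the fragment part of the URL; when False, the fragment is not split off and stays attached to the preceding component.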
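The effect of scheme and allow_fragments is easiest to see side by side. The following is a small sketch; the example.com URLs are placeholders.

```python
from urllib.parse import urlparse

# The default scheme is applied only when the URL itself has none.
relative = urlparse("//www.example.com/index.html", scheme="https")
absolute = urlparse("http://www.example.com/index.html", scheme="https")

# With allow_fragments=False the '#...' part stays attached to the path.
kept = urlparse("http://www.example.com/index.html#comment", allow_fragments=False)

print(relative.scheme, relative.netloc)  # https www.example.com
print(absolute.scheme)                   # http
print(kept.path, repr(kept.fragment))    # /index.html#comment ''
```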

The standard URL format is as follows:

scheme://netloc/path;params?query#fragment

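Its components are:

scheme: the URL scheme (protocol);
netloc: the network location, that is, the domain name and port;
path: the path;
params: parameters for the last path element, rarely used;
query: the query string;
fragment: the fragment identifier.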

from urllib.parse import urlparse
result = urlparse('http://www.example.com/index.html;info?id=10086#comment')
print(type(result), result)
print(result.scheme, result[0])
print(result.netloc, result[1])
print(result.path, result[2])
print(result.params, result[3])
print(result.query, result[4])
print(result.fragment, result[5])

The output is as follows:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.example.com', path='/index.html', params='info', query='id=10086', fragment='comment')
http http
www.example.com www.example.com
/index.html /index.html
info info
id=10086 id=10086
comment comment
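Building on the parsed result, the sketch below shows a few follow-up steps that often come next in a scraper: reading the host and port, decoding the query string, and producing a modified URL. The port number and the repeated tag field are made up for illustration, and parse_qs() is another urllib.parse function that the example above does not use.

```python
from urllib.parse import parse_qs, urlparse

URL = "http://www.example.com:8080/index.html;info?id=10086&tag=a&tag=b#comment"
result = urlparse(URL)

# hostname and port are split out of netloc for convenience.
print(result.hostname, result.port)  # www.example.com 8080

# parse_qs() turns the query string into a dict of lists (keys can repeat).
query = parse_qs(result.query)
print(query)  # {'id': ['10086'], 'tag': ['a', 'b']}

# _replace() builds a modified copy; geturl() reassembles it into a string.
next_page = result._replace(query="id=10087", fragment="").geturl()
print(next_page)  # http://www.example.com:8080/index.html;info?id=10087
```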