Python's re Regular Expression Module Explained, with a Worked Example

2022-10-02 14:00:43

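I. Introduction to the re module

Python's re module (re stands for Regular Expression) provides a range of regular expression matching operations, similar to the regular expression support in Perl. The tool is built into Python. It cannot handle every complex matching scenario, but in the vast majority of cases it is enough to analyze complex strings effectively and extract the relevant information.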

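II. Basic concepts of regular expressions

Put simply, a regular expression works like this:

You describe the format of the strings you want to match, and every run of characters in a text that fits that format is found.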

1. Regular expression syntax:

1) Special characters:

\, ., ^, $, {}, [], (), | and so on

To match any of these special characters literally, escape it with a backslash (\).
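As a small illustration of escaping, the sketch below contrasts an unescaped dot with an escaped one and uses re.escape() to escape a whole literal string. The sample strings are made up for the example.

```python
import re

text = "Price: 3.50 (approx.)"

# An unescaped dot matches any character, so the pattern also matches "3x50".
loose = re.search(r"3.50", "3x50")

# Escaping the dot with a backslash makes it match a literal "." only.
strict = re.search(r"3\.50", "3x50")
literal_dot = re.search(r"3\.50", text).group()

# re.escape() adds the backslashes for every special character in a string.
escaped = re.escape("(approx.)")
paren_match = re.search(escaped, text).group()

print(loose is not None)  # True
print(strict)             # None
print(literal_dot)        # 3.50
print(paren_match)        # (approx.)
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before re sees them.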

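2) Character classes

One or more characters enclosed in [] form a character class. Without a quantifier, a character class matches exactly one of its characters.

A character class can specify ranges.

For example:

1> [a-zA-Z0-9] matches any single character from a to z, from A to Z, or from 0 to 9;

2> A ^ immediately after the left bracket negates the class; again, without a quantifier it matches a single character;

3> Inside a character class, most special characters lose their special meaning; the exceptions are \, the - used for ranges, and the closing ];

4> ^ means negation only at the start of the class; anywhere else it stands for itself.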
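The character class rules can be checked with a short sketch. The sample string is invented for the example, and re.findall() is used only to list every match.

```python
import re

sample = "a1-B2^c3"

# No quantifier, so each match is exactly one alphanumeric character.
alnum = re.findall(r"[a-zA-Z0-9]", sample)

# A leading ^ negates the class: every character that is NOT alphanumeric.
non_alnum = re.findall(r"[^a-zA-Z0-9]", sample)

# A ^ that is not in first position is an ordinary character.
digits_or_caret = re.findall(r"[0-9^]", sample)

print(alnum)            # ['a', '1', 'B', '2', 'c', '3']
print(non_alnum)        # ['-', '^']
print(digits_or_caret)  # ['1', '2', '^', '3']
```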

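3) Shorthand

. ------ matches any single character except a newline

\d ------ matches a Unicode digit
\D ------ matches a Unicode non-digit
\s ------ matches Unicode whitespace
\S ------ matches Unicode non-whitespace
\w ------ matches a Unicode word character
\W ------ matches a Unicode non-word character
? ------ matches the preceding item 0 or 1 times
* ------ matches the preceding item 0 or more times
+ (plus sign) ------ matches the preceding item 1 or more times
{m} ------ matches the preceding expression exactly m times
{m,} ------ matches the preceding expression at least m times
{,n} ------ matches the preceding expression at most n times
{m,n} ------ matches the preceding expression at least m and at most n times (write it without a space after the comma, otherwise re treats the braces as literal text)
() ------ captures the content matched inside the parentheses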
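The shorthand classes and quantifiers are easiest to grasp by running them. A minimal sketch with invented sample strings:

```python
import re

# ? makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour colouur")

# \w+ is one or more word characters (letters, digits, underscore).
words = re.findall(r"\w+", "hi there_1!")

# \d{3} is exactly three digits; a{2,} is at least two "a" characters.
three_digits = re.findall(r"\d{3}", "12 345 6789")
two_or_more_a = re.findall(r"a{2,}", "a aa aaa")

# \S+ collects runs of non-whitespace, whatever the whitespace between them.
chunks = re.findall(r"\S+", "one  two\tthree")

# Parentheses capture; with two groups each result is a tuple.
pairs = re.findall(r"(\d+)-(\d+)", "10-20 and 30-40")

print(optional_u)     # ['color', 'colour']
print(words)          # ['hi', 'there_1']
print(three_digits)   # ['345', '678']
print(two_or_more_a)  # ['aa', 'aaa']
print(chunks)         # ['one', 'two', 'three']
print(pairs)          # [('10', '20'), ('30', '40')]
```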

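2. The regular expression module in Python

Python handles regular expressions through the re module. Its syntax follows the basics listed above. Pay particular attention to the items under 3) Shorthand, because they come up constantly when analyzing data after scraping.

3. Some functions of the re module

1) re.compile()

First, look at how re.compile() is used, from an interactive Python session in cmd: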

>>> import re
>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

>>>

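As the docstring says, re.compile() compiles a regular expression pattern and returns a pattern object.

Call re.compile(r, f), where r is the pattern and f the optional flags, to create a regular expression object, then call the methods of that object. The benefit is that the compiled object can be reused many times.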
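A minimal sketch of compiling once and reusing the object. The pattern, the flag choice and the sample lines are assumptions made for the example.

```python
import re

# Compile once; re.IGNORECASE is passed as the flags argument.
word_re = re.compile(r"python", re.IGNORECASE)

lines = ["Python is fun", "I like PYTHON", "no match here"]

# Reuse the same compiled object for every line.
hits = [line for line in lines if word_re.search(line)]

print(hits)  # ['Python is fun', 'I like PYTHON']
```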

2) re.findall()

Again, start with help:

>>> help(re.findall)
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

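Note this docstring in particular: when the pattern contains capturing groups, findall returns the groups rather than the whole match.

In the notation used here, with rx a compiled pattern object:

rx.findall(s, start, end)

returns a list. If the pattern has no groups, the list holds the full text of every match.
If the pattern has one group, the list holds the text matched by that group. If it has more than one group, each element is a tuple of the text matched by each group, and the text matched by the whole pattern is not returned. The optional start and end restrict the search to part of s; they are accepted by the compiled object's method, not by the module-level re.findall() function.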
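The effect of groups on the return value can be seen side by side in a short sketch. The sample text is invented, and rx is simply a name chosen here for a compiled pattern.

```python
import re

text = "tom=3, ann=15, bob=7"

# No group: the whole match is returned.
no_group = re.findall(r"\w+=\d+", text)

# One group: only the group's text is returned.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: each element is a tuple of the groups.
two_groups = re.findall(r"(\w+)=(\d+)", text)

# The compiled object's findall() also accepts a start position.
rx = re.compile(r"\d+")
from_index_6 = rx.findall(text, 6)

print(no_group)      # ['tom=3', 'ann=15', 'bob=7']
print(one_group)     # ['3', '15', '7']
print(two_groups)    # [('tom', '3'), ('ann', '15'), ('bob', '7')]
print(from_index_6)  # ['15', '7']
```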

3) re.finditer()

>>> help(re.finditer)
Help on function finditer in module re:

finditer(pattern, string, flags=0)
    Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.

    Empty matches are included in the result.

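rx.finditer(s, start, end), called on a compiled pattern object rx,

returns an iterator.

Each iteration yields one match object. Call its group() method to see what a given group matched; group 0 is the text matched by the whole pattern.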
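A minimal sketch of iterating over match objects. The addresses are made-up sample data and the pattern is deliberately simplistic.

```python
import re

rx = re.compile(r"(\w+)@(\w+)\.com")
text = "mail alice@example.com or bob@test.com"

rows = []
for m in rx.finditer(text):
    # group(0) is the whole match; group(1) and group(2) are the two groups.
    rows.append((m.group(0), m.group(1), m.group(2)))

print(rows)
# [('alice@example.com', 'alice', 'example'), ('bob@test.com', 'bob', 'test')]
```

Unlike findall(), each result here is a full match object, so methods such as start(), end() and span() are available as well.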

4) re.search()

>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.

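rx.search(s, start, end), called on a compiled pattern object rx,

returns a match object, or None if nothing matches.

search stops at the first match and does not keep scanning.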
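Because search() may return None, the result is usually checked before use. A small sketch with invented sample text:

```python
import re

text = "order 66 shipped, order 67 pending"

# Only the first match is reported, even though two orders are present.
m = re.search(r"order (\d+)", text)
first_order = m.group(1) if m else None

# No match at all gives None rather than raising an error.
missing = re.search(r"refund", text)

print(first_order)  # 66
print(missing)      # None
```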

5) re.match()

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

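rx.match(s, start, end), called on a compiled pattern object rx,

returns a match object if the pattern matches at the very start of the string, otherwise None.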
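The difference between match() and search() shows up as soon as the text of interest is not at the start of the string. A minimal sketch with sample strings chosen for the example:

```python
import re

rx = re.compile(r"\d+")

at_start = rx.match("123abc")          # digits at position 0: matches
not_at_start = rx.match("abc123")      # digits come later: None
found_by_search = rx.search("abc123")  # search scans forward and finds them
from_index_3 = rx.match("abc123", 3)   # matching begins at index 3 instead

print(at_start.group())         # 123
print(not_at_start)             # None
print(found_by_search.group())  # 123
print(from_index_3.group())     # 123
```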

6) re.sub()

>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.

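rx.sub(x, s, m), called on a compiled pattern object rx,

returns a string in which every match in s has been replaced by x. If m is given, at most m replacements are made. Inside x, captured content can be referenced with \i, where i is a group number, or with \g<id>, where id is a group name or number.

In the module-level form re.sub(r, x, s, m), x can also be a function. The function receives each match object, so the captured content can be processed by the function before it replaces the matched text.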
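The replacement forms described here can be sketched as follows. The sample strings and the helper name double are invented for the example.

```python
import re

# Numbered backreferences: reorder the three captured parts of a date.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2022-10-02")

# Named groups referenced with \g<name>.
named = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})", r"\g<m>.\g<y>", "2022-10")


def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group(0)) * 2)


# A function as the replacement: it is called once per match object.
doubled = re.sub(r"\d+", double, "3 apples, 10 pears")

# count limits how many replacements are made.
limited = re.sub(r"\d+", "N", "1 2 3", count=1)

print(swapped)  # 02/10/2022
print(named)    # 10.2022
print(doubled)  # 6 apples, 20 pears
print(limited)  # N 2 3
```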

7) re.subn()

>>> help(re.subn)
Help on function subn in module re:

subn(pattern, repl, string, count=0, flags=0)
    Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used.

rx.subn(x, s, m)

This works like re.sub(), except that it returns a 2-tuple: the resulting string and the number of substitutions made.
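A short sketch of the 2-tuple, collapsing runs of whitespace in an invented sample string:

```python
import re

# Replace every run of whitespace with a single space and count the runs.
new_text, count = re.subn(r"\s+", " ", "a   b \t c\n d")

print(new_text)  # a b c d
print(count)     # 3
```

The count is handy for checking whether a cleanup step changed anything at all.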

8) re.split()

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.

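rx.split(s, m), called on a compiled pattern object rx,

splits the string s wherever the pattern matches and returns a list; m, if given, caps the number of splits.

If the pattern contains groups, the text matched by the groups is placed in the list between each pair of split pieces.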
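The three behaviours of split (plain, with a capturing group, and with a split limit) can be compared in a small sketch with invented input:

```python
import re

# Split on a comma or semicolon followed by optional whitespace.
plain = re.split(r"[,;]\s*", "a, b;c,  d")

# With a capturing group, the separators themselves appear in the list.
with_separators = re.split(r"([,;])\s*", "a, b;c")

# maxsplit=2 stops after two splits; the rest stays in the last element.
limited = re.split(r",", "a,b,c,d", maxsplit=2)

print(plain)            # ['a', 'b', 'c', 'd']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
print(limited)          # ['a', 'b', 'c,d']
```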

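III. A regular expression example

Let's put regular expressions to work in a small scraper:

It crawls the Douban Movie Top 250 list and fetches the rating of each movie.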

import re
import requests

if __name__ == '__main__':
    # Test entry point (main).
    j = 1  # running rank of the movie being printed
    # The list is paginated, 25 movies per page: start=0, 25, ..., 225.
    for i in range(0, 226, 25):
        url = f'https://movie.douban.com/top250?start={i}&filter='
        # Send a browser User-Agent with every request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63'
        }
        response = requests.get(url=url, headers=headers)
        # Each element of result is a (movie URL, movie title) tuple.
        result = re.findall(r'<a href="(\S+)">\s+'
                            r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                            r'</a>', response.text)
        for movie in result:
            url_0 = movie[0]
            # Open the movie's own page and read the rating from it.
            response_0 = requests.get(url=url_0, headers=headers)
            score = re.findall(r'<strong class="ll rating_num" property="v:average">(\S+)'
                               r'</strong>\s+'
                               r'<span property="v:best" content="10.0"></span>',
                               response_0.text)[0]
            print(j, end='  ')
            j += 1
            print(movie[1], end='  ')
            print(movie[0], end='  ')
            print(f'Rating : {score}')

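Here the regular expressions extract each movie's title and URL; the script then follows the URL to the movie's own page and reads the rating from it.
The key regular expression code is:

1. Getting the movie title and URL: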

result = re.findall(r'<a href="(\S+)">\s+'
                    r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                    r'</a>', response.text)

2. Getting the movie's rating:

score = re.findall(r'<strong class="ll rating_num" property="v:average">(\S+)'
                   r'</strong>\s+'
                   r'<span property="v:best" content="10.0"></span>',
                   response_0.text)[0]
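The two patterns can be tried offline, without sending any requests, by running them against small HTML fragments. In the sketch below the fragments, the example.com URLs and the title are all made up to imitate the structure the patterns expect; they are not real Douban markup.

```python
import re

# Hypothetical fragment shaped like one entry on the list page.
list_html = '''<a href="https://example.com/subject/1/">
    <img width="100" alt="SampleTitle" src="https://example.com/p1.jpg" class="">
</a>'''

# Hypothetical fragment shaped like the rating block on a movie page.
movie_html = ('<strong class="ll rating_num" property="v:average">9.7</strong>\n'
              '<span property="v:best" content="10.0"></span>')

# Same pattern as in the scraper: two groups, so each result is a tuple.
result = re.findall(r'<a href="(\S+)">\s+'
                    r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                    r'</a>', list_html)

# Same rating pattern: one group, so findall returns a list of strings.
scores = re.findall(r'<strong class="ll rating_num" property="v:average">(\S+)'
                    r'</strong>\s+'
                    r'<span property="v:best" content="10.0"></span>',
                    movie_html)

print(result)  # [('https://example.com/subject/1/', 'SampleTitle')]
print(scores)  # ['9.7']
```

Since \S+ cannot cross whitespace, a title that contains a space would not be matched by the first pattern.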

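One last note: a small flaw of this scraper is that the page no longer seems to yield all 250 entries, and only 248 movies can be scraped. This is probably just a quirk of the page itself, and it does not matter much.

As shown in the figure below: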