Modifying Web Page Content with BeautifulSoup4 in Python: A Hands-On Record

2022-05-20 13:00:07

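A recent small project required scraping the resource files referenced by a page and saving them locally, then saving the original HTML source and modifying it so that certain tags were replaced wholesale.

For this kind of HTML manipulation, the BeautifulSoup4 library is one of the most convenient tools available.

The sample HTML is as follows: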

<html>
<body>
    <a class="videoslide" href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/1381824922cj2dsv3zv3a.JPG" rel="external nofollow">
       <img src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/1381824922_zy_compress3mqjst0oef0.JPG" data-zy-media-id="zy_location_201310151613422786"/>
    </a>
    <a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第一張_1381824798dkiytf5rwvm.JPG" rel="external nofollow">
       <img data-zy-media-id="zy_image_201310151613169945" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第一張_1381824798_zy_compresswtgrmnmp154.JPG"/>
    </a>
    <a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第二張_1381824796sl0ihsmeing.jpg" rel="external nofollow">
       <img data-zy-media-id="zy_image_201310151613163009" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第二張_1381824796_zy_compressygyaowigzr2.jpg"/>
    </a>
    <a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第三張kdllz40lz1j.jpg" rel="external nofollow">
       <img data-zy-media-id="zy_image_201312311838584446" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第三張_zy_compressljqije2hldv.jpg"/>
    </a>
</body>
</html>

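The markup consists mainly of <a> tags, each wrapping an <img> tag. An <a> tag marked with class="videoslide" actually represents a playable video. The class is therefore the deciding factor: an <a> tag with class="videoslide" is replaced by a <video> player block, and an <a> tag without it is replaced by a <figure> tag.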
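The class-based split described above can be sketched with CSS selectors. This is only an illustration on a shortened, made-up fragment (the file names are placeholders, and it uses the built-in html.parser so that it runs without lxml):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page; file names are placeholders.
html = (
    '<a class="videoslide" href="clip.JPG"><img src="clip_small.JPG"/></a>'
    '<a href="photo.jpg"><img src="photo_small.jpg"/></a>'
)
soup = BeautifulSoup(html, "html.parser")

# A class selector matches if the class appears anywhere in the class list.
video_links = soup.select("a.videoslide")
# :not() picks every <a> that lacks the class.
image_links = soup.select("a:not(.videoslide)")

video_hrefs = [a["href"] for a in video_links]
image_hrefs = [a["href"] for a in image_links]
print(video_hrefs)
print(image_hrefs)
```

Selecting the two groups separately is an alternative to looping over every <a> tag and testing its class inside the loop; both approaches reach the same tags.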

Concretely, every <a class="videoslide" ...><img ... /></a> tag is replaced with

<div class="video">
<video controls width="100%" poster="poster-image-url.jpg">
    <source src="video-file-url.mp4" type="video/mp4" />
    Your browser does not support HTML5 video. Please use Chrome, Firefox, or Edge.
</video>
</div>

and every other <a ...><img .../></a> tag is replaced with

<figure>
    <img src="image-url_compressed.jpg" data-zy-media-id="image-url.jpg">
    <figcaption>Caption text (if any)</figcaption>
</figure>
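As a sketch of how such a <figure> block could be assembled, the snippet below builds it node by node with new_tag() instead of concatenating an HTML string. The helper name build_figure and its arguments are hypothetical, and omitting <figcaption> when there is no caption is a choice made for this example, not something the article prescribes:

```python
from bs4 import BeautifulSoup


def build_figure(soup, img_src, media_id, caption=""):
    """Hypothetical helper: return a <figure> wrapping one <img>."""
    figure = soup.new_tag("figure")
    img = soup.new_tag("img", src=img_src)
    # Attribute names containing hyphens are set through item assignment.
    img["data-zy-media-id"] = media_id
    figure.append(img)
    if caption:
        figcaption = soup.new_tag("figcaption")
        figcaption.string = caption  # escaped automatically on output
        figure.append(figcaption)
    return figure


soup = BeautifulSoup("", "html.parser")
plain = build_figure(soup, "src/a_small.jpg", "src/a.jpg")
captioned = build_figure(soup, "src/b_small.jpg", "src/b.jpg", "Tom & Jerry")
print(plain)
print(captioned)
```

Building the nodes directly means that special characters in a caption or file name are escaped by the library when the tree is serialized, which string concatenation does not guarantee.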

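The approach uses BeautifulSoup4's select() method to find the tags, get() to read their attribute values, and replaceWith() (spelled replace_with() in current releases) to swap one tag for another.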
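Before the full listing, here is a minimal sketch of those three calls on a tiny made-up fragment. It assumes BeautifulSoup4 is already installed and uses the built-in html.parser; the file names are placeholders:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a href="big.jpg"><img src="small.jpg"/></a></p>', "html.parser"
)

link = soup.select("a")[0]   # select() returns a list of matching tags
href = link.get("href")      # get() returns the attribute value
title = link.get("title")    # a missing attribute gives None, not KeyError

img = link.img.extract()     # detach the <img> from the link
link.replace_with(img)       # put the <img> where the <a> used to be

print(href, title)
print(soup)
```

After replace_with() the <a> tag is no longer part of the tree, so any value needed from it (such as href) should be read beforehand.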

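First install the BeautifulSoup4 library. The code below parses with lxml, an optional parser that BeautifulSoup4 does not install by default, so install lxml as well.

pip install beautifulsoup4
pip install lxml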
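If lxml might be missing on some machine, one possible safeguard is to fall back to Python's built-in parser. The sketch below is an assumption about how that could be done (the helper name make_soup is hypothetical). Note that the two parsers differ in detail: lxml wraps a bare fragment in <html><body>, while html.parser leaves it as is.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else with html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
text = soup.p.get_text()
print(text)
```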

The full implementation is as follows:

import posixpath

from bs4 import BeautifulSoup

htmlstr = (
    '<html><body>'
    '<a class="videoslide" href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/1381824922cj2dsv3zv3a.JPG" rel="external nofollow">'
    '<img src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/1381824922_zy_compress3mqjst0oef0.JPG" data-zy-media-id="zy_location_201310151613422786"/></a>'
    '<a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第一張_1381824798dkiytf5rwvm.JPG" rel="external nofollow">'
    '<img data-zy-media-id="zy_image_201310151613169945" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第一張_1381824798_zy_compresswtgrmnmp154.JPG"/></a>'
    '<a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第二張_1381824796sl0ihsmeing.jpg" rel="external nofollow">'
    '<img data-zy-media-id="zy_image_201310151613163009" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第二張_1381824796_zy_compressygyaowigzr2.jpg"/></a>'
    '<a href="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第三張kdllz40lz1j.jpg" rel="external nofollow">'
    '<img data-zy-media-id="zy_image_201312311838584446" src="https://s3.ap-northeast-1.wasabisys.com/img.it145.com/202205/第三張_zy_compressljqije2hldv.jpg"/></a>'
    '</body></html>'
)


def local_path(url):
    """Map a remote URL to its local copy under src/ (forward slashes, as HTML expects)."""
    return posixpath.join('src', posixpath.basename(url))


def procHtml(htmlstr):
    soup = BeautifulSoup(htmlstr, 'lxml')
    for a_tag in soup.select('a'):
        a_tag_src = a_tag.get('href')
        img_tag = a_tag.find('img')
        if not a_tag_src or img_tag is None:
            continue  # nothing to rewrite without a link target and an image
        a_tag_path = local_path(a_tag_src)
        a_tag['href'] = a_tag_path
        # class="videoslide" marks a video; any other <a> wraps a plain image
        if 'videoslide' in a_tag.get('class', []):
            # Handle the video file
            media_id = img_tag.get('data-zy-media-id')
            if media_id:
                media_url = 'http://www.test.com/travel/show_media/' + str(media_id) + '.mp4'
                media_path = local_path(media_url)
                # Replace the <a> tag with a div.video block
                video_html = (
                    '<div class="video"><video controls width="100%" poster="' + a_tag_path + '">'
                    '<source src="' + media_path + '" type="video/mp4"/>'
                    'Your browser does not support HTML5 video. Please use Chrome, Firefox, or Edge.'
                    '</video></div>'
                )
                video_soup = BeautifulSoup(video_html, 'lxml')
                a_tag.replace_with(video_soup.div)
        else:
            # Handle the image
            img_src = img_tag.get('src')
            # Skip embedded or local resources (data:image and file: URLs)
            if img_src and 'data:image' not in img_src and 'file:' not in img_src:
                img_path = local_path(img_src)
                # Replace the <a> tag with a <figure><img> block
                figcaption = ''
                figure_html = (
                    '<figure><img src="' + img_path + '" data-zy-media-id="' + a_tag_path + '">'
                    '<figcaption>' + figcaption + '</figcaption></figure>'
                )
                figure_soup = BeautifulSoup(figure_html, 'lxml')
                a_tag.replace_with(figure_soup.figure)
    return str(soup)


if __name__ == '__main__':
    pro_html_str = procHtml(htmlstr)
    print(pro_html_str)

Result (line breaks added for readability):

<html>
<body>
<div class="video">
<video controls="" poster="src/1381824922cj2dsv3zv3a.JPG" width="100%">
<source src="src/zy_location_201310151613422786.mp4" type="video/mp4"/>Your browser does not support HTML5 video. Please use Chrome, Firefox, or Edge.
</video>
</div>
<figure>
<img data-zy-media-id="src/第一張_1381824798dkiytf5rwvm.JPG" src="src/第一張_1381824798_zy_compresswtgrmnmp154.JPG"/>
<figcaption></figcaption>
</figure>
<figure>
<img data-zy-media-id="src/第二張_1381824796sl0ihsmeing.jpg" src="src/第二張_1381824796_zy_compressygyaowigzr2.jpg"/>
<figcaption></figcaption>
</figure>
<figure>
<img data-zy-media-id="src/第三張kdllz40lz1j.jpg" src="src/第三張_zy_compressljqije2hldv.jpg"/>
<figcaption></figcaption>
</figure>
</body>
</html>
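The listing derives each local file name by taking the basename of the full URL. That works for the sample URLs, but a URL with a query string or percent-escapes would produce an awkward file name. A slightly more defensive variant, offered only as a sketch (the helper name local_path_for and the example.com URLs are made up for illustration), parses the URL first:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_path_for(url, folder="src"):
    """Hypothetical helper: map a URL to folder/filename, ignoring ?query and #fragment."""
    path = urlsplit(url).path                     # drops scheme, host, query, fragment
    filename = posixpath.basename(unquote(path))  # decode %XX escapes in the name
    return posixpath.join(folder, filename)       # always forward slashes


first = local_path_for("https://example.com/img/photo.jpg?size=large#top")
second = local_path_for("https://example.com/img/my%20photo.jpg")
print(first)
print(second)
```

Using posixpath rather than os.path keeps the separator a forward slash on every platform, which is what an HTML src or href attribute expects.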
