
A workaround when Python requests cannot retrieve a page's source code

2022-07-08 14:05:28

Recently, while scraping http://skell.sketchengine.eu, I found that requests could not retrieve the page's full content: the part of the URL after the "#" is a fragment that is never sent to the server, and the example sentences are rendered client-side, so they never appear in the raw HTML. Instead, I open the page with selenium in a simulated browser, grab the rendered source, and parse it with BeautifulSoup to pull out the example sentences. To keep the loop working across successive words, the loop body calls refresh(): when the browser receives a new URL, refreshing forces the page content to update. After the refresh we pause for 2 seconds, which lowers the chance of reading the page before its content has loaded. To reduce the risk of being blocked, we also set a Chrome User-Agent. The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time
import re

# Path to the chromedriver executable (raw string so backslashes are literal)
path = Service(r"D:\MyDrivers\chromedriver.exe")

# Run Chrome headless (no visible browser window)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')

# Create the Chrome instance
driver = webdriver.Chrome(service=path, options=chrome_options)
lst = ["happy", "help", "evening", "great", "think", "adapt"]

for word in lst:
    url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
    driver.get(url)
    # Refresh so the page re-renders for the new URL fragment
    driver.refresh()
    time.sleep(2)
    # page_source -> the rendered page source
    resp = driver.page_source
    # Parse the source
    soup = BeautifulSoup(resp, "html.parser")
    table = soup.find_all("td")
    with open("eps.txt", 'a+', encoding='utf-8') as f:
        f.write(f"\nExamples of {word}\n")
    for i in table[0:6]:
        text = i.text
        # Collapse runs of whitespace into single spaces
        new = re.sub(r"\s+", " ", text)
        # Start each numbered sentence on its own line and append it to the file
        with open("eps.txt", 'a+', encoding='utf-8') as f:
            f.write(re.sub(r"^(\d+\.)", r"\n\1", new))
driver.close()

1. To speed up access, the browser window is hidden (headless mode), configured through chrome.options.

2. Regular expressions from the re module clean up the extracted text.

3. table[0:6] takes the first six td cells, which here correspond to the first three sentences (each sentence spans two cells: its number and its text). The resulting output is shown below.
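The cell slicing and regex cleanup can be checked offline against a hand-written HTML fragment. The markup below is a simplified stand-in for SkELL's rendered concordance table, not its actual structure:

```python
from bs4 import BeautifulSoup
import re

# Simplified stand-in for the rendered page: each example sentence
# occupies two <td> cells, its number and its text.
html = """
<table>
  <tr><td>1.</td><td>This  happy   mood lasted.</td></tr>
  <tr><td>2.</td><td>The lodging was neither convenient nor happy.</td></tr>
  <tr><td>3.</td><td>One big happy family.</td></tr>
  <tr><td>4.</td><td>A fourth sentence we do not want.</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("td")

# table[0:6] keeps the first six cells = the first three sentences
lines = []
for cell in table[0:6]:
    text = re.sub(r"\s+", " ", cell.text)          # collapse whitespace
    lines.append(re.sub(r"^(\d+\.)", r"\n\1", text))  # newline before each number

result = "".join(lines)
print(result)
```

The number cells match `^(\d+\.)` and gain a leading newline, so each sentence starts on its own line, exactly as the scraping loop writes them to the file.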

Examples of happy
1. This happy mood lasted roughly until last autumn. 
2. The lodging was neither convenient nor happy . 
3. One big happy family "fighting communism". 
Examples of help
1. Applying hot moist towels may help relieve discomfort. 
2. The intense light helps reproduce colors more effectively. 
3. My survival route are self help books. 
Examples of evening
1. The evening feast costs another $10. 
2. My evening hunt was pretty flat overall. 
3. The area nightclubs were active during evenings . 
Examples of great
1. The three countries represented here are three great democracies. 
2. Our three different tour guides were great . 
3. Your receptionist "crew" is great ! 
Examples of think
1. I said yes immediately without thinking everything through. 
2. This book was shocking yet thought provoking. 
3. He thought "disgusting" was more appropriate. 
Examples of adapt
1. The novel has been adapted several times. 
2. There are many ways plants can adapt . 
3. They must adapt quickly to changing deadlines. 

Update: after some optimization, scraping the example sentences is quicker. The word list now lives in a file, and the code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time
import re
import os

# Path to the chromedriver executable
path = Service(r"D:\MyDrivers\chromedriver.exe")

# Run Chrome headless (no visible browser window)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')

def get_wordlist():
    # Read one word per line from wordlist.txt
    wordlist = []
    with open("wordlist.txt", 'r', encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if word:
                wordlist.append(word)
    return wordlist

def main(lst):
    driver = webdriver.Chrome(service=path, options=chrome_options)
    for word in lst:
        url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
        driver.get(url)
        driver.refresh()
        time.sleep(2)
        # page_source -> the rendered page source
        resp = driver.page_source
        # Parse the source
        soup = BeautifulSoup(resp, "html.parser")
        table = soup.find_all("td")
        with open("examples.txt", 'a+', encoding='utf-8') as f:
            f.write(f"\nExamples of {word}\n")
        for i in table[0:6]:
            text = i.text
            new = re.sub(r"\s+", " ", text)
            with open("examples.txt", 'a+', encoding='utf-8') as f:
                f.write(new)
    driver.close()

if __name__ == "__main__":
    lst = get_wordlist()
    main(lst)
    os.startfile("examples.txt")
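As a quick offline check of the helper: get_wordlist just strips one word per line and skips blanks. The sketch below exercises the same logic against a throwaway file (the name wordlist_demo.txt is made up for this demo):

```python
import os
import tempfile

# Same logic as get_wordlist(), but with the path as a parameter
def read_wordlist(path):
    words = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if word:
                words.append(word)
    return words

# Write a throwaway word list, one word per line (with a blank line)
tmp = os.path.join(tempfile.gettempdir(), "wordlist_demo.txt")
with open(tmp, 'w', encoding='utf-8') as f:
    f.write("happy\nhelp\n\nevening\n")

words = read_wordlist(tmp)
print(words)  # ['happy', 'help', 'evening']
os.remove(tmp)
```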

Summary

That concludes this article on working around requests failing to retrieve a page's source code. For more on retrieving page source with requests, search it145.com's earlier articles or browse the related articles below, and please keep supporting it145.com!

