Building a Simple Linear Regression Model in Python

2022-08-25 22:01:33

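Linear regression model

Linear regression looks for a function that expresses the output as a linear combination of the input variables. Simple linear regression, which uses a single input variable, is easy to understand and relies only on basic regression techniques; once these fundamentals are clear, other types of regression models are easier to learn.

Regression is used to discover the relationship between input variables and output variables, which are usually real-valued. Our goal is to estimate the underlying function that maps inputs to outputs.

Let's start with a simple example: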

1 --> 2
3 --> 6
4.3 --> 8.6
7.1 --> 14.2

Looking at the data above, you have probably already spotted the relationship: f(x) = 2x
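A relationship like this can also be recovered numerically rather than by eye. The sketch below applies the closed-form least-squares formulas for a single input variable to four pairs that follow f(x) = 2x. It uses only the standard library, and the helper name fit_line is hypothetical, chosen for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 3.0, 4.3, 7.1]
ys = [2.0, 6.0, 8.6, 14.2]
slope, intercept = fit_line(xs, ys)
print("slope =", slope, "intercept =", intercept)
```

The slope comes out at 2 and the intercept at 0, up to floating-point rounding, which is the f(x) = 2x relationship read off the table.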

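Real-world data is rarely this clean, however. The sample data below comes from the file Vehicles.txt. Each line holds two comma-separated values: the first is the input and the second is the output. Our goal is to find a linear regression relationship that estimates a province's population from its number of registered vehicles.

Sample data: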

145263,   127329
204477,   312027
361034,   573694
616716,   891181
885665,   1059114
773600,   1221218
850513,   1326513
996733,   1543752
827967,   1571053
1011436,1658138
1222738,1970521
2404651,3744398
2259795,4077166
2844588,4404246
2774071,4448146
3011089,4915123
3169307,5074261
3346791,5850850
3702114,5888472
5923476,10008349

1. Load the data

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import sklearn.metrics as sm
import pickle
filename = "data/vehicles.txt"
x = []
y = []

with open(filename, 'r') as lines:
    for line in lines:
        xt, yt = [float(i) for i in line.split(',')]
        x.append(xt)
        y.append(yt)

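The code above loads the file into the variables x and y, where x is the independent variable and y is the response variable. The loop reads each line, splits it on the comma into two values, and converts them to floats.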
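The parsing step can be tried without the file on disk. The sketch below feeds two rows from the sample data through the same split-and-convert logic, with io.StringIO standing in for the real file. It relies on float() ignoring surrounding whitespace, which is why rows with spaces after the comma still parse. The blank-line check is an addition for illustration: a whitespace-only line would otherwise raise a ValueError.

```python
import io

# Two rows copied from the sample data, plus a trailing blank line.
# io.StringIO stands in for the real file.
sample = io.StringIO("145263,   127329\n1011436,1658138\n\n")

x, y = [], []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of failing on them
        continue
    xt, yt = (float(field) for field in line.split(','))
    x.append(xt)
    y.append(yt)

print(x, y)
```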

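2. Split into training and test sets

When building a machine learning model, the data needs to be split into a training set and a test set. The training set is used to build the model; the test set is used to validate it and check whether it meets requirements.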

num_training = int(0.8 * len(x))
num_test = len(x) - num_training

# Training data: the first 80%
x_train = np.array(x[: num_training]).reshape((num_training, 1))
y_train = np.array(y[: num_training])

# Test data: the remaining 20%
x_test = np.array(x[num_training:]).reshape((num_test, 1))
y_test = np.array(y[num_training:])

We take the first 80% of the data as the training set and the rest as the test set. This gives us four arrays: x_train, x_test, y_train and y_test.
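A slice-based split keeps the rows in file order. The sample rows shown earlier are listed in ascending order of population, so if the whole file is ordered that way, the test set ends up holding only the largest provinces. One possible alternative is to shuffle before splitting. The following is a minimal standard-library sketch; the function name shuffled_split and its parameters are hypothetical, and the toy data stands in for the real rows.

```python
import random


def shuffled_split(x, y, train_fraction=0.8, seed=0):
    """Shuffle (x, y) pairs together, then split them into train and test parts."""
    pairs = list(zip(x, y))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    num_training = int(train_fraction * len(pairs))
    train, test = pairs[:num_training], pairs[num_training:]
    x_tr, y_tr = zip(*train)
    x_te, y_te = zip(*test)
    return list(x_tr), list(x_te), list(y_tr), list(y_te)


# Toy data standing in for 20 rows of (input, output) pairs
x_all = list(range(20))
y_all = [2 * v for v in x_all]
x_tr, x_te, y_tr, y_te = shuffled_split(x_all, y_all)
print(len(x_tr), len(x_te))
```

scikit-learn offers the same idea as sklearn.model_selection.train_test_split, which shuffles by default and accepts test_size and random_state arguments.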

3. Train the model

We are now ready to train the model, which requires a regressor object.

# Create linear regression object
linear_regressor = linear_model.LinearRegression()

# Train the model using the training sets
linear_regressor.fit(x_train, y_train)

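The linear_model module was imported from sklearn at the top of the script; it contains methods for linear regression, in which the target value is modeled as a linear combination of the input variables. LinearRegression() creates a regressor that performs linear regression using ordinary least squares. Finally, the fit method fits the linear model and takes two arguments: x_train and y_train.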
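After fitting, the learned line can be read from the regressor's coef_ and intercept_ attributes. The sketch below is an illustration on four toy points that follow y = 2x exactly, not the vehicle data, so that the expected slope and intercept are known in advance.

```python
import numpy as np
from sklearn import linear_model

# Four toy points that follow y = 2x exactly (not the vehicle data)
x_toy = np.array([1.0, 3.0, 4.3, 7.1]).reshape(-1, 1)  # shape (n_samples, 1)
y_toy = np.array([2.0, 6.0, 8.6, 14.2])

model = linear_model.LinearRegression()
model.fit(x_toy, y_toy)

slope = model.coef_[0]        # one coefficient per input feature
intercept = model.intercept_  # the constant term
print("slope =", round(slope, 4), "intercept =", round(intercept, 4))

# A prediction is just slope * x + intercept
manual = slope * 10.0 + intercept
predicted = model.predict(np.array([[10.0]]))[0]
print(manual, predicted)
```

The reshape to a single column matters because fit expects a two-dimensional array of shape (n_samples, n_features), which is also why the article reshapes x_train and x_test.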

4. Predict

Above, the fit method took the training data and fitted a linear model to it. To see how well it fits, we can predict on the training data itself:

y_train_pred = linear_regressor.predict(x_train)

5. Plot the linear fit

plt.figure()
plt.scatter(x_train, y_train, color='green')
plt.plot(x_train, y_train_pred, color='black', linewidth=4)
plt.title('Training data')
plt.show()

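The resulting plot shows the training points in green and the fitted line in black.

So far the trained model has only predicted the data it was trained on, which tells us nothing certain about its performance on unseen data. We need to test it on the test data.

6. Predict on the test data

Next we predict on the test data and plot the result: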

y_test_pred = linear_regressor.predict(x_test)
plt.figure()
plt.scatter(x_test, y_test, color='green')
plt.plot(x_test, y_test_pred, color='black', linewidth=4)
plt.title('Test data')
plt.show()

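As expected, province population is positively correlated with the number of registered vehicles.

Evaluating model accuracy

We have built a regression model, but we still need to evaluate its quality. Here we define the error as the difference between the actual value and the predicted value. Let's see how to compute the accuracy of the regression model.

1. Compute the regression metrics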

print("MAE =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("MSE =", round(sm.mean_squared_error(y_test,  y_test_pred), 2))
print("Median absolute error =",
      round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =",
      round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

Output:

MAE = 241907.27
MSE = 81974851872.13
Median absolute error = 240861.94
Explained variance score = 0.98
R2 score = 0.98

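An R2 score close to 1 indicates that the model predicts very well. Computing every metric is tedious, so in practice one or two are usually chosen to evaluate a model. A good rule of thumb is to look for a low MSE together with a high explained variance score.

Mean absolute error: the average of the absolute errors over all data points in the dataset.
Mean squared error: the average of the squared errors over all data points in the dataset. It is one of the most commonly used metrics.
Median absolute error: the median of the absolute errors over all data points in the dataset. Its main advantage is robustness to outliers.
Explained variance score: how much of the variation in the dataset the model accounts for. A score of 1.0 indicates a perfect model.
R2 score: pronounced "R squared", this is the coefficient of determination. It indicates how well the model is likely to predict unseen samples. The best possible score is 1.0, and it can also be negative.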
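To make these definitions concrete, the sketch below computes all five metrics by hand with the standard library. The function name regression_metrics is hypothetical, and the four values are toy numbers for illustration, not the vehicle data. The formulas follow the standard single-output definitions, which should agree with the sklearn.metrics functions used above.

```python
import statistics


def regression_metrics(y_true, y_pred):
    """Compute the five regression metrics from their definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = statistics.fmean(y_true)
    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": statistics.fmean(abs_errors),
        "mse": ss_res / len(errors),
        "median_ae": statistics.median(abs_errors),
        # 1 - Var(errors) / Var(y_true), using population variances
        "explained_variance": 1 - statistics.pvariance(errors) / statistics.pvariance(y_true),
        "r2": 1 - ss_res / ss_tot,
    }


# Toy values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, "=", round(value, 4))
```

Explained variance and R2 differ only when the errors have a non-zero mean, that is, when the model is systematically too high or too low. In this toy example the predictions are too high on average, so R2 comes out slightly below the explained variance score.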

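Model persistence

Once a model is trained, it can be saved to a file and loaded directly the next time predictions are needed.
Let's see how to persist the model. This uses the pickle module, which serializes Python objects and is part of the Python standard library.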

# Save the model to a file
output_model_file = "3_model_linear_regr.pkl"
with open(output_model_file, 'wb') as f:
    pickle.dump(linear_regressor, f)

# Load the model back and use it
with open(output_model_file, 'rb') as f:
    model_linregr = pickle.load(f)

y_test_pred_new = model_linregr.predict(x_test)
print("New mean absolute error =",
      round(sm.mean_absolute_error(y_test, y_test_pred_new), 2))

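Because the reloaded model is the same fitted regressor, the printed value should match the MAE reported earlier (241907.27).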
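The save and load round trip can also be tried in isolation. In the sketch below a plain dictionary stands in for the trained regressor, since any picklable Python object is handled the same way. The file name params.pkl is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works the same way
params = {"slope": 2.0, "intercept": 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "params.pkl")  # hypothetical file name

    # pickle writes bytes, so the file must be opened in binary mode ('wb')
    with open(path, "wb") as f:
        pickle.dump(params, f)

    # reading back also needs binary mode ('rb')
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == params)
```

The mode strings must be exactly 'wb' and 'rb'; a stray space inside the string makes open() raise a ValueError. Since unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.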