A Summary of Four Hyperparameter Search Methods for Machine Learning in Python

2022-11-07 14:01:02

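Baseline Model

As a baseline for comparison, we first train a random forest with its default hyperparameters and measure its accuracy on the test set: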

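import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the data; 'output' is the target column
df = pd.read_csv('https://mirror.coggle.club/dataset/heart.csv')
X = df.drop(columns=['output'])
y = df['output']

# Stratified train/test split (no random_state is set, so results vary between runs)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y)

# Train the model and compute its accuracy on the test set
clf = RandomForestClassifier(random_state=0)
clf.fit(x_train, y_train)
clf.score(x_test, y_test)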

The model's final test-set accuracy is 0.802.

GridSearch

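GridSearch (grid search) is the most basic hyperparameter search method. It exhaustively evaluates every combination of the candidate hyperparameter values and keeps the best one.

In the code below we search over 4 hyperparameters, giving a search space of 5 * 3 * 2 * 3 = 90 combinations. Each combination is evaluated with 5-fold cross-validation, so 450 models are trained in total.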

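from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter: 5 * 3 * 2 * 3 = 90 combinations
parameters = {
    'max_depth': [2, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 3],
    'min_weight_fraction_leaf': [0, 0.1],
    'min_impurity_decrease': [0, 0.1, 0.2]
}

# Fitting 5 folds for each of 90 candidates, totalling 450 fits
clf = GridSearchCV(
    RandomForestClassifier(random_state=0),
    parameters, refit=True, verbose=1,
)
clf.fit(x_train, y_train)

# refit=True retrains the best combination on the full training set
clf.best_estimator_.score(x_test, y_test)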

The model's final test-set accuracy is 0.815.
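The candidate and fit counts quoted for the grid can be checked by enumerating the Cartesian product directly. The following is a minimal standard-library sketch, not scikit-learn's own implementation; the value of n_folds is an assumption taken from the 5-fold cross-validation mentioned in the text.

```python
from itertools import product

# Same search space as the GridSearch example above.
parameters = {
    'max_depth': [2, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 3],
    'min_weight_fraction_leaf': [0, 0.1],
    'min_impurity_decrease': [0, 0.1, 0.2],
}

# Cartesian product of all value lists: one dict per candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # assumption: the 5-fold cross-validation described in the text
n_fits = len(candidates) * n_folds

print(len(candidates))  # 90 hyperparameter combinations
print(n_fits)           # 450 model fits
print(candidates[0])    # first combination in grid order
```

Every extra value added to any one list multiplies the total, which is why exhaustive search becomes expensive quickly.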

RandomizedSearch

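RandomizedSearch samples hyperparameter combinations from the given search space. You specify how many combinations to try, and by default it does not cover all of them.

n_iter is the number of combinations to sample (the default is 10) and is normally set below the total number of combinations. With n_iter=10 as below and 5-fold cross-validation, only 50 models are trained.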

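from sklearn.model_selection import RandomizedSearchCV

# Same search space as for GridSearch (90 combinations)
parameters = {
    'max_depth': [2, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 3],
    'min_weight_fraction_leaf': [0, 0.1],
    'min_impurity_decrease': [0, 0.1, 0.2]
}

# Only n_iter=10 of the 90 combinations are sampled and evaluated
clf = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    parameters, refit=True, verbose=1, n_iter=10,
)

clf.fit(x_train, y_train)
clf.best_estimator_.score(x_test, y_test)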

The model's final test-set accuracy is 0.815.
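To make the sampling step concrete, here is a small standard-library sketch of drawing n_iter distinct combinations from the same grid. sample_candidates is a hypothetical helper written for illustration, not part of scikit-learn. It draws without replacement, which to our understanding matches how scikit-learn behaves when every parameter is given as a list.

```python
import random
from itertools import product

parameters = {
    'max_depth': [2, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 3],
    'min_weight_fraction_leaf': [0, 0.1],
    'min_impurity_decrease': [0, 0.1, 0.2],
}


def sample_candidates(param_grid, n_iter, seed=0):
    """Hypothetical helper: draw n_iter distinct combinations from a grid of lists."""
    names = list(param_grid)
    all_combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
    rng = random.Random(seed)
    # Sampling without replacement, so no combination is evaluated twice.
    return rng.sample(all_combos, min(n_iter, len(all_combos)))


candidates = sample_candidates(parameters, n_iter=10)
n_fits = len(candidates) * 5  # assumption: 5-fold cross-validation, as in the text

print(len(candidates))  # 10 sampled combinations
print(n_fits)           # 50 model fits
```

Which 10 of the 90 combinations are drawn depends on the seed, so two runs without a fixed seed can report different best parameters.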

HalvingGridSearch

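HalvingGridSearch is very similar to GridSearch, but it discards most of the candidate combinations at each iteration (with factor=3, as in the log below, only about one third survive each round).

It starts with all hyperparameter combinations but only a small amount of data, keeps the best-performing combinations, then increases the amount of data and filters again.

The idea is very close to Hyperband, in its most basic form: screen the combinations on a small sample first, then validate the survivors on more data.

The verbose output of the search is: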

n_iterations: 3
n_required_iterations: 5
n_possible_iterations: 3
min_resources_: 20
max_resources_: 227
aggressive_elimination: False
factor: 3
----------

iter: 0
n_candidates: 90
n_resources: 20
Fitting 5 folds for each of 90 candidates, totalling 450 fits
----------

iter: 1
n_candidates: 30
n_resources: 60
Fitting 5 folds for each of 30 candidates, totalling 150 fits
----------

iter: 2
n_candidates: 10
n_resources: 180
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----------

The model's final test-set accuracy is 0.855.
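The numbers in the log above follow from simple arithmetic on factor, min_resources_ and max_resources_. The sketch below reproduces them under stated assumptions: floor_log and halving_schedule are hypothetical helpers, the values 20 and 227 are copied from the log, and the model only covers the aggressive_elimination: False case shown there. It is a simplified reading of the log, not the library's code.

```python
def floor_log(n, base):
    """Largest k with base**k <= n, using integer arithmetic (requires n >= 1)."""
    k = 0
    while base ** (k + 1) <= n:
        k += 1
    return k


def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Hypothetical helper: (n_candidates, n_resources) for each iteration."""
    # Iterations needed to shrink the candidate set down by repeated division.
    n_required = 1 + floor_log(n_candidates, factor)
    # Iterations the data allows before n_resources would exceed max_resources.
    n_possible = 1 + floor_log(max_resources // min_resources, factor)
    schedule = []
    for it in range(min(n_required, n_possible)):
        n_resources = min(min_resources * factor ** it, max_resources)
        schedule.append((n_candidates, n_resources))
        n_candidates = -(-n_candidates // factor)  # ceiling division
    return n_required, n_possible, schedule


n_required, n_possible, schedule = halving_schedule(90, 20, 227)
print(n_required, n_possible)  # 5 3
for it, (cands, res) in enumerate(schedule):
    print(f"iter: {it}  n_candidates: {cands}  n_resources: {res}")
```

Because only 3 of the 5 required iterations are possible with 227 training rows, the search ends with 10 candidates still in play at 180 samples. In scikit-learn the search itself is HalvingGridSearchCV; as far as we know it is still marked experimental and has to be enabled with `from sklearn.experimental import enable_halving_search_cv` before it can be imported from sklearn.model_selection.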

HalvingRandomSearch

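HalvingRandomSearch works like HalvingGridSearch: it gradually increases the number of samples while reducing the number of hyperparameter combinations. The difference is that the starting combinations are drawn at random instead of being enumerated exhaustively.

The verbose output of the search is: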

n_iterations: 3
n_required_iterations: 3
n_possible_iterations: 3
min_resources_: 20
max_resources_: 227
aggressive_elimination: False
factor: 3
----------

iter: 0
n_candidates: 11
n_resources: 20
Fitting 5 folds for each of 11 candidates, totalling 55 fits
----------

iter: 1
n_candidates: 4
n_resources: 60
Fitting 5 folds for each of 4 candidates, totalling 20 fits
----------

iter: 2
n_candidates: 2
n_resources: 180
Fitting 5 folds for each of 2 candidates, totalling 10 fits

The model's final test-set accuracy is 0.828.
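The selection loop behind this log can be sketched in a few lines. Everything here is illustrative: toy_score is a made-up stand-in for a cross-validated score (its noise shrinks as more samples are used), halving_random_search is a hypothetical helper, and the starting count of 11 candidates is copied from the log (it equals 227 // 20). No real model is trained.

```python
import math
import random
from itertools import product

parameters = {
    'max_depth': [2, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 3],
    'min_weight_fraction_leaf': [0, 0.1],
    'min_impurity_decrease': [0, 0.1, 0.2],
}


def toy_score(params, n_resources, rng):
    """Made-up score: a fixed 'quality' plus noise that shrinks with more data."""
    quality = params['max_depth'] / 7 - params['min_impurity_decrease']
    return quality + rng.gauss(0, 1 / math.sqrt(n_resources))


def halving_random_search(param_grid, n_candidates, min_resources,
                          max_resources, factor=3, seed=0):
    """Hypothetical helper: random start, then keep the top 1/factor each round."""
    rng = random.Random(seed)
    names = list(param_grid)
    grid = [dict(zip(names, values)) for values in product(*param_grid.values())]
    survivors = rng.sample(grid, n_candidates)  # random initial candidates
    history = []
    n_resources = min_resources
    while n_resources <= max_resources and len(survivors) > 1:
        history.append((len(survivors), n_resources))
        ranked = sorted(survivors,
                        key=lambda p: toy_score(p, n_resources, rng),
                        reverse=True)
        survivors = ranked[:math.ceil(len(survivors) / factor)]
        n_resources *= factor  # survivors are re-evaluated on more data
    return survivors, history


survivors, history = halving_random_search(parameters, 11, 20, 227)
print(history)       # candidates and samples per iteration
print(survivors[0])  # the combination that survived the last round
```

The history reproduces the 11, 4, 2 candidate counts at 20, 60 and 180 samples from the log; which combination wins depends entirely on the toy score and the seed.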

Summary and Comparison

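HalvingGridSearch and HalvingRandomSearch are better suited to large datasets, where they can speed up the search. If compute is plentiful, GridSearch and HalvingGridSearch tend to give better results, because they start from every combination in the search space.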
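A rough way to compare the four searches is to add up the fit counts reported above, weighting each fit by the number of samples it was given. The sketch below does that with the figures from the text and logs. The "sample cost" is only a proxy that assumes training time grows linearly with the number of samples, and it ignores that each fold trains on about four fifths of the listed samples (the same ratio for every method).

```python
# (n_fits, n_resources) per stage, copied from the fit counts and logs above.
# GridSearch and RandomizedSearch use all 227 training rows for every fit.
runs = {
    'GridSearch': [(450, 227)],
    'RandomizedSearch': [(50, 227)],
    'HalvingGridSearch': [(450, 20), (150, 60), (50, 180)],
    'HalvingRandomSearch': [(55, 20), (20, 60), (10, 180)],
}


def total_fits(stages):
    """Total number of models trained across all stages."""
    return sum(n_fits for n_fits, _ in stages)


def sample_cost(stages):
    """Proxy cost: fits weighted by the number of samples each one sees."""
    return sum(n_fits * n_resources for n_fits, n_resources in stages)


for name, stages in runs.items():
    print(f"{name:20s} fits={total_fits(stages):4d} sample_cost={sample_cost(stages)}")
```

By this proxy HalvingGridSearch runs more fits than GridSearch (650 against 450) but at roughly a quarter of the sample cost (27,000 against 102,150), because most of its fits use only 20 or 60 rows.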

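In follow-up posts we will cover some more advanced tuning libraries, several of which also vary the amount of data used. In Optuna, for example, the core ideas are how parameter combinations are generated and pruned, and how the number of training samples is increased.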
