Classification Algorithm Summary - LightGBM Classification Modeling - "Machine Learning Study Notes"

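Code source: the base_line of the training round of the 2nd China Mobile "Wutong Cup" Digital-Intelligent Village competition (Zhejiang Division)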
https://www.datacastle.cn/project_content.html?type=project&id=6060

LightGBM modeling with 5-fold cross-validation
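To make the fold arithmetic concrete, here is a minimal sketch in plain Python of how 5-fold splitting divides the 35,000 labelled rows reported in the output further down. `fold_sizes` is a hypothetical helper written for this note, not a scikit-learn function; it assumes the documented `KFold` rule that the first `n_samples % n_splits` folds receive one extra row.

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes under the KFold rule: the first
    n_samples % n_splits folds receive one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples, n_splits = 35000, 5
valid_sizes = fold_sizes(n_samples, n_splits)
# Every row that is not in the validation fold is used for training.
train_sizes = [n_samples - v for v in valid_sizes]

print(valid_sizes)
print(train_sizes)
```

Each fold therefore trains on 28,000 rows and validates on 7,000, which agrees with the "Number of data points in the train set: 28000" lines in the training log.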

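# Modeling
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def lgb_model(train_x, train_y, test_x):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training rows
    test = np.zeros(test_x.shape[0])    # test predictions, averaged over the folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # KFold yields positional indices, so both X and y are indexed with .iloc
        trn_x, val_x = train_x.iloc[train_index], train_x.iloc[valid_index]
        trn_y, val_y = train_y.iloc[train_index], train_y.iloc[valid_index]
        train_matrix = lgb.Dataset(trn_x, label=trn_y)
        valid_matrix = lgb.Dataset(val_x, label=val_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 2 ** 5,
            'learning_rate': 0.1,
            'seed': 2022,
            # nthread and n_jobs both set num_threads; LightGBM warns and uses n_jobs=24
            'nthread': 28,
            'n_jobs': 24,
        }
        # Log every 200 rounds; stop once validation AUC has not improved for 200 rounds.
        # Callbacks (LightGBM >= 3.3) replace the verbose_eval / early_stopping_rounds
        # arguments, which were removed from lgb.train() in LightGBM 4.0.
        model = lgb.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                          callbacks=[lgb.log_evaluation(200), lgb.early_stopping(200)])
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits  # accumulate so that test ends up as the fold average
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    # The label below is spelled as in the recorded output
    print("%s_scotrainre_list:" % 'lgb', cv_scores)
    print("%s_score_mean:" % 'lgb', np.mean(cv_scores))
    print("%s_score_std:" % 'lgb', np.std(cv_scores))
    return test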
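
Averaging the test predictions over the folds only works if each fold's contribution is accumulated. The sketch below illustrates this with made-up numbers: `fold_preds` is an invented stand-in for the five `model.predict(test_x)` results, not data from the competition. It compares accumulating `pred / n_splits` with plainly assigning it.

```python
import numpy as np

n_splits = 5
rng = np.random.default_rng(2022)
# Illustrative stand-in for model.predict(test_x) in each fold (made-up data).
fold_preds = rng.random((n_splits, 4))

accumulated = np.zeros(4)
for pred in fold_preds:
    accumulated += pred / n_splits  # add this fold's share of the average

overwritten = np.zeros(4)
for pred in fold_preds:
    overwritten = pred / n_splits   # keeps only the last fold, scaled down

# Accumulation reproduces the mean over folds.
print(np.allclose(accumulated, fold_preds.mean(axis=0)))
# Assignment leaves one fifth of the last fold's prediction.
print(np.allclose(overwritten, fold_preds[-1] / n_splits))
```

With plain assignment the returned scores keep their rank order, so AUC computed on them would be unchanged, but they are no longer probabilities and they reflect a single fold's model rather than the ensemble.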

# Split the dataset, train the model, and predict on the test rows
train_x = data[data['flag'].notnull()][used_feat].copy()
train_y = data[data['flag'].notnull()]['flag']
test_x = data[data['flag'].isnull()][used_feat].copy()
print(train_x.shape, test_x.shape)
lgb_test = lgb_model(train_x, train_y, test_x)
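
The split above relies on the convention that test rows are the ones whose `flag` is missing. A minimal sketch on a toy frame shows the same pattern; the column names `f1` and `f2` and all values are invented for illustration. It also shows that the filtered label Series keeps its original row labels, which is why positional fold indices are safest applied through `.iloc`.

```python
import numpy as np
import pandas as pd

# Toy data: rows with a missing `flag` play the role of the unlabeled test set.
data = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [0.1, 0.2, 0.3, 0.4, 0.5],
    "flag": [1.0, np.nan, 0.0, np.nan, 1.0],
})
used_feat = ["f1", "f2"]

labeled = data["flag"].notna()
train_x = data.loc[labeled, used_feat].copy()
train_y = data.loc[labeled, "flag"]
test_x = data.loc[~labeled, used_feat].copy()

print(train_x.shape, test_x.shape)
# The filtered Series keeps the original row labels (0, 2, 4), so the
# positions produced by KFold should be applied with .iloc.
print(list(train_y.index))
print(train_y.iloc[[1, 2]].tolist())
```

Calling `reset_index(drop=True)` on the training frame and labels after filtering is another way to make labels and positions coincide.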

(35000, 52) (15000, 52)
************************************ 1 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5605, number of negative: 22395
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001488 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200179 -> initscore=-1.385179
[LightGBM] [Info] Start training from score -1.385179
Training until validation scores don't improve for 200 rounds
[200]    training's auc: 0.971499    valid_1's auc: 0.89008
Early stopping, best iteration is:
[188]    training's auc: 0.969951    valid_1's auc: 0.890539
[0.8905392049233034]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5627, number of negative: 22373
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003396 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5623
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200964 -> initscore=-1.380278
[LightGBM] [Info] Start training from score -1.380278
Training until validation scores don't improve for 200 rounds
[200]    training's auc: 0.970912    valid_1's auc: 0.876217
Early stopping, best iteration is:
[93]    training's auc: 0.944954    valid_1's auc: 0.879035
[0.8905392049233034, 0.879034975812047]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5634, number of negative: 22366
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003407 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5617
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201214 -> initscore=-1.378722
[LightGBM] [Info] Start training from score -1.378722
Training until validation scores don't improve for 200 rounds
[200]    training's auc: 0.97181    valid_1's auc: 0.888507
Early stopping, best iteration is:
[119]    training's auc: 0.953518    valid_1's auc: 0.88972
[0.8905392049233034, 0.879034975812047, 0.8897195773363378]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5622, number of negative: 22378
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003478 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200786 -> initscore=-1.381391
[LightGBM] [Info] Start training from score -1.381391
Training until validation scores don't improve for 200 rounds
[200]    training's auc: 0.970869    valid_1's auc: 0.893076
Early stopping, best iteration is:
[114]    training's auc: 0.950973    valid_1's auc: 0.89486
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5600, number of negative: 22400
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003464 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5632
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200000 -> initscore=-1.386294
[LightGBM] [Info] Start training from score -1.386294
Training until validation scores don't improve for 200 rounds
[200]    training's auc: 0.972879    valid_1's auc: 0.885969
[400]    training's auc: 0.987334    valid_1's auc: 0.886842
Early stopping, best iteration is:
[330]    training's auc: 0.984063    valid_1's auc: 0.887086
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_scotrainre_list: [0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_score_mean: 0.8882480904314992
lgb_score_std: 0.005241551089184859
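
The summary figures at the end of the log can be re-derived from the five per-fold AUCs. The sketch below uses only the standard library and assumes the scores exactly as printed above; it also makes explicit that `np.std` defaults to the population standard deviation (`ddof=0`), which is slightly smaller than the sample standard deviation.

```python
import statistics

# Per-fold validation AUCs copied from the log above.
cv_scores = [
    0.8905392049233034,
    0.879034975812047,
    0.8897195773363378,
    0.8948604591836734,
    0.8870862349021346,
]

mean_auc = statistics.fmean(cv_scores)
pop_std = statistics.pstdev(cv_scores)    # ddof=0, the np.std default
sample_std = statistics.stdev(cv_scores)  # ddof=1, slightly larger

print(round(mean_auc, 6), round(pop_std, 6), round(sample_std, 6))
```

With only five folds the gap between the two estimates is noticeable, so it is worth stating which one is reported when comparing models.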