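Code source: baseline for the training round of the 2nd China Mobile "梧桐杯" (Wutong Cup) competition, Digital-Intelligent Village track (Zhejiang division)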
https://www.datacastle.cn/project_content.html?type=project&id=6060
LightGBM (lgb) modeling with 5-fold cross-validation
# Modeling
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def lgb_model(train_x, train_y, test_x):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training rows
    test = np.zeros(test_x.shape[0])    # test predictions, averaged over the folds
    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # KFold yields positional indices, so rows are selected with .iloc
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        train_matrix = lgb.Dataset(trn_x, label=trn_y)
        valid_matrix = lgb.Dataset(val_x, label=val_y)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 2 ** 5,
            'learning_rate': 0.1,
            'seed': 2022,
            # nthread and n_jobs are both aliases of num_threads; n_jobs=24 takes effect
            'nthread': 28,
            'n_jobs': 24,
        }

        # Early stopping and periodic logging via callbacks (available in LightGBM >= 3.3)
        model = lgb.train(params, train_matrix, 50000,
                          valid_sets=[train_matrix, valid_matrix],
                          callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)])
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        train[valid_index] = val_pred
        # Accumulate each fold's share so that test ends up as the fold average
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)

    print("%s_scotrainre_list:" % 'lgb', cv_scores)
    print("%s_score_mean:" % 'lgb', np.mean(cv_scores))
    print("%s_score_std:" % 'lgb', np.std(cv_scores))
    return test
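
The test prediction produced by a K-fold loop of this kind is normally the average of the per-fold predictions, so each fold's contribution has to be accumulated rather than assigned. A minimal sketch with made-up numbers (the `fold_preds` array is hypothetical and not taken from the competition data) shows the difference between the two:

```python
import numpy as np

# Hypothetical per-fold test predictions: 5 folds x 3 test rows.
fold_preds = np.array([
    [0.10, 0.80, 0.50],
    [0.20, 0.70, 0.50],
    [0.30, 0.90, 0.40],
    [0.20, 0.80, 0.60],
    [0.20, 0.80, 0.50],
])
n_splits = fold_preds.shape[0]

# Accumulating each fold's share yields the average over all folds.
test_avg = np.zeros(fold_preds.shape[1])
for pred in fold_preds:
    test_avg += pred / n_splits

# Plain assignment keeps only the last fold, scaled down by n_splits.
test_last = np.zeros(fold_preds.shape[1])
for pred in fold_preds:
    test_last = pred / n_splits

print("accumulated:", test_avg)
print("assigned:   ", test_last)
```

With plain assignment the result still ranks the test rows exactly as the last fold's model does, but the other four models are discarded and the values are shrunk by a factor of five, so they can no longer be read as probabilities.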
# Split the dataset, train the model, and predict
train_x = data[data['flag'].notnull()][used_feat].copy()
train_y = data[data['flag'].notnull()]['flag']
test_x = data[data['flag'].isnull()][used_feat].copy()
print(train_x.shape, test_x.shape)
lgb_test = lgb_model(train_x, train_y, test_x)
(35000, 52) (15000, 52)
************************************ 1 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5605, number of negative: 22395
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001488 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200179 -> initscore=-1.385179
[LightGBM] [Info] Start training from score -1.385179
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.971499 valid_1's auc: 0.89008
Early stopping, best iteration is:
[188] training's auc: 0.969951 valid_1's auc: 0.890539
[0.8905392049233034]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5627, number of negative: 22373
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003396 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5623
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200964 -> initscore=-1.380278
[LightGBM] [Info] Start training from score -1.380278
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.970912 valid_1's auc: 0.876217
Early stopping, best iteration is:
[93] training's auc: 0.944954 valid_1's auc: 0.879035
[0.8905392049233034, 0.879034975812047]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5634, number of negative: 22366
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003407 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5617
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201214 -> initscore=-1.378722
[LightGBM] [Info] Start training from score -1.378722
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.97181 valid_1's auc: 0.888507
Early stopping, best iteration is:
[119] training's auc: 0.953518 valid_1's auc: 0.88972
[0.8905392049233034, 0.879034975812047, 0.8897195773363378]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5622, number of negative: 22378
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003478 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200786 -> initscore=-1.381391
[LightGBM] [Info] Start training from score -1.381391
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.970869 valid_1's auc: 0.893076
Early stopping, best iteration is:
[114] training's auc: 0.950973 valid_1's auc: 0.89486
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5600, number of negative: 22400
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003464 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5632
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200000 -> initscore=-1.386294
[LightGBM] [Info] Start training from score -1.386294
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.972879 valid_1's auc: 0.885969
[400] training's auc: 0.987334 valid_1's auc: 0.886842
Early stopping, best iteration is:
[330] training's auc: 0.984063 valid_1's auc: 0.887086
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_scotrainre_list: [0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_score_mean: 0.8882480904314992
lgb_score_std: 0.005241551089184859
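
Two kinds of figures in the log above can be reproduced by hand, which is a useful sanity check when reading LightGBM output. The sketch below copies the fold 1 class counts and the five validation AUCs from the log. It assumes that the `initscore` reported for the binary objective is the log-odds of the positive rate `pavg`; this relation is inferred from the logged values, not taken from the original code. The summary lines come from `np.mean` and `np.std`, and `np.std` defaults to the population standard deviation (`ddof=0`).

```python
import math

import numpy as np

# Fold 1 class counts, copied from the log.
n_pos, n_neg = 5605, 22395
pavg = n_pos / (n_pos + n_neg)

# Assumed relation: initial score = log-odds of the positive rate.
init_score = math.log(pavg / (1.0 - pavg))
print("pavg:", round(pavg, 6), "initscore:", round(init_score, 6))

# Per-fold validation AUCs, copied from the log.
cv_scores = [0.8905392049233034, 0.879034975812047, 0.8897195773363378,
             0.8948604591836734, 0.8870862349021346]

score_mean = np.mean(cv_scores)
score_std = np.std(cv_scores)                 # population std (ddof=0), as logged
score_std_sample = np.std(cv_scores, ddof=1)  # sample std, for comparison
print("mean:", score_mean)
print("std (ddof=0):", score_std)
print("std (ddof=1):", score_std_sample)
```

The sample standard deviation (`ddof=1`) is slightly larger than the logged value; with only five folds the choice between the two is worth stating when the spread is reported.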