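Code source: baseline for the training round of the 2nd China Mobile "Wutong Cup" (梧桐杯), Digital-Intelligent Village track (Zhejiang division)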
https://www.datacastle.cn/project_content.html?type=project&id=6060
LightGBM (lgb) modeling with 5-fold cross-validation
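# Modeling
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def lgb_model(train_x, train_y, test_x):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training rows
    test = np.zeros(test_x.shape[0])    # test predictions averaged over the folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # KFold yields positional indices, so index both X and y with .iloc
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        train_matrix = lgb.Dataset(trn_x, label=trn_y)
        valid_matrix = lgb.Dataset(val_x, label=val_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 2 ** 5,
            'learning_rate': 0.1,
            'seed': 2022,
            # 'nthread' and 'n_jobs' are both aliases of num_threads;
            # LightGBM warns about the conflict and uses n_jobs=24
            'nthread': 28,
            'n_jobs': 24,
        }
        # Note: LightGBM >= 4.0 removed verbose_eval and early_stopping_rounds from train();
        # there, pass callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)] instead.
        model = lgb.train(params, train_matrix, 50000,
                          valid_sets=[train_matrix, valid_matrix],
                          verbose_eval=200, early_stopping_rounds=200)
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        train[valid_index] = val_pred
        # Accumulate (+=) so that test ends up as the mean over all folds;
        # plain assignment would keep only the last fold divided by n_splits
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    # The "scotrainre" label is a typo for "score"; it is kept so the printout matches the logged run
    print("%s_scotrainre_list:" % 'lgb', cv_scores)
    print("%s_score_mean:" % 'lgb', np.mean(cv_scores))
    print("%s_score_std:" % 'lgb', np.std(cv_scores))
    return test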
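lgb_model divides each fold's test prediction by kf.n_splits, which only produces an average if the per-fold pieces are summed across folds. The sketch below illustrates the difference between accumulating and overwriting. It uses plain Python lists and made-up numbers; fold_preds, averaged and overwritten are hypothetical names, not part of the baseline.

```python
# Hypothetical test-set predictions from 5 fold models for 3 test rows
# (made-up numbers, for illustration only).
fold_preds = [
    [0.10, 0.80, 0.40],
    [0.20, 0.70, 0.50],
    [0.15, 0.90, 0.45],
    [0.05, 0.85, 0.35],
    [0.25, 0.75, 0.55],
]
n_splits = len(fold_preds)

# Accumulate: every fold contributes pred / n_splits, so the result is the mean.
averaged = [0.0] * len(fold_preds[0])
for pred in fold_preds:
    averaged = [a + p / n_splits for a, p in zip(averaged, pred)]

# Overwrite: only the last fold survives, scaled down by n_splits.
overwritten = [0.0] * len(fold_preds[0])
for pred in fold_preds:
    overwritten = [p / n_splits for p in pred]

print("accumulated:", averaged)
print("overwritten:", overwritten)
```

With NumPy arrays, as in lgb_model, the accumulating form is simply test += test_pred / kf.n_splits.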
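# Split the dataset, train the model, and predict
# (rows with a non-null 'flag' are the labeled training set; rows with a null 'flag' are the test set)
train_x = data[data['flag'].notnull()][used_feat].copy()
train_y = data[data['flag'].notnull()]['flag']
test_x = data[data['flag'].isnull()][used_feat].copy()
print(train_x.shape, test_x.shape)
lgb_test = lgb_model(train_x, train_y, test_x)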
(35000, 52) (15000, 52)
************************************ 1 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5605, number of negative: 22395
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001488 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200179 -> initscore=-1.385179
[LightGBM] [Info] Start training from score -1.385179
Training until validation scores don't improve for 200 rounds
[200]  training's auc: 0.971499  valid_1's auc: 0.89008
Early stopping, best iteration is:
[188]  training's auc: 0.969951  valid_1's auc: 0.890539
[0.8905392049233034]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5627, number of negative: 22373
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003396 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5623
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200964 -> initscore=-1.380278
[LightGBM] [Info] Start training from score -1.380278
Training until validation scores don't improve for 200 rounds
[200]  training's auc: 0.970912  valid_1's auc: 0.876217
Early stopping, best iteration is:
[93]  training's auc: 0.944954  valid_1's auc: 0.879035
[0.8905392049233034, 0.879034975812047]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5634, number of negative: 22366
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003407 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5617
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201214 -> initscore=-1.378722
[LightGBM] [Info] Start training from score -1.378722
Training until validation scores don't improve for 200 rounds
[200]  training's auc: 0.97181  valid_1's auc: 0.888507
Early stopping, best iteration is:
[119]  training's auc: 0.953518  valid_1's auc: 0.88972
[0.8905392049233034, 0.879034975812047, 0.8897195773363378]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5622, number of negative: 22378
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003478 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5624
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200786 -> initscore=-1.381391
[LightGBM] [Info] Start training from score -1.381391
Training until validation scores don't improve for 200 rounds
[200]  training's auc: 0.970869  valid_1's auc: 0.893076
Early stopping, best iteration is:
[114]  training's auc: 0.950973  valid_1's auc: 0.89486
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] Number of positive: 5600, number of negative: 22400
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003464 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5632
[LightGBM] [Info] Number of data points in the train set: 28000, number of used features: 52
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200000 -> initscore=-1.386294
[LightGBM] [Info] Start training from score -1.386294
Training until validation scores don't improve for 200 rounds
[200]  training's auc: 0.972879  valid_1's auc: 0.885969
[400]  training's auc: 0.987334  valid_1's auc: 0.886842
Early stopping, best iteration is:
[330]  training's auc: 0.984063  valid_1's auc: 0.887086
[0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_scotrainre_list: [0.8905392049233034, 0.879034975812047, 0.8897195773363378, 0.8948604591836734, 0.8870862349021346]
lgb_score_mean: 0.8882480904314992
lgb_score_std: 0.005241551089184859
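As a rough cross-check of the log above, the summary figures can be recomputed by hand. The sketch below uses only the Python standard library. It recomputes the mean and standard deviation of the five fold AUCs, and it reads the logged initscore as the log-odds of the positive class, log(pavg / (1 - pavg)), which is an assumption that appears to match the logged values. The helper name init_score is hypothetical, introduced only for this illustration.

```python
import math
import statistics

# Per-fold validation AUCs copied from the log above.
cv_scores = [
    0.8905392049233034,
    0.879034975812047,
    0.8897195773363378,
    0.8948604591836734,
    0.8870862349021346,
]

score_mean = sum(cv_scores) / len(cv_scores)
# np.std defaults to the population standard deviation (ddof=0),
# which corresponds to statistics.pstdev rather than statistics.stdev.
score_std = statistics.pstdev(cv_scores)


def init_score(n_pos, n_neg):
    """Log-odds of the positive class: log(p / (1 - p)) = log(n_pos / n_neg)."""
    return math.log(n_pos / n_neg)


# Fold 1 in the log: 5605 positives and 22395 negatives.
fold1_init = init_score(5605, 22395)

print("mean AUC:", score_mean)
print("std AUC:", score_std)
print("fold 1 initscore:", fold1_init)
```

The spread across folds is small relative to the mean, while the gap between training AUC and validation AUC in each fold is much larger, so the early-stopped iteration counts are worth reading alongside the averaged score.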
