Parameters Tuning

Some parameters already have sensible defaults, so we only need to adjust a few of the others to make the whole training process reach a good result.
As mentioned before, traditional tools grow trees depth-wise (level-wise), in which case the number of leaves satisfies: num_leaves = 2^(max_depth). LightGBM grows trees leaf-wise instead, so this equation does not hold for it, and there are three parameters we need to tune together: the number of leaves (tree complexity), the maximum depth, and the minimal amount of data in one leaf. A short sketch follows the list below.

  • num_leaves: the number of leaves. To prevent over-fitting, it is usually chosen so that num_leaves < 2^(max_depth). For example, if we set the tree depth to 7, then num_leaves = 127 may cause over-fitting, while setting it to 70 or 80 can give better accuracy.
  • min_data_in_leaf: the minimal amount of data in one leaf. Its best value mainly depends on the number of training samples and on num_leaves. Setting it to a large value avoids growing the tree too deep, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.
  • max_depth: the depth of the tree.
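
    As a minimal illustration, the sketch below tunes these three parameters together with the Python package (the data here is randomly generated and the values are placeholders, not recommendations):

      import lightgbm as lgb
      import numpy as np

      # Hypothetical training data; replace with your own.
      X = np.random.rand(10000, 20)
      y = np.random.randint(0, 2, size=10000)
      train_set = lgb.Dataset(X, label=y)

      params = {
          "objective": "binary",
          "max_depth": 7,           # cap the depth explicitly
          "num_leaves": 70,         # < 2^7 = 128, to limit over-fitting
          "min_data_in_leaf": 100,  # hundreds or more for large datasets
      }
      booster = lgb.train(params, train_set, num_boost_round=100)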

    For Faster Speed

  • Use bagging by setting bagging_fraction and bagging_freq
  • Use sub-sampling of features by setting feature_fraction
  • Use a smaller max_bin
  • Use save_binary to speed up data loading in future runs
  • Use parallel learning (see the Parallel Learning Guide); an illustrative parameter sketch follows this list
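
    For instance, a minimal sketch of speed-oriented settings (illustrative placeholder values, not tuned recommendations):

      # Illustrative speed-oriented parameters; combine with the tips above.
      params = {
          "bagging_fraction": 0.8,  # row sub-sampling (needs bagging_freq > 0)
          "bagging_freq": 5,        # re-sample rows every 5 iterations
          "feature_fraction": 0.8,  # column sub-sampling per tree
          "max_bin": 63,            # fewer bins, faster histogram building
      }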

    For Better Accuracy

  • Use a larger max_bin (training will be slower)
  • Use a smaller learning_rate with a larger num_iterations
  • Use a larger num_leaves (may cause over-fitting)
  • Use more training data
  • Try dart; a sketch of these settings follows this list
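
    A sketch of accuracy-oriented settings (illustrative values only; a small learning_rate with many iterations is best combined with early stopping on a validation set):

      # Illustrative accuracy-oriented parameters.
      params = {
          "boosting": "dart",      # try dart, as suggested above
          "learning_rate": 0.02,   # smaller shrinkage...
          "num_iterations": 1000,  # ...compensated by more boosting rounds
          "num_leaves": 127,       # bigger trees (watch for over-fitting)
          "max_bin": 511,          # finer histograms (slower training)
      }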

    Deal with Over-fitting

  • Use a smaller max_bin
  • Use a smaller num_leaves
  • Use min_data_in_leaf and min_sum_hessian_in_leaf
  • Use bagging by setting bagging_fraction and bagging_freq
  • Use sub-sampling of features by setting feature_fraction
  • Use more training data
  • Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
  • Use max_depth to limit the tree depth
  • Use extra_trees
  • Try increasing path_smooth; a combined sketch follows this list
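
    A minimal sketch of these over-fitting countermeasures (illustrative values; tune each against a validation metric):

      # Illustrative regularization-oriented parameters.
      params = {
          "num_leaves": 31,
          "max_depth": 7,
          "min_data_in_leaf": 200,
          "lambda_l1": 0.1,
          "lambda_l2": 0.1,
          "min_gain_to_split": 0.01,
          "bagging_fraction": 0.8,
          "bagging_freq": 5,
          "feature_fraction": 0.8,
          "extra_trees": True,
          "path_smooth": 1.0,       # requires min_data_in_leaf >= 2
      }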

    Parameters Format

    The parameters format is key1=value1 key2=value2 .... Parameters can be set both in the config file and on the command line. When setting parameters on the command line, there must be no spaces before or after =. With a config file, a single command-line parameter pointing to that file is enough; an example follows.
    If a parameter is given both on the command line and in the config file, LightGBM uses the one from the command line.
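
    For example, with a hypothetical config file train.conf (in a config file, parameters are written one per line):

      task = train
      objective = binary
      data = train.txt
      num_leaves = 31

    Training can then be launched with a single command-line parameter, and any additional command-line parameter overrides the file:

      ./lightgbm config=train.conf learning_rate=0.05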

    Core Parameters

    config

    • default = "", type = string, aliases: config_file
  1. path of config file
  • Note: can be used only in CLI version

    task

    • default = train,
    • type = enum, options: train, predict, convert_model, refit, aliases: task_type
  1. train, for training, aliases: training
  2. predict, for prediction, aliases: prediction, test
  3. convert_model, for converting model file into if-else format, see more information in Convert Parameters
  4. refit, for refitting existing models with new data, aliases: refit_tree
  • Note: can be used only in CLI version; for language-specific packages you can use the correspondent functions

    objective

    • default = regression,
    • type = enum, options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg, aliases: objective_type, app, application

      1. regression application

    • regression, L2 loss, aliases: regression_l2, l2, mean_squared_error, mse, l2_root, root_mean_squared_error, rmse

    • regression_l1, L1 loss, aliases: l1, mean_absolute_error, mae
    • huber, Huber loss
    • fair, Fair loss
    • poisson, Poisson regression
    • quantile, Quantile regression
    • mape, MAPE loss, aliases: mean_absolute_percentage_error
    • gamma, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be gamma-distributed
    • tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be tweedie-distributed

      2. binary classification application

    • binary, binary log loss classification (or logistic regression)

    • requires labels in {0, 1}; see cross-entropy application for general probability labels in [0, 1]

      3. multi-class classification application

    • multiclass, softmax objective function, aliases: softmax

    • multiclassova, One-vs-All binary objective function, aliases: multiclass_ova, ova, ovr
    • num_class should be set as well

      4. cross-entropy application

    • cross_entropy, objective function for cross-entropy (with optional linear weights), aliases: xentropy

    • cross_entropy_lambda, alternative parameterization of cross-entropy, aliases: xentlambda
    • label is anything in interval [0, 1]

      5. ranking application

    • lambdarank, lambdarank objective. label_gain can be used to set the gain (weight) of int label and all values in label must be smaller than number of elements in label_gain

    • rank_xendcg, XE_NDCG_MART ranking objective function, aliases: xendcg, xe_ndcg, xe_ndcg_mart, xendcg_mart
    • rank_xendcg is faster than and achieves the similar performance as lambdarank
    • label should be int type, and larger number represents the higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)

    boosting

    • default = gbdt,
    • type = enum, options: gbdt, rf, dart, goss, aliases: boosting_type, boost
  1. gbdt, traditional Gradient Boosting Decision Tree, aliases: gbrt
  2. rf, Random Forest, aliases: random_forest
  3. dart, Dropouts meet Multiple Additive Regression Trees
  4. goss, Gradient-based One-Side Sampling
  • Note: internally, LightGBM uses gbdt mode for the first 1 / learning_rate iterations

    data

    • default = "",
    • type = string, aliases: train, train_data, train_data_file, data_filename
  1. path of training data, LightGBM will train from this data
  • Note: can be used only in CLI version

    valid

    • default = "",
    • type = string, aliases: test, valid_data, valid_data_file, test_data, test_data_file, valid_filenames
  1. path(s) of validation/test data, LightGBM will output metrics for these data
  2. support multiple validation data, separated by ,
  • Note: can be used only in CLI version

    num_iterations

    • default = 100,
    • type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
  1. number of boosting iterations
  • Note: internally, LightGBM constructs num_class * num_iterations trees for multi-class classification problems

    learning_rate

    • default = 0.1,
    • type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
  1. shrinkage rate
  2. in dart, it also affects on normalization weights of dropped trees

    num_leaves

    • default = 31,
    • type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: 1 < num_leaves <= 131072
  1. max number of leaves in one tree

    tree_learner

    • default = serial,
    • type = enum, options: serial, feature, data, voting, aliases: tree, tree_type, tree_learner_type
  1. serial, single machine tree learner
  2. feature, feature parallel tree learner, aliases: feature_parallel
  3. data, data parallel tree learner, aliases: data_parallel
  4. voting, voting parallel tree learner, aliases: voting_parallel
  • refer to Parallel Learning Guide to get more details

    num_threads

    • default = 0,
    • type = int, aliases: num_thread, nthread, nthreads, n_jobs
  1. number of threads for LightGBM
  2. 0 means default number of threads in OpenMP
  3. for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
  4. do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
  5. be aware a task manager or any similar CPU monitoring tool might report that cores not being fully utilized. This is normal
  6. for parallel learning, do not use all CPU cores because this will cause poor performance for the network communication
  • Note: please don’t change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors

    device_type

    • default = cpu,
    • type = enum, options: cpu, gpu, aliases: device
  1. device for the tree learning, you can use GPU to achieve the faster learning
  • Note: it is recommended to use the smaller max_bin (e.g. 63) to get the better speed up
  • Note: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set gpu_use_dp=true to enable 64-bit float point, but it will slow down the training
  • Note: refer to Installation Guide to build LightGBM with GPU support

    seed

    • default = None,
    • type = int, aliases: random_seed, random_state
  1. this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.
  2. by default, this seed is unused in favor of default values of other seeds
  3. this seed has lower priority in comparison with other seeds, which means that it will be overridden, if you set other seeds explicitly

    deterministic

    • default = false,
    • type = bool
  1. used only with cpu device type
  2. setting this to true should ensure the stable results when using the same data and the same parameters (and different num_threads)
  3. when you use the different seeds, different LightGBM versions, the binaries compiled by different compilers, or in different systems, the results are expected to be different
  4. you can raise issues in LightGBM GitHub repo when you meet the unstable results
  • Note: setting this to true may slow down the training

    Learning Control Parameters

    force_col_wise

    • default = false,
    • type = bool
  1. used only with cpu device type
  2. set this to true to force col-wise histogram building
  3. enabling this is recommended when:
    • the number of columns is large, or the total number of bins is large
    • num_threads is large, e.g. > 20
    • you want to reduce memory cost
  • Note: when both force_col_wise and force_row_wise are false, LightGBM will firstly try them both, and then use the faster one. To remove the overhead of testing set the faster one to true manually
  • Note: this parameter cannot be used at the same time with force_row_wise, choose only one of them

    force_row_wise

    • default = false,
    • type = bool
  1. used only with cpu device type
  2. set this to true to force row-wise histogram building
  3. enabling this is recommended when:
    • the number of data points is large, and the total number of bins is relatively small
    • num_threads is relatively small, e.g. <= 16
    • you want to use small bagging_fraction or goss boosting to speed up
  • Note: setting this to true will double the memory cost for Dataset object. If you have not enough memory, you can try setting force_col_wise=true
  • Note: when both force_col_wise and force_row_wise are false, LightGBM will firstly try them both, and then use the faster one. To remove the overhead of testing set the faster one to true manually
  • Note: this parameter cannot be used at the same time with force_col_wise, choose only one of them

    histogram_pool_size

    • default = -1.0,
    • type = double, aliases: hist_pool_size
  1. max cache size in MB for historical histogram
  2. < 0 means no limit

    max_depth

    • default = -1,
    • type = int
  1. limit the max depth for tree model. This is used to deal with over-fitting when #data is small. Tree still grows leaf-wise
  2. <= 0 means no limit

    min_data_in_leaf

    • default = 20,
    • type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
  1. minimal number of data in one leaf. Can be used to deal with over-fitting

    min_sum_hessian_in_leaf

    • default = 1e-3,
    • type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
  1. minimal sum hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting

    bagging_fraction

    • default = 1.0,
    • type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
  1. like feature_fraction, but this will randomly select part of data without resampling
  2. can be used to speed up training
  3. can be used to deal with over-fitting
  • Note: to enable bagging, bagging_freq should be set to a non zero value as well

    pos_bagging_fraction

    • default = 1.0,
    • type = double, aliases: pos_sub_row, pos_subsample, pos_bagging, constraints: 0.0 < pos_bagging_fraction <= 1.0
  1. used only in binary application
  2. used for imbalanced binary classification problem, will randomly sample #pos_samples * pos_bagging_fraction positive samples in bagging
  3. should be used together with neg_bagging_fraction
  4. set this to 1.0 to disable
  • Note: to enable this, you need to set bagging_freq and neg_bagging_fraction as well
  • Note: if both pos_bagging_fraction and neg_bagging_fraction are set to 1.0, balanced bagging is disabled
  • Note: if balanced bagging is enabled, bagging_fraction will be ignored

    neg_bagging_fraction

    • default = 1.0,
    • type = double, aliases: neg_sub_row, neg_subsample, neg_bagging, constraints: 0.0 < neg_bagging_fraction <= 1.0
  1. used only in binary application
  2. used for imbalanced binary classification problem, will randomly sample #neg_samples * neg_bagging_fraction negative samples in bagging
  3. should be used together with pos_bagging_fraction
  4. set this to 1.0 to disable
  • Note: to enable this, you need to set bagging_freq and pos_bagging_fraction as well
  • Note: if both pos_bagging_fraction and neg_bagging_fraction are set to 1.0, balanced bagging is disabled
  • Note: if balanced bagging is enabled, bagging_fraction will be ignored
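
    For imbalanced binary data, a hedged sketch of balanced bagging combining the two parameters above (the ratios are illustrative):

      # Keep all positive samples, down-sample negatives to 10% per iteration.
      params = {
          "objective": "binary",
          "pos_bagging_fraction": 1.0,
          "neg_bagging_fraction": 0.1,
          "bagging_freq": 1,        # must be non-zero to enable bagging
          "bagging_seed": 42,
      }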

    bagging_freq

    • default = 0,
    • type = int, aliases: subsample_freq
  1. frequency for bagging
  2. 0 means disable bagging; k means perform bagging at every k iteration
  • Note: to enable bagging, bagging_fraction should be set to value smaller than 1.0 as well

    bagging_seed

    • default = 3,
    • type = int, aliases: bagging_fraction_seed
  1. random seed for bagging

    feature_fraction

    • default = 1.0,
    • type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
  1. LightGBM will randomly select part of features on each iteration (tree) if feature_fraction smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree
  2. can be used to speed up training
  3. can be used to deal with over-fitting

    feature_fraction_bynode

    • default = 1.0,
    • type = double, aliases: sub_feature_bynode, colsample_bynode, constraints: 0.0 < feature_fraction_bynode <= 1.0
  1. LightGBM will randomly select part of features on each tree node if feature_fraction_bynode smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features at each tree node
  2. can be used to deal with over-fitting
  • Note: unlike feature_fraction, this cannot speed up training
  • Note: if both feature_fraction and feature_fraction_bynode are smaller than 1.0, the final fraction of each node is feature_fraction * feature_fraction_bynode

    feature_fraction_seed

    • default = 2,
    • type = int
  1. random seed for feature_fraction

    extra_trees

    • default = false,
    • type = bool
  1. use extremely randomized trees
  2. if set to true, when evaluating node splits LightGBM will check only one randomly-chosen threshold for each feature
  3. can be used to deal with over-fitting

    extra_seed

    • default = 6,
    • type = int
  1. random seed for selecting thresholds when extra_trees is true

    early_stopping_round

    • default = 0,
    • type = int, aliases: early_stopping_rounds, early_stopping, n_iter_no_change
  1. will stop training if one metric of one validation data doesn’t improve in last early_stopping_round rounds
  2. <= 0 means disable

    first_metric_only

    • default = false,
    • type = bool
  1. set this to true, if you want to use only the first metric for early stopping

    max_delta_step

    • default = 0.0,
    • type = double, aliases: max_tree_output, max_leaf_output
  1. used to limit the max output of tree leaves
  2. <= 0 means no constraint
  3. the final max output of leaves is learning_rate * max_delta_step

    lambda_l1

    • default = 0.0,
    • type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
  1. L1 regularization

    lambda_l2

    • default = 0.0,
    • type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
  1. L2 regularization

    min_gain_to_split

    • default = 0.0,
    • type = double, aliases: min_split_gain, constraints: min_gain_to_split >= 0.0
  1. the minimal gain to perform split

    drop_rate

    • default = 0.1,
    • type = double, aliases: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
  1. used only in dart
  2. dropout rate: a fraction of previous trees to drop during the dropout

    max_drop

    • default = 50,
    • type = int
  1. used only in dart
  2. max number of dropped trees during one boosting iteration
  3. <= 0 means no limit

    skip_drop

    • default = 0.5,
    • type = double, constraints: 0.0 <= skip_drop <= 1.0
  1. used only in dart
  2. probability of skipping the dropout procedure during a boosting iteration

    xgboost_dart_mode

    • default = false,
    • type = bool
  1. used only in dart
  2. set this to true, if you want to use xgboost dart mode

    uniform_drop

    • default = false,
    • type = bool
  1. used only in dart
  2. set this to true, if you want to use uniform drop

    drop_seed

    • default = 4,
    • type = int
  1. used only in dart
  2. random seed to choose dropping models

    top_rate

    • default = 0.2,
    • type = double, constraints: 0.0 <= top_rate <= 1.0
  1. used only in goss
  2. the retain ratio of large gradient data

    other_rate

    • default = 0.1,
    • type = double, constraints: 0.0 <= other_rate <= 1.0
  1. used only in goss
  2. the retain ratio of small gradient data

    min_data_per_group

    • default = 100,
    • type = int, constraints: min_data_per_group > 0
  1. minimal number of data per categorical group

    max_cat_threshold

    • default = 32,
    • type = int, constraints: max_cat_threshold > 0
  1. used for the categorical features

    cat_l2

    • default = 10.0,
    • type = double, constraints: cat_l2 >= 0.0
  1. used for the categorical features
  2. L2 regularization in categorical split

    cat_smooth

    • default = 10.0,
    • type = double, constraints: cat_smooth >= 0.0
  1. used for the categorical features
  2. this can reduce the effect of noises in categorical features, especially for categories with few data

    max_cat_to_onehot

    • default = 4,
    • type = int, constraints: max_cat_to_onehot > 0
  1. when number of categories of one feature smaller than or equal to max_cat_to_onehot, one-vs-other split algorithm will be used

    top_k

    • default = 20,
    • type = int, aliases: topk, constraints: top_k > 0
  1. used only in voting tree learner, refer to Voting parallel
  2. set this to larger value for more accurate result, but it will slow down the training speed

    monotone_constraints

    • default = None,
    • type = multi-int, aliases: mc, monotone_constraint
  1. used for constraints of monotonic features
  2. 1 means increasing, -1 means decreasing, 0 means non-constraint
  3. you need to specify all features in order. For example, mc=-1,0,1 means decreasing for 1st feature, non-constraint for 2nd feature and increasing for the 3rd feature
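
    As an illustration with the Python package (hypothetical three-feature dataset; a Python list is converted internally to the mc=-1,0,1 form shown above):

      # Force predictions to be decreasing in feature 0, unconstrained in
      # feature 1, and increasing in feature 2.
      params = {
          "objective": "regression",
          "monotone_constraints": [-1, 0, 1],
      }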

    monotone_constraints_method

    • default = basic,
    • type = enum, options: basic, intermediate, advanced, aliases: monotone_constraining_method, mc_method
  1. used only if monotone_constraints is set
  2. monotone constraints method

    • basic, the most basic monotone constraints method. It does not slow the library at all, but over-constrains the predictions
    • intermediate, a more advanced method, which may slow the library very slightly. However, this method is much less constraining than the basic method and should significantly improve the results
    • advanced, an even more advanced method, which may slow the library. However, this method is even less constraining than the intermediate method and should again significantly improve the results

    monotone_penalty

    • default = 0.0,
    • type = double, aliases: monotone_splits_penalty, ms_penalty, mc_penalty, constraints: monotone_penalty >= 0.0
  1. used only if monotone_constraints is set
  2. monotone penalty: a penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree. The penalty applied to monotone splits on a given depth is a continuous, increasing function of the penalization parameter
  3. if 0.0 (the default), no penalization is applied

    feature_contri

    • default = None,
    • type = multi-double, aliases: feature_contrib, fc, fp, feature_penalty
  1. used to control feature’s split gain, will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of i-th feature
  2. you need to specify all features in order

    forcedsplits_filename

    • default = "",
    • type = string, aliases: fs, forced_splits_filename, forced_splits_file, forced_splits
  1. path to a .json file that specifies splits to force at the top of every decision tree before best-first learning commences
  2. .json file can be arbitrarily nested, and each split contains feature, threshold fields, as well as left and right fields representing subsplits
  3. categorical splits are forced in a one-hot fashion, with left representing the split containing the feature value and right representing other values
  • Note: the forced split logic will be ignored, if the split makes gain worse
  • see this file as an example
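
    For illustration, a hypothetical forced-splits file using the fields described above (the feature indices and thresholds are made up; left and right subsplits are optional and can nest arbitrarily):

      {
          "feature": 0,
          "threshold": 0.5,
          "left": {
              "feature": 2,
              "threshold": 10.0
          }
      }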

    refit_decay_rate

    • default = 0.9,
    • type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
  1. decay rate of refit task, will use leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output to refit trees
  2. used only in refit task in CLI version or as argument in refit function in language-specific package

    cegb_tradeoff

    • default = 1.0,
    • type = double, constraints: cegb_tradeoff >= 0.0
  1. cost-effective gradient boosting multiplier for all penalties

    cegb_penalty_split

    • default = 0.0,
    • type = double, constraints: cegb_penalty_split >= 0.0
  1. cost-effective gradient-boosting penalty for splitting a node

    cegb_penalty_feature_lazy

    • default = 0,0,...,0,
    • type = multi-double
  1. cost-effective gradient boosting penalty for using a feature
  2. applied per data point

    cegb_penalty_feature_coupled

    • default = 0,0,...,0,
    • type = multi-double
  1. cost-effective gradient boosting penalty for using a feature
  2. applied once per forest

    path_smooth

    • default = 0,
    • type = double, constraints: path_smooth >= 0.0
  1. controls smoothing applied to tree nodes
  2. helps prevent overfitting on leaves with few samples
  3. if set to zero, no smoothing is applied
  4. if path_smooth > 0 then min_data_in_leaf must be at least 2
  5. larger values give stronger regularisation

    • the weight of each node is (n / path_smooth) * w + w_p / (n / path_smooth + 1), where n is the number of samples in the node, w is the optimal node weight to minimise the loss (approximately -sum_gradients / sum_hessians), and w_p is the weight of the parent node
    • note that the parent output w_p itself has smoothing applied, unless it is the root node, so that the smoothing effect accumulates with the tree depth

    interaction_constraints

    • default = "",

    • type = string
  1. controls which features can appear in the same branch
  2. by default interaction constraints are disabled, to enable them you can specify:
  3. for CLI, lists separated by commas, e.g. [0,1,2],[2,3]
  4. for Python-package, list of lists, e.g. [[0, 1, 2], [2, 3]]
  5. for R-package, list of character or numeric vectors, e.g. list(c("var1", "var2", "var3"), c("var3", "var4")) or list(c(1L, 2L, 3L), c(3L, 4L)). Numeric vectors should use 1-based indexing, where 1L is the first feature, 2L is the second feature, etc
  6. any two features can appear in the same branch only if there exists a constraint containing both features

    verbosity

    • default = 1,
    • type = int, aliases: verbose
  1. controls the level of LightGBM’s verbosity
  2. < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug

    input_model

    • default = "",
    • type = string, aliases: model_input, model_in
  1. filename of input model
  2. for prediction task, this model will be applied to prediction data
  3. for train task, training will be continued from this model
  • Note: can be used only in CLI version

    output_model

    • default = LightGBM_model.txt,
    • type = string, aliases: model_output, model_out
  1. filename of output model in training
  • Note: can be used only in CLI version

    saved_feature_importance_type

    • default = 0,
    • type = int
  1. the feature importance type in the saved model file
  2. 0: count-based feature importance (numbers of splits are counted); 1: gain-based feature importance (values of gain are counted)
  • Note: can be used only in CLI version

    snapshot_freq

    • default = -1,
    • type = int, aliases: save_period
  1. frequency of saving model file snapshot
  2. set this to positive value to enable this function. For example, the model file will be snapshotted at each iteration if snapshot_freq=1
  • Note: can be used only in CLI version

    IO Parameters

    Dataset Parameters

    max_bin

    • default = 255,
    • type = int, constraints: max_bin > 1
  1. max number of bins that feature values will be bucketed in
  2. small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)
  3. LightGBM will auto compress memory according to max_bin. For example, LightGBM will use uint8_t for feature value if max_bin=255

    max_bin_by_feature

    • default = None,
    • type = multi-int
  1. max number of bins for each feature
  2. if not specified, will use max_bin for all features

    min_data_in_bin

    • default = 3,
    • type = int, constraints: min_data_in_bin > 0
  1. minimal number of data inside one bin
  2. use this to avoid one-data-one-bin (potential over-fitting)

    bin_construct_sample_cnt

    • default = 200000,
    • type = int, aliases: subsample_for_bin, constraints: bin_construct_sample_cnt > 0
  1. number of data that sampled to construct feature discrete bins
  2. setting this to larger value will give better training result, but may increase data loading time
  3. set this to larger value if data is very sparse
  • Note: don’t set this to small values, otherwise, you may encounter unexpected errors and poor accuracy

    data_random_seed

    • default = 1,
    • type = int, aliases: data_seed
  • random seed for sampling data to construct histogram bins

    is_enable_sparse

    • default = true,
    • type = bool, aliases: is_sparse, enable_sparse, sparse
  1. used to enable/disable sparse optimization

    enable_bundle

    • default = true,
    • type = bool, aliases: is_enable_bundle, bundle
  1. set this to false to disable Exclusive Feature Bundling (EFB), which is described in LightGBM: A Highly Efficient Gradient Boosting Decision Tree
  • Note: disabling this may cause the slow training speed for sparse datasets

    use_missing

    • default = true,
    • type = bool
  1. set this to false to disable the special handle of missing value

    zero_as_missing

    • default = false,
    • type = bool
  1. set this to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices)
  2. set this to false to use na for representing missing values

    feature_pre_filter

    • default = true,
    • type = bool
  1. set this to true to pre-filter the unsplittable features by min_data_in_leaf
  2. as dataset object is initialized only once and cannot be changed after that, you may need to set this to false when searching parameters with min_data_in_leaf, otherwise features are filtered by min_data_in_leaf firstly if you don’t reconstruct dataset object
  • Note: setting this to false may slow down the training

    pre_partition

    • default = false,
    • type = bool, aliases: is_pre_partition
  1. used for parallel learning (excluding the feature_parallel mode)
  2. true if training data are pre-partitioned, and different machines use different partitions

    two_round

    • default = false,
    • type = bool, aliases: two_round_loading, use_two_round_loading
  1. set this to true if data file is too big to fit in memory
  2. by default, LightGBM will map data file to memory and load features from memory. This will provide faster data loading speed, but may cause run out of memory error when the data file is very big
  • Note: works only in case of loading data directly from file

    header

    • default = false,
    • type = bool, aliases: has_header
  1. set this to true if input data has header
  • Note: works only in case of loading data directly from file

    label_column

    • default = "",
    • type = int or string, aliases: label
  1. used to specify the label column
  2. use number for index, e.g. label=0 means column_0 is the label
  3. add a prefix name: for column name, e.g. label=name:is_click
  • Note: works only in case of loading data directly from file

    weight_column

    • default = "",
    • type = int or string, aliases: weight
  1. used to specify the weight column
  2. use number for index, e.g. weight=0 means column_0 is the weight
  3. add a prefix name: for column name, e.g. weight=name:weight
  • Note: works only in case of loading data directly from file
  • Note: index starts from 0 and it doesn’t count the label column when passing type is int, e.g. when label is column_0, and weight is column_1, the correct parameter is weight=0

    group_column

    • default = "",
    • type = int or string, aliases: group, group_id, query_column, query, query_id
  1. used to specify the query/group id column
  2. use number for index, e.g. query=0 means column_0 is the query id
  3. add a prefix name: for column name, e.g. query=name:query_id
  • Note: works only in case of loading data directly from file
  • Note: data should be grouped by query_id
  • Note: index starts from 0 and it doesn’t count the label column when passing type is int, e.g. when label is column_0 and query_id is column_1, the correct parameter is query=0

    ignore_column

    • default = "",
    • type = multi-int or string, aliases: ignore_feature, blacklist
  1. used to specify some ignoring columns in training
  2. use number for index, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be ignored
  3. add a prefix name: for column name, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be ignored
  • Note: works only in case of loading data directly from file
  • Note: index starts from 0 and it doesn’t count the label column when passing type is int
  • Note: despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully

    categorical_feature

    • default = "",
    • type = multi-int or string, aliases: cat_feature, categorical_column, cat_column
  1. used to specify categorical features
  2. use number for index, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features
  3. add a prefix name: for column name, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
  • Note: only supports categorical with int type (not applicable for data represented as pandas DataFrame in Python-package)
  • Note: index starts from 0 and it doesn’t count the label column when passing type is int
  • Note: all values should be less than Int32.MaxValue (2147483647)
  • Note: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
  • Note: all negative values will be treated as missing values
  • Note: the output cannot be monotonically constrained with respect to a categorical feature

    forcedbins_filename

    • default = "",
    • type = string
  1. path to a .json file that specifies bin upper bounds for some or all features
  2. .json file should contain an array of objects, each containing the word feature (integer feature index) and bin_upper_bound (array of thresholds for binning)
  3. see this file as an example

    save_binary

    • default = false,
    • type = bool, aliases: is_save_binary, is_save_binary_file
  1. if true, LightGBM will save the dataset (including validation data) to a binary file. This speeds up data loading for the next time
  • Note: init_score is not saved in binary file
  • Note: can be used only in CLI version; for language-specific packages you can use the correspondent function

    Predict Parameters

    start_iteration_predict

    • default = 0,
    • type = int
  1. used only in prediction task
  2. used to specify from which iteration to start the prediction
  3. <= 0 means from the first iteration

    num_iteration_predict

    • default = -1,
    • type = int
  1. used only in prediction task
  2. used to specify how many trained iterations will be used in prediction
  3. <= 0 means no limit

    predict_raw_score

    • default = false,
    • type = bool, aliases: is_predict_raw_score, predict_rawscore, raw_score
  1. used only in prediction task
  2. set this to true to predict only the raw scores
  3. set this to false to predict transformed scores

    predict_leaf_index

    • default = false,
    • type = bool, aliases: is_predict_leaf_index, leaf_index
  1. used only in prediction task
  2. set this to true to predict with leaf index of all trees

    predict_contrib

    • default = false,
    • type = bool, aliases: is_predict_contrib, contrib
  1. used only in prediction task
  2. set this to true to estimate SHAP values, which represent how each feature contributes to each prediction
  3. produces #features + 1 values where the last value is the expected value of the model output over the training data
  • Note: if you want to get more explanation for your model’s predictions using SHAP values like SHAP interaction values, you can install shap package
  • Note: unlike the shap package, with predict_contrib we return a matrix with an extra column, where the last column is the expected value
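
    In the Python package, the equivalent switch is the pred_contrib argument of predict; a minimal sketch, assuming a trained booster and a feature matrix X as in the earlier examples:

      # Returns an array of shape [n_samples, n_features + 1]; the last
      # column is the expected value of the model output.
      contrib = booster.predict(X, pred_contrib=True)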

    predict_disable_shape_check

    • default = false,
    • type = bool
  1. used only in prediction task
  2. control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
  3. if false (the default), a fatal error will be raised if the number of features in the dataset you predict on differs from the number seen during training
  4. if true, LightGBM will attempt to predict on whatever data you provide. This is dangerous because you might get incorrect predictions, but you could use it in situations where it is difficult or expensive to generate some features and you are very confident that they were never chosen for splits in the model
  • Note: be very careful setting this parameter to true

    pred_early_stop

    • default = false,
    • type = bool
  1. used only in prediction task
  2. if true, will use early-stopping to speed up the prediction. May affect the accuracy

    pred_early_stop_freq

    • default = 10,
    • type = int
  1. used only in prediction task
  2. the frequency of checking early-stopping prediction

    pred_early_stop_margin

    • default = 10.0,
    • type = double
  1. used only in prediction task
  2. the threshold of margin in early-stopping prediction

    output_result

    • default = LightGBM_predict_result.txt,
    • type = string, aliases: predict_result, prediction_result, predict_name, prediction_name, pred_name, name_pred
  1. used only in prediction task
  2. filename of prediction result
  • Note: can be used only in CLI version

    Convert Parameters

    convert_model_language

    • default = "",
    • type = string
  1. used only in convert_model task
  2. only cpp is supported yet; for conversion model to other languages consider using m2cgen utility
  3. if convert_model_language is set and task=train, the model will be also converted
  • Note: can be used only in CLI version

    convert_model

    • default = gbdt_prediction.cpp,
    • type = string, aliases: convert_model_file
  1. used only in convert_model task
  2. output filename of converted model
  • Note: can be used only in CLI version

    Objective Parameters

    objective_seed

    • default = 5,
    • type = int
  1. used only in rank_xendcg objective
  2. random seed for objectives, if random process is needed

    num_class

    • default = 1,
    • type = int, aliases: num_classes, constraints: num_class > 0
  1. used only in multi-class classification application

    is_unbalance

    • default = false,
    • type = bool, aliases: unbalance, unbalanced_sets
  1. used only in binary and multiclassova applications
  2. set this to true if training data are unbalanced
  • Note: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
  • Note: this parameter cannot be used at the same time with scale_pos_weight, choose only one of them

    scale_pos_weight

    • default = 1.0,
    • type = double, constraints: scale_pos_weight > 0.0
  1. used only in binary and multiclassova applications
  2. weight of labels with positive class
  • Note: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
  • Note: this parameter cannot be used at the same time with is_unbalance, choose only one of them

    sigmoid

    • default = 1.0,
    • type = double, constraints: sigmoid > 0.0
  1. used only in binary and multiclassova classification and in lambdarank applications
  2. parameter for the sigmoid function

    boost_from_average

    • default = true,
    • type = bool
  1. used only in regression, binary, multiclassova and cross-entropy applications
  2. adjusts initial score to the mean of labels for faster convergence

    reg_sqrt

    • default = false,
    • type = bool
  1. used only in regression application
  2. used to fit sqrt(label) instead of original values and prediction result will be also automatically converted to prediction^2
  3. might be useful in case of large-range labels

    alpha

    • default = 0.9,
    • type = double, constraints: alpha > 0.0
  1. used only in huber and quantile regression applications
  2. parameter for Huber loss and Quantile regression

    fair_c

    • default = 1.0,
    • type = double, constraints: fair_c > 0.0
  1. used only in fair regression application
  2. parameter for Fair loss

    poisson_max_delta_step

    • default = 0.7,
    • type = double, constraints: poisson_max_delta_step > 0.0
  1. used only in poisson regression application
  2. parameter for Poisson regression to safeguard optimization

    tweedie_variance_power

    • default = 1.5,
    • type = double, constraints: 1.0 <= tweedie_variance_power < 2.0
  1. used only in tweedie regression application
  2. used to control the variance of the tweedie distribution
  3. set this closer to 2 to shift towards a Gamma distribution
  4. set this closer to 1 to shift towards a Poisson distribution

    lambdarank_truncation_level

    • default = 30,
    • type = int, constraints: lambdarank_truncation_level > 0
  1. used only in lambdarank application
  2. controls the number of top-results to focus on during training, refer to “truncation level” in the Sec. 3 of LambdaMART paper
  3. this parameter is closely related to the desirable cutoff k in the metric NDCG@k that we aim at optimizing the ranker for. The optimal setting for this parameter is likely to be slightly higher than k (e.g., k + 3) to include more pairs of documents to train on, but perhaps not too high to avoid deviating too much from the desired target metric NDCG@k

    lambdarank_norm

    • default = true,
    • type = bool
  1. used only in lambdarank application
  2. set this to true to normalize the lambdas for different queries, and improve the performance for unbalanced data
  3. set this to false to enforce the original lambdarank algorithm

    label_gain

    • default = 0,1,3,7,15,31,63,...,2^30-1,
    • type = multi-double
  1. used only in lambdarank application
  2. relevant gain for labels. For example, the gain of label 2 is 3 in case of default label gains
  3. separate by ,
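
    Putting the ranking parameters together, a hedged sketch with the Python package (the query sizes and data are hypothetical; labels are integer relevance grades, as lambdarank requires):

      import lightgbm as lgb
      import numpy as np

      X = np.random.rand(112, 10)            # 27 + 18 + 67 hypothetical documents
      y = np.random.randint(0, 4, size=112)  # relevance grades in {0, 1, 2, 3}
      train_set = lgb.Dataset(X, label=y, group=[27, 18, 67])

      params = {
          "objective": "lambdarank",
          "label_gain": [0, 1, 3, 7],        # one gain per label value
          "lambdarank_truncation_level": 30,
      }
      booster = lgb.train(params, train_set)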

    Metric Parameters

    metric

    • default = "",
    • type = multi-enum, aliases: metrics, metric_types
  • metric(s) to be evaluated on the evaluation set(s)
    • "" (empty string or not specified) means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added)
    • "None" (string, not a None value) means that no metric will be registered, aliases: na, null, custom
    • l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1
    • l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression
    • rmse, root square loss, aliases: root_mean_squared_error, l2_root
    • quantile, Quantile regression
    • mape, MAPE loss, aliases: mean_absolute_percentage_error
    • huber, Huber loss
    • fair, Fair loss
    • poisson, negative log-likelihood for Poisson regression
    • gamma, negative log-likelihood for Gamma regression
    • gamma_deviance, residual deviance for Gamma regression
    • tweedie, negative log-likelihood for Tweedie regression
    • ndcg, NDCG, aliases: lambdarank, rank_xendcg, xendcg, xe_ndcg, xe_ndcg_mart, xendcg_mart
    • map, MAP, aliases: mean_average_precision
    • auc, AUC
    • average_precision, average precision score
    • binary_logloss, log loss, aliases: binary
    • binary_error, for one sample: 0 for correct classification, 1 for error classification
    • auc_mu, AUC-mu
    • multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
    • multi_error, error rate for multi-class classification
    • cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy
    • cross_entropy_lambda, “intensity-weighted” cross-entropy, aliases: xentlambda
    • kullback_leibler, Kullback-Leibler divergence, aliases: kldiv
  • support multiple metrics, separated by ,

    metric_freq

    • default = 1,
    • type = int, aliases: output_freq, constraints: metric_freq > 0
  1. frequency for metric output
  • Note: can be used only in CLI version

    is_provide_training_metric

    • default = false,
    • type = bool, aliases: training_metric, is_training_metric, train_metric
  1. set this to true to output metric result over training dataset
  • Note: can be used only in CLI version

    eval_at

    • default = 1,2,3,4,5,
    • type = multi-int, aliases: ndcg_eval_at, ndcg_at, map_eval_at, map_at
  1. used only with ndcg and map metrics
  2. NDCG and MAP evaluation positions, separated by ,

    multi_error_top_k

    • default = 1,
    • type = int, constraints: multi_error_top_k > 0
  1. used only with multi_error metric
  2. threshold for top-k multi-error metric
  3. the error on each sample is 0 if the true class is among the top multi_error_top_k predictions, and 1 otherwise
    • more precisely, the error on a sample is 0 if there are at least num_classes - multi_error_top_k predictions strictly less than the prediction on the true class
  4. when multi_error_top_k=1 this is equivalent to the usual multi-error metric

    auc_mu_weights

    • default = None,
    • type = multi-double
  1. used only with auc_mu metric
  2. list representing flattened matrix (in row-major order) giving loss weights for classification errors
  3. list should have n * n elements, where n is the number of classes
  4. the matrix co-ordinate [i, j] should correspond to the i * n + j-th element of the list
  5. if not specified, will use equal weights for all classes

    Network Parameters

    num_machines

    • default = 1,
    • type = int, aliases: num_machine, constraints: num_machines > 0
  1. the number of machines for parallel learning application
  2. this parameter is needed to be set in both socket and mpi versions

    local_listen_port

    • default = 12400,
    • type = int, aliases: local_port, port, constraints: local_listen_port > 0
  1. TCP listen port for local machines
  • Note: don’t forget to allow this port in firewall settings before training

    time_out

    • default = 120,
    • type = int, constraints: time_out > 0
  • socket time-out in minutes

    machine_list_filename

    • default = "",
    • type = string, aliases: machine_list_file, machine_list, mlist
  1. path of file that lists machines for this parallel learning application
  2. each line contains one IP and one port for one machine. The format is ip port (space as a separator)

    machines

    • default = "",
    • type = string, aliases: workers, nodes
  1. list of machines in the following format: ip1:port1,ip2:port2

    GPU Parameters

    gpu_platform_id

    • default = -1,
    • type = int
  1. OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform
  2. -1 means the system-wide default platform
  • Note: refer to GPU Targets for more details

    gpu_device_id

    • default = -1,
    • type = int
  1. OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID
  2. -1 means the default device in the selected platform
  • Note: refer to GPU Targets for more details

    gpu_use_dp

    • default = false,
    • type = bool
  1. set this to true to use double precision math on GPU (by default single precision is used in OpenCL implementation and double precision is used in CUDA implementation)

    num_gpu

    • default = 1,
    • type = int, constraints: num_gpu > 0
  1. number of GPUs
  • Note: can be used only in CUDA implementation

    Others

    Continued Training with Input Score

    If the name of the data file is train.txt, the initial score file should be named train.txt.init and be placed in the same folder as the data file. LightGBM will then load it automatically. The scores look like:

      0.5
      -0.1
      0.9
      ...

    Weight Data

    LightGBM supports weighted training. It uses an additional file to store the weight data, like the following:

      1.0
      0.5
      0.8
      ...

    If the name of the data file is train.txt, the weight file should be named train.txt.weight and be placed in the same folder as the data file. LightGBM will load the weight file automatically if it exists.
    Alternatively, the weight can be included as a column in the data file itself (see weight_column).

    Query Data

    For learning to rank, the query information of the training data is needed. LightGBM uses an additional file to store the query data, like the following:

      27
      18
      67
      ...

    In the data above, 27 means that the first 27 samples belong to one query, 18 means that the next 18 samples belong to another query, and so on. The data should be grouped by query.
    If the name of the data file is train.txt, the query file should be named train.txt.query and be placed in the same folder as the data file.
    Alternatively, the query/group id can be included as a column in the data file (see group_column).