🚀 Original article: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html

Dask is a parallel computing library built on Python. Dask allows easy management of distributed workers and excels at handling large distributed data science workflows. The implementation in XGBoost originates from dask-xgboost, with some extended functionality and a different interface. Right now it is still under construction and may change (with proper warnings) in the future. This tutorial focuses on basic usage of Dask with CPU tree algorithms. For an overview of GPU-based training and the internal workings, see A New, Official Dask API for XGBoost.

1、Requirements

Dask can be installed using either pip or conda (see the dask installation documentation for more information). For accelerating XGBoost with GPUs, dask-cuda is recommended for creating GPU clusters.
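
As a quick sanity check (a minimal sketch, not part of the original tutorial), you can verify that the required packages are importable and inspect the installed versions:

  import dask
  import distributed
  import xgboost

  # Print the installed versions of the core packages used in this tutorial.
  print("dask:", dask.__version__)
  print("distributed:", distributed.__version__)
  print("xgboost:", xgboost.__version__)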

2、Overview

A Dask cluster consists of three different components: a centralized scheduler, one or more workers, and one or more clients, which act as the user-facing entry point for submitting tasks to the cluster. When using XGBoost with Dask, one needs to call the XGBoost Dask interface from the client side. Below is a small example which illustrates basic usage of running XGBoost on a Dask cluster:

  import xgboost as xgb
  import dask.array as da
  import dask.distributed

  # Create a single-node cluster and connect a client to it.
  cluster = dask.distributed.LocalCluster(n_workers=4, threads_per_worker=1)
  client = dask.distributed.Client(cluster)

  # X and y must be Dask dataframes or arrays.
  num_obs = 100_000
  num_features = 20
  X = da.random.random(size=(num_obs, num_features), chunks=(1000, num_features))
  y = da.random.random(size=(num_obs, 1), chunks=(1000, 1))

  # DaskDMatrix hands the distributed data to XGBoost through the client.
  dtrain = xgb.dask.DaskDMatrix(client, X, y)

  output = xgb.dask.train(
      client,
      {'verbosity': 2,
       'tree_method': 'hist',
       'objective': 'reg:squarederror'},
      dtrain,
      num_boost_round=4,
      evals=[(dtrain, 'train')],
  )

Here we first create a cluster in single-node mode with dask.distributed.LocalCluster, then connect a dask.distributed.Client to this cluster, setting up an environment for later computation.
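
The return value of xgb.dask.train is a dictionary holding the trained booster and the per-iteration evaluation history. As a short follow-up sketch (assuming the client, X, and output objects from the example above), prediction goes through the same client-side interface and yields a lazy Dask array:

  # 'output' is a dict containing the trained model and the evaluation history.
  booster = output['booster']
  history = output['history']  # per-iteration metrics, e.g. history['train']['rmse']

  # Prediction is also dispatched through the client and returns a Dask array.
  prediction = xgb.dask.predict(client, booster, X)
  print(prediction[:5].compute())  # materialize the first few predictions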