🚀 Original article: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
Dask is a parallel computing library built on Python. Dask allows easy management of distributed workers and excels at handling large distributed data science workflows. The implementation in XGBoost originates from dask-xgboost with some extended functionalities and a different interface. Right now it is still under construction and may change (with proper warnings) in the future. The tutorial here focuses on basic usage of dask with CPU tree algorithms. For an overview of GPU based training and internal workings, see A New, Official Dask API for XGBoost.
1. Requirements
Dask can be installed using either pip or conda (see the dask installation documentation for more information). For accelerating XGBoost with GPUs, dask-cuda is recommended for creating GPU clusters.
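For example, the packages can be installed from PyPI or the conda-forge channel (a minimal sketch; adjust package sources to your environment):
# install dask and its distributed scheduler with pip
pip install dask distributed
# or with conda (conda-forge channel)
conda install -c conda-forge dask distributed
# optional: dask-cuda for building GPU clusters
pip install dask-cuda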
2. Overview
A dask cluster consists of three different components: a centralized scheduler, one or more workers, and one or more clients which act as the user-facing entry point for submitting tasks to the cluster. When using XGBoost with dask, one needs to call the XGBoost dask interface from the client side. Below is a small example which illustrates basic usage of running XGBoost on a dask cluster:
import xgboost as xgb
import dask.array as da
import dask.distributed

cluster = dask.distributed.LocalCluster(n_workers=4, threads_per_worker=1)
client = dask.distributed.Client(cluster)

# X and y must be Dask dataframes or arrays
num_obs = 1e5
num_features = 20
X = da.random.random(
    size=(num_obs, num_features),
    chunks=(1000, num_features)
)
y = da.random.random(
    size=(num_obs, 1),
    chunks=(1000, 1)
)

dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(client,
                        {'verbosity': 2,
                         'tree_method': 'hist',
                         'objective': 'reg:squarederror'},
                        dtrain,
                        num_boost_round=4, evals=[(dtrain, 'train')])
Here we first create a cluster in single-node mode with dask.distributed.LocalCluster, then connect a dask.distributed.Client to this cluster, setting up an environment for later computation.
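The dictionary returned by xgb.dask.train holds the trained model under 'booster' and the evaluation history under 'history'. As a minimal sketch of how that result might be used (variable names here are illustrative), prediction can also be run in a distributed fashion with xgb.dask.predict:

# The result of xgb.dask.train is a dict with the trained Booster and the
# evaluation history collected from the 'evals' argument.
booster = output['booster']
history = output['history']

# Distributed prediction returns a lazy dask array; call .compute() to
# materialize the results on the client.
prediction = xgb.dask.predict(client, booster, X)
print(prediction.compute())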