The purpose of this guide is to illustrate some of the main features that scikit-learn provides. It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.). Please refer to our installation instructions for installing scikit-learn.
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
1、Fitting and predicting: estimator basics
Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.
Here is a simple example where we fit a RandomForestClassifier to some very basic data:

    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(random_state=0)
    # 2 samples, 3 features
    X = [[ 1, 2, 3], [11, 12, 13]]
    # classes of each sample
    y = [0, 1]
    clf.fit(X, y)

The fit method generally accepts 2 inputs:
- The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
- The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the ith entry corresponds to the target of the ith sample (row) of X.
 
Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
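As a minimal sketch of these input conventions, the snippet below passes the same toy data first as a NumPy array and then as a SciPy CSR sparse matrix. The data and the choice of LogisticRegression are assumptions made for illustration; it is one of the estimators that accept sparse input.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# 4 samples (rows), 3 features (columns)
X_dense = np.array([[1, 2, 3], [11, 12, 13], [2, 3, 4], [12, 13, 14]])
y = np.array([0, 1, 0, 1])  # 1d array: one target per row of X
n_samples, n_features = X_dense.shape

# the same data stored as a sparse matrix
X_sparse = sparse.csr_matrix(X_dense)

# both representations go through the same fit/predict calls
pred_dense = LogisticRegression().fit(X_dense, y).predict(X_dense)
pred_sparse = LogisticRegression().fit(X_sparse, y).predict(X_sparse)
```

Not every estimator supports sparse input, so it is worth checking the documentation of the one you use.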
Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

    >>> clf.predict(X)  # predict classes of the training data
    array([0, 1])
    >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
    array([0, 1])

2、Transformers and pre-processors
Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). The transformer objects don’t have a predict method but rather a transform method that outputs a newly transformed sample matrix X:

    from sklearn.preprocessing import StandardScaler
    X = [[0, 15], [1, -10]]
    # scale data according to computed scaling values
    StandardScaler().fit(X).transform(X)

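To make the shared API more concrete, the sketch below checks that a StandardScaler is a BaseEstimator, that it exposes transform but not predict, and that fit stores the statistics it learned. The variable names are illustrative.

```python
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler

X = [[0, 15], [1, -10]]
scaler = StandardScaler().fit(X)  # fit learns the per-column mean and scale

is_estimator = isinstance(scaler, BaseEstimator)
has_predict = hasattr(scaler, "predict")
has_transform = hasattr(scaler, "transform")

# learned statistics are stored in attributes ending with an underscore
learned_mean = scaler.mean_
learned_scale = scaler.scale_

# transform applies the learned statistics: (X - mean_) / scale_
X_scaled = scaler.transform(X)
```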
Sometimes, you want to apply different transformations to different features: the ColumnTransformer is designed for these use-cases.
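A minimal sketch of that use-case, assuming a hypothetical two-column dataset in which the first column is standardized and the second is rescaled to the [0, 1] range. The step names "std" and "minmax" are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical toy data: 3 samples, 2 features
X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

ct = ColumnTransformer([
    ("std", StandardScaler(), [0]),     # standardize column 0
    ("minmax", MinMaxScaler(), [1]),    # rescale column 1 to [0, 1]
])

# output columns follow the order of the transformers listed above
X_t = ct.fit_transform(X)
```

Columns that are not listed are dropped by default; the remainder parameter of ColumnTransformer controls this behaviour.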
3、Pipelines: chaining pre-processors and estimators
Transformers and estimators (predictors) can be combined into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. As we will see later, using a pipeline will also protect you against data leakage, i.e. disclosing some testing data in your training data.
In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    # create a pipeline object
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    # load the iris dataset and split it into train and test sets
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # fit the whole pipeline
    pipe.fit(X_train, y_train)
    # we can now use it like any other estimator
    accuracy_score(pipe.predict(X_test), y_test)

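One way to see why a pipeline helps against data leakage is to inspect the fitted scaler inside it. The sketch below rebuilds the same pipeline and checks that the scaler's statistics were computed from the training split only. It relies on make_pipeline naming each step after its lowercased class name; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# steps created by make_pipeline are named after their lowercased class
scaler = pipe.named_steps["standardscaler"]

# the scaler's mean comes from the training rows, not from the full dataset
uses_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))

# score runs the same steps on the test data and returns the accuracy
test_accuracy = pipe.score(X_test, y_test)
```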
4、Model evaluation
Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation.
Here we briefly show how to perform a 5-fold cross-validation procedure, using the cross_validate helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. Please refer to our User Guide for more details:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_validate
    X, y = make_regression(n_samples=1000, random_state=0)
    lr = LinearRegression()
    result = cross_validate(lr, X, y)  # defaults to 5-fold CV
    result['test_score']  # r_squared score is high because dataset is easy

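The alternatives mentioned above can be sketched as follows: a shuffled 3-fold splitter passed through the cv argument, a different metric selected with the scoring argument, and the same procedure written as an explicit loop over the folds. The dataset size, noise level and choice of metric are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# a different splitting strategy (shuffled 3-fold) and a different metric
cv = KFold(n_splits=3, shuffle=True, random_state=0)
result = cross_validate(LinearRegression(), X, y, cv=cv,
                        scoring="neg_mean_absolute_error")

# the same procedure written as an explicit loop over the folds
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    manual_scores.append(-mae)  # negated to match the "neg_" convention
```

Scorers whose name starts with "neg_" return the negated error, so that a higher value is always better.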
5、Automatic parameter searches
All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a RandomForestRegressor has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.
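As a small illustration, parameters are set in the constructor and can be read or changed afterwards with get_params and set_params. The values used here are arbitrary.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=10, max_depth=3)

# get_params returns the constructor parameters as a dict
params = forest.get_params()

# set_params updates them in place and returns the estimator itself
forest.set_params(max_depth=5)
```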
Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has been fitted with the best set of parameters. Read more in the User Guide:

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.model_selection import train_test_split
    from scipy.stats import randint
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # define the parameter space that will be searched over
    param_distributions = {'n_estimators': randint(1, 5),
                           'max_depth': randint(5, 10)}
    # now create a searchCV object and fit it to the data
    search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                                n_iter=5,
                                param_distributions=param_distributions,
                                random_state=0)
    search.fit(X_train, y_train)
    search.best_params_
    # the search object now acts like a normal random forest estimator
    # with max_depth=9 and n_estimators=4
    search.score(X_test, y_test)

:::info Note
In practice, you almost always want to search over a pipeline, instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you would be breaking the fundamental assumption of independence between training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets is available to the train sets. This will lead to over-estimating the generalization power of the estimator (you can read more in this Kaggle post).
Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
:::
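A minimal sketch of searching over a pipeline, under a few assumptions: it uses GridSearchCV rather than RandomizedSearchCV so that the parameter space stays explicit, and the Iris dataset, the grid of C values and max_iter=1000 are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

# the whole pipeline, scaler included, is re-fitted on the training part
# of every cross-validation split
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_score = search.score(X_test, y_test)
```

Because the scaler is part of the estimator being searched, its statistics are never computed from the held-out part of a split.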
