🚀 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/index.html

1、Statistical learning: the setting and the estimator object in scikit-learn

📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html

1.1 Datasets

Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.

A simple example shipped with scikit-learn: iris dataset
It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR.

```python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
```

When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by scikit-learn.

An example of reshaping data would be the digits dataset. The digits dataset is made of 1797 8x8 images of hand-written digits:

```python
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import matplotlib.pyplot as plt
>>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage object at ...>
```


To use this dataset with scikit-learn, we transform each 8x8 image into a feature vector of length 64

```python
>>> data = digits.images.reshape((digits.images.shape[0], -1))
```

1.2 Estimator objects

Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.

All estimator objects expose a fit method that takes a dataset (usually a 2-D array):

```python
>>> estimator.fit(data)
```

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:

```python
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
```

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:

```python
>>> estimator.estimated_param_
```
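
For instance, a minimal sketch with LinearRegression (not part of the original tutorial, just standard scikit-learn usage) shows the fit/estimated-parameter convention:

```python
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[0.], [1.], [2.]])
>>> y = np.array([0., 1., 2.])
>>> estimator = LinearRegression()          # parameters set at instantiation
>>> estimator.fit(X, y)
LinearRegression()
>>> estimator.coef_                         # estimated parameter: note the trailing underscore
array([1.])
```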

2、Supervised learning: predicting an output variable from high-dimensional observations

📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html

:::info The problem solved in supervised learning
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.

All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.:::

:::info Vocabulary: classification and regression
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.

When doing classification in scikit-learn, y is a vector of integers or strings. Note: See the Introduction to machine learning with scikit-learn Tutorial for a quick run-through on the basic machine learning vocabulary used within scikit-learn.:::

2.1 Nearest neighbor and the curse of dimensionality

Classifying irises:
The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width:

```python
>>> import numpy as np
>>> from sklearn import datasets
>>> iris_X, iris_y = datasets.load_iris(return_X_y=True)
>>> np.unique(iris_y)
array([0, 1, 2])
```


1) k-Nearest neighbors classifier

The simplest possible classifier is the nearest neighbor: given a new observation X_test, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector. (Please see the Nearest Neighbors section of the online Scikit-learn documentation for more information about this type of classifier.)

Training set and testing set: While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator, as this would not be evaluating the performance of the estimator on new data. This is why datasets are often split into train and test data.

KNN (k nearest neighbors) classification example:

```python
# Split iris data in train and test data
# A random permutation, to split the data randomly
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]

# Create and fit a nearest-neighbor classifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
knn.predict(iris_X_test)  # array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
iris_y_test               # array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
```

2) The curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value $d$, which depends on the problem. In one dimension, this requires on average $n \sim 1/d$ points. In the context of the above $k$-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with $n$ training observations, then new data will be no further away than $1/n$. Therefore, the nearest neighbor decision rule will be efficient as soon as $1/n$ is small compared to the scale of between-class feature variations.

If the number of features is $p$, you now require $n \sim 1/d^p$ points. Let's say that we require 10 points in one dimension: now $10^p$ points are required in $p$ dimensions to pave the $[0, 1]$ space. As $p$ becomes large, the number of training points required for a good estimator grows exponentially.

For example, if each point is just a single number (8 bytes), then an effective $k$-NN estimator in a paltry $p \sim 20$ dimensions would require more training data than the current estimated size of the entire internet (±1000 Exabytes or so).

This is called the curse of dimensionality and is a core problem that machine learning addresses.
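
To make the exponential growth concrete, here is a quick back-of-the-envelope sketch (not part of the original tutorial) that assumes, as above, roughly 10 points per dimension and 8 bytes per point:

```python
# Rough cost of paving [0, 1]^p with ~10 points per axis, at 8 bytes per point.
for p in (1, 2, 5, 10, 20):
    n_points = 10 ** p
    print(f"p = {p:2d}: {n_points:.0e} points, about {8 * n_points:.0e} bytes")
```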

2.2 Linear model: from regression to sparsity

:::info Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and an indication of disease progression after one year: :::

```python
>>> diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes_y[:-20]
>>> diabetes_y_test = diabetes_y[-20:]
```

The task at hand is to predict disease progression from physiological variables.

1) Linear regression

LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.

Linear models: $y = X\beta + \epsilon$

  • $X$: data
  • $y$: target variable
  • $\beta$: coefficients
  • $\epsilon$: observation noise


```python
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression()
>>> print(regr.coef_)
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]

>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
2004.56760268...

>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
```

2) Shrinkage

If there are few data points per dimension, noise in the observations induces high variance:

```python
import matplotlib.pyplot as plt

X = np.c_[.5, 1].T
y = [.5, 1]
test = np.c_[0, 2].T
regr = linear_model.LinearRegression()

plt.figure()
np.random.seed(0)
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)
```

A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:

```python
regr = linear_model.Ridge(alpha=.1)

plt.figure()
np.random.seed(0)
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)
```

This is an example of a bias/variance **tradeoff**: the larger the ridge alpha parameter, the higher the bias and the lower the variance.

We can choose alpha to minimize the left-out error, this time using the diabetes dataset rather than our synthetic data:

```python
alphas = np.logspace(-4, -1, 6)
print([regr.set_params(alpha=alpha)
           .fit(diabetes_X_train, diabetes_y_train)
           .score(diabetes_X_test, diabetes_y_test)
       for alpha in alphas])
# [0.5851110683883..., 0.5852073015444..., 0.5854677540698...,
#  0.5855512036503..., 0.5830717085554..., 0.57058999437...]
```

Note: Capturing noise in the fitted parameters that prevents the model from generalizing to new data is called overfitting. The bias introduced by the ridge regression is called regularization.
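
As a quick, hedged illustration of overfitting (not from the original tutorial): fit ridge models on a deliberately small subset of the diabetes data and compare train and test scores; with almost no regularization the train score is typically much higher than the test score.

```python
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X_train, y_train = X[:50], y[:50]      # deliberately few samples per dimension
X_test, y_test = X[-50:], y[-50:]

for alpha in (1e-6, 1.0):              # almost no regularization vs. some regularization
    ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha=%g: train R^2=%.2f, test R^2=%.2f"
          % (alpha, ridge.score(X_train, y_train), ridge.score(X_test, y_test)))
```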

3) Sparsity

Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition on such a representation, but it may be useful to keep in mind that it would be a fairly empty space.

We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y when considered with feature 1.

To improve the conditioning of the problem (i.e. mitigating the curse of dimensionality), it would be interesting to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models.

```python
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha)
...               .fit(diabetes_X_train, diabetes_y_train)
...               .score(diabetes_X_test, diabetes_y_test)
...           for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794)
>>> print(regr.coef_)
[   0.         -212.437...  517.194...  313.779... -160.830...
   -0.         -187.195...   69.382...  508.660...   71.842...]
```

:::info Different algorithms for the same problem
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso object in scikit-learn solves the lasso regression problem using a coordinate descent method, that is efficient on large datasets. However, scikit-learn also provides the LassoLars object using the LARS algorithm, which is very efficient for problems in which the weight vector estimated is very sparse (i.e. problems with very few observations). :::
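
For illustration, a small sketch (assuming the `diabetes_X_train` / `diabetes_y_train` split defined above; with recent scikit-learn defaults the two solvers should return very similar coefficients, since they optimize the same objective):

```python
from sklearn import linear_model

alpha = 0.025
lasso = linear_model.Lasso(alpha=alpha).fit(diabetes_X_train, diabetes_y_train)
lars = linear_model.LassoLars(alpha=alpha).fit(diabetes_X_train, diabetes_y_train)

# Same lasso objective, different solver: coordinate descent vs. LARS.
print(lasso.coef_)
print(lars.coef_)
```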

4) Classification

For classification, as in the labeling iris task, linear regression is not the right approach as it will give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function or logistic function:
$y = \operatorname{sigmoid}(X\beta - \text{offset}) + \epsilon = \frac{1}{1 + \exp(-X\beta + \text{offset})} + \epsilon$

```python
>>> log = linear_model.LogisticRegression(C=1e5)
>>> log.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0)
```

This is known as LogisticRegression.
:::info Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic(启发) for the final decision. :::
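
A minimal sketch of the one-versus-rest idea using scikit-learn's OneVsRestClassifier wrapper (not part of the original tutorial; LogisticRegression already handles several classes on its own, so the wrapper is shown only for illustration):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris_X, iris_y = datasets.load_iris(return_X_y=True)
# One binary classifier per class; the predicted label is the class whose
# classifier returns the highest decision score.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(iris_X, iris_y)
print(len(ovr.estimators_))  # 3 underlying binary classifiers
```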

:::info Shrinkage and sparsity with logistic regression
The C parameter controls the amount of regularization in the LogisticRegression object: a large value for C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while penalty="l1" gives Sparsity. :::
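
A hedged sketch of the effect of the penalty (it reuses the `iris_X_train` / `iris_y_train` split from above and the `liblinear` solver, which supports both penalties):

```python
from sklearn.linear_model import LogisticRegression

l2 = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(iris_X_train, iris_y_train)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(iris_X_train, iris_y_train)

# The l1 penalty typically drives some coefficients exactly to zero (sparsity),
# while the l2 penalty only shrinks them (non-sparse coefficients).
print("zero coefficients with l2:", (l2.coef_ == 0).sum())
print("zero coefficients with l1:", (l1.coef_ == 0).sum())
```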

:::info Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test prediction performance on these observations. A solution can be downloaded here. :::

```python
from sklearn import datasets, neighbors, linear_model

X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_digits = X_digits / X_digits.max()
```
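
One possible continuation of this exercise (a sketch, not the official downloadable solution): hold out the last 10% of the samples and compare a nearest-neighbor classifier with a logistic regression model.

```python
n_samples = len(X_digits)
split = int(0.9 * n_samples)
X_train, y_train = X_digits[:split], y_digits[:split]
X_test, y_test = X_digits[split:], y_digits[split:]

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=1000)

print("KNN score: %f" % knn.fit(X_train, y_train).score(X_test, y_test))
print("LogisticRegression score: %f"
      % logistic.fit(X_train, y_train).score(X_test, y_test))
```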

2.3 Support vector machines (SVMs)

1) Linear SVMs

Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the C parameter: a small value for C means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value for C means the margin is calculated on observations close to the separating line (less regularization).

:::info Example:

SVMs can be used in regression –SVR (Support Vector Regression)–, or in classification –SVC (Support Vector Classification).

```python
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(kernel='linear')
```

:::info Warning: Normalizing data
For many estimators, including the SVMs, having datasets with unit standard deviation for each feature is important to get good predictions.:::
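
One common way to do this (a sketch using `StandardScaler`, which is not shown in the original tutorial) is to standardize the features on the training set and reuse the same statistics on the test set:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(iris_X_train)          # learn per-feature mean and std
iris_X_train_scaled = scaler.transform(iris_X_train)
iris_X_test_scaled = scaler.transform(iris_X_test)   # apply the same transformation
svc.fit(iris_X_train_scaled, iris_y_train)
```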

2) Using kernels

Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by positioning kernels on observations:

i) Linear kernel

```python
>>> svc = svm.SVC(kernel='linear')
```


ii) Polynomial kernel

```python
svc = svm.SVC(kernel='poly', degree=3)
# degree: polynomial degree
```


iii) RBF kernel (Radial Basis Function)

```python
svc = svm.SVC(kernel='rbf')
# gamma: inverse of size of
# radial kernel
```

:::info Interactive example
See the SVM GUI to download svm_gui.py; add data points of both classes with right and left button, fit the model and change parameters and data. :::

:::info Exercise
Try classifying classes 1 and 2 from the iris dataset with SVMs, with the 2 first features. Leave out 10% of each class and test prediction performance on these observations. A solution can be downloaded here

Warning: the classes are ordered; do not leave out the last 10%, or you would be testing on only one class. Hint: You can use the decision_function method on a grid to get intuitions.:::

```python
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
```
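
A possible continuation (a sketch under the exercise's assumptions, not the downloadable solution): shuffle the samples so that the held-out 10% contains both classes, then score a linear SVC.

```python
import numpy as np
from sklearn import svm

np.random.seed(0)
order = np.random.permutation(len(X))
X, y = X[order], y[order]

n_train = int(0.9 * len(X))          # keep 90% for training, 10% for testing
svc = svm.SVC(kernel='linear')
svc.fit(X[:n_train], y[:n_train])
print(svc.score(X[n_train:], y[n_train:]))
```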


3、Model selection: choosing estimators and their parameters

📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

3.1 Score, and cross-validated scores

As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.

```python
>>> from sklearn import datasets, svm
>>> X_digits, y_digits = datasets.load_digits(return_X_y=True)
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.98
```

To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data in folds that we use for training and testing:

```python
import numpy as np

X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)  # [0.934..., 0.956..., 0.939...]
```

3.2 Cross-validation generators

Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.

They expose a split method which accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen cross-validation strategy.

The following example shows a basic usage of the split method.

```python
from sklearn.model_selection import KFold, cross_val_score

X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
# Train: [2 3 4 5 6 7 8 9] | test: [0 1]
# Train: [0 1 4 5 6 7 8 9] | test: [2 3]
# Train: [0 1 2 3 6 7 8 9] | test: [4 5]
# Train: [0 1 2 3 4 5 8 9] | test: [6 7]
# Train: [0 1 2 3 4 5 6 7] | test: [8 9]
```

The cross-validation can then be performed easily:

```python
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]
# [0.963..., 0.922..., 0.963..., 0.963..., 0.930...]
```

The cross-validation score can be directly calculated using the cross_val_score helper. Given an estimator, the cross-validation object and the input dataset, the cross_val_score splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set for each iteration of cross-validation.

By default the estimator’s score method is used to compute the individual scores.

Refer to the metrics module to learn more about the available scoring methods.

```python
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
```

n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.

Alternatively, the scoring argument can be provided to specify an alternative scoring method.

```python
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')
array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
```

1) Cross-validation generators

| Method | Description |
| --- | --- |
| KFold(n_splits, shuffle, random_state) | Splits the data into K folds, trains on K-1 of them and then tests on the left-out fold. |
| StratifiedKFold(n_splits, shuffle, random_state) | Same as K-Fold but preserves the class distribution within each fold. |
| GroupKFold(n_splits) | Ensures that the same group is not in both testing and training sets. |
| ShuffleSplit(n_splits, test_size, train_size, random_state) | Generates train/test indices based on a random permutation. |
| StratifiedShuffleSplit | Same as ShuffleSplit but preserves the class distribution within each iteration. |
| GroupShuffleSplit | Ensures that the same group is not in both testing and training sets. |
| LeaveOneGroupOut() | Takes a group array to group observations. |
| LeavePGroupsOut(n_groups) | Leaves P groups out. |
| LeaveOneOut() | Leaves one observation out. |
| LeavePOut(p) | Leaves P observations out. |
| PredefinedSplit | Generates train/test indices based on predefined splits. |
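
As an illustration (a hedged sketch, not from the original tutorial), a stratified splitter keeps the class proportions in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))               # the features are irrelevant to the split itself
y = np.array([0] * 8 + [1] * 4)     # imbalanced labels: eight 0s, four 1s
for train, test in StratifiedKFold(n_splits=4).split(X, y):
    # every test fold keeps the 2:1 class ratio (two 0s, one 1)
    print("test indices:", test, "test labels:", y[test])
```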

:::info Exercise
On the digits dataset, plot the cross-validation score of an SVC estimator with a linear kernel as a function of parameter C (use a logarithmic grid of points, from 1 to 10).
Solution: Cross-validation on Digits Dataset Exercise :::

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt

X, y = datasets.load_digits(return_X_y=True)

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
```


3.3 Grid-search and cross-validated estimators

1) Grid-search

scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
clf.best_score_        # 0.925...
clf.best_estimator_.C  # 0.0077...

# Prediction performance on test set is not as good as on train set
clf.score(X_digits[1000:], y_digits[1000:])  # 0.943...
```

By default, the GridSearchCV uses a 5-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 5-fold.

:::info Nested cross-validation
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.

Warning:You cannot nest objects with parallel computing (n_jobs different than 1). :::

```python
>>> cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])
```

2) Cross-validated estimators

Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes cross-validation estimators that set their parameter automatically by cross-validation:

```python
from sklearn import linear_model, datasets

lasso = linear_model.LassoCV()
X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
lasso.fit(X_diabetes, y_diabetes)
# The estimator chose automatically its lambda:
lasso.alpha_  # 0.00375...
```

These estimators are called similarly to their counterparts, with CV appended to their name.
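
For example (a hedged sketch reusing `X_diabetes` / `y_diabetes` from the snippet above), RidgeCV plays the same role for ridge regression:

```python
import numpy as np
from sklearn import linear_model

ridge_cv = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
ridge_cv.fit(X_diabetes, y_diabetes)
print(ridge_cv.alpha_)   # regularization strength selected by cross-validation
```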

:::info Exercise
On the diabetes dataset, find the optimal regularization parameter alpha.
Bonus: How much can you trust the selection of alpha?
Solution: Cross-validation on diabetes Dataset Exercise :::

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)
tuned_parameters = [{'alpha': alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']

plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)
# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)
plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])

# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:", "how much can you trust the selection of alpha?")
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))

print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()
```


4、Unsupervised learning: seeking representations of the data

📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html

4.1 Clustering: grouping observations together

:::info The problem solved in clustering
Given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters. :::

1) K-means clustering

Note that there exist a lot of different clustering criteria and associated algorithms. The simplest clustering algorithm is K-means.

```python
from sklearn import cluster, datasets

X_iris, y_iris = datasets.load_iris(return_X_y=True)
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
print(k_means.labels_[::10])  # [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
print(y_iris[::10])           # [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
```

:::info Warning There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization, and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue.:::


:::info Application example: vector quantization
Clustering in general and KMeans, in particular, can be seen as a way of choosing a small number of exemplars to compress the information. The problem is sometimes known as vector quantization. For instance, this can be used to posterize an image: :::

```python
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import cluster

try:  # SciPy >= 0.16 have face in misc
    from scipy.misc import face
    face = face(gray=True)
except ImportError:
    face = sp.face(gray=True)

n_clusters = 5
np.random.seed(0)

X = face.reshape((-1, 1))  # We need an (n_sample, n_feature) array
k_means = cluster.KMeans(n_clusters=n_clusters, n_init=4)
k_means.fit(X)
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_

# create an array from labels and values
face_compressed = np.choose(labels, values)
face_compressed.shape = face.shape

vmin = face.min()
vmax = face.max()

# original face
plt.figure(1, figsize=(3, 2.2))
plt.imshow(face, cmap=plt.cm.gray, vmin=vmin, vmax=256)

# compressed face
plt.figure(2, figsize=(3, 2.2))
plt.imshow(face_compressed, cmap=plt.cm.gray, vmin=vmin, vmax=vmax)

# equal bins face
regular_values = np.linspace(0, 256, n_clusters + 1)
regular_labels = np.searchsorted(regular_values, face) - 1
regular_values = .5 * (regular_values[1:] + regular_values[:-1])  # mean
regular_face = np.choose(regular_labels.ravel(), regular_values, mode="clip")
regular_face.shape = face.shape
plt.figure(3, figsize=(3, 2.2))
plt.imshow(regular_face, cmap=plt.cm.gray, vmin=vmin, vmax=vmax)

# histogram
plt.figure(4, figsize=(3, 2.2))
plt.clf()
plt.axes([.01, .01, .98, .98])
plt.hist(X, bins=256, color='.5', edgecolor='.5')
plt.yticks(())
plt.xticks(regular_values)
values = np.sort(values)
for center_1, center_2 in zip(values[:-1], values[1:]):
    plt.axvline(.5 * (center_1 + center_2), color='b')

for center_1, center_2 in zip(regular_values[:-1], regular_values[1:]):
    plt.axvline(.5 * (center_1 + center_2), color='b', linestyle='--')

plt.show()
```


2) Hierarchical agglomerative clustering: Ward

A Hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the various approaches of this technique are either:

  • Agglomerative - bottom-up approaches: each observation starts in its own cluster, and clusters are iteratively merged in such a way as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest are made of only a few observations. When the number of clusters is large, it is much more computationally efficient than k-means.
  • Divisive - top-down approaches: all observations start in one cluster, which is iteratively split as one moves down the hierarchy. For estimating large numbers of clusters, this approach is both slow (due to all observations starting as one cluster, which it splits recursively) and statistically ill-posed.

3) Connectivity-constrained clustering

With agglomerative clustering, it is possible to specify which samples can be clustered together by giving a connectivity graph. Graphs in scikit-learn are represented by their adjacency matrix. Often, a sparse matrix is used. This can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when clustering an image.

```python
import time as time

import numpy as np
import matplotlib.pyplot as plt

import skimage
from skimage.data import coins
from skimage.transform import rescale

from scipy.ndimage.filters import gaussian_filter

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils.fixes import parse_version

# these were introduced in skimage-0.14
if parse_version(skimage.__version__) >= parse_version('0.14'):
    rescale_params = {'anti_aliasing': False, 'multichannel': False}
else:
    rescale_params = {}

# #############################################################################
# Generate data
orig_coins = coins()

# Resize it to 20% of the original size to speed up the processing
# Applying a Gaussian filter for smoothing prior to down-scaling
# reduces aliasing artifacts.
smoothened_coins = gaussian_filter(orig_coins, sigma=2)
rescaled_coins = rescale(smoothened_coins, 0.2, mode="reflect",
                         **rescale_params)

X = np.reshape(rescaled_coins, (-1, 1))

# #############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*rescaled_coins.shape)

# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 27  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, rescaled_coins.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)

# #############################################################################
# Plot the results on an image
plt.figure(figsize=(5, 5))
plt.imshow(rescaled_coins, cmap=plt.cm.gray)
for l in range(n_clusters):
    plt.contour(label == l,
                colors=[plt.cm.nipy_spectral(l / float(n_clusters)), ])
plt.xticks(())
plt.yticks(())
plt.show()
```

We need a vectorized version of the image. `'rescaled_coins'` is a down-scaled version of the coins image to speed up the process:

```python
X = np.reshape(rescaled_coins, (-1, 1))
```

Define the graph structure of the data. Pixels connected to their neighbors:

```python
from sklearn.feature_extraction.image import grid_to_graph
connectivity = grid_to_graph(*rescaled_coins.shape)
```

```python
n_clusters = 27  # number of regions
from sklearn.cluster import AgglomerativeClustering
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, rescaled_coins.shape)
```

4) Feature agglomeration

We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. an insufficient amount of observations compared to the number of features. Another approach is to merge together similar features: feature agglomeration. This approach can be implemented by clustering in the feature direction, in other words clustering the transposed data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)

plt.figure(1, figsize=(4, 3.5))
plt.clf()
plt.subplots_adjust(left=.01, right=.99, bottom=.01, top=.91)
for i in range(4):
    plt.subplot(3, 4, i + 1)
    plt.imshow(images[i], cmap=plt.cm.gray, vmax=16, interpolation='nearest')
    plt.xticks(())
    plt.yticks(())
    if i == 1:
        plt.title('Original data')
    plt.subplot(3, 4, 4 + i + 1)
    plt.imshow(images_restored[i], cmap=plt.cm.gray, vmax=16,
               interpolation='nearest')
    if i == 1:
        plt.title('Agglomerated data')
    plt.xticks(())
    plt.yticks(())

plt.subplot(3, 4, 10)
plt.imshow(np.reshape(agglo.labels_, images[0].shape),
           interpolation='nearest', cmap=plt.cm.nipy_spectral)
plt.xticks(())
plt.yticks(())
plt.title('Labels')
plt.show()
```

:::info **transform** and **inverse_transform** methods
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset. :::

5、Decompositions: from a signal to components and loadings

:::info Components and loadings
If X is our multivariate data, then the problem that we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose the components :::

5.1 Principal component analysis: PCA

Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.

When used to transform data, PCA can reduce the dimensionality of the data by projecting it on a principal subspace.

```python
# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]

from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)
# [2.18565811e+00 1.19346747e+00 8.43026679e-32]

# As we can see, only the 2 first components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape  # (100, 2)
```

5.2 Independent Component Analysis: ICA

Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn.decomposition import FastICA, PCA

# #############################################################################
# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)

s1 = np.sin(2 * time)                   # Signal 1 : sinusoidal signal
s2 = np.sign(np.sin(3 * time))          # Signal 2 : square signal
s3 = signal.sawtooth(2 * np.pi * time)  # Signal 3 : saw tooth signal

S = np.c_[s1, s2, s3]
S += 0.2 * np.random.normal(size=S.shape)  # Add noise
S /= S.std(axis=0)  # Standardize data

# Mix data
A = np.array([[1, 1, 1], [0.5, 2, 1.0], [1.5, 1.0, 2.0]])  # Mixing matrix
X = np.dot(S, A.T)  # Generate observations

# Compute ICA
ica = FastICA(n_components=3)
S_ = ica.fit_transform(X)  # Reconstruct signals
A_ = ica.mixing_           # Get estimated mixing matrix

# We can `prove` that the ICA model applies by reverting the unmixing.
assert np.allclose(X, np.dot(S_, A_.T) + ica.mean_)

# For comparison, compute PCA
pca = PCA(n_components=3)
H = pca.fit_transform(X)  # Reconstruct signals based on orthogonal components

# #############################################################################
# Plot results
plt.figure()

models = [X, S, S_, H]
names = ['Observations (mixed signal)',
         'True Sources',
         'ICA recovered signals',
         'PCA recovered signals']
colors = ['red', 'steelblue', 'orange']

for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(4, 1, ii)
    plt.title(name)
    for sig, color in zip(model.T, colors):
        plt.plot(sig, color=color)

plt.tight_layout()
plt.show()
```

6、Putting it all together

6.1 Pipelining

We have seen that some estimators can transform data and that some estimators can predict variables. We can also create combined estimators:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# Plot the PCA spectrum
pca.fit(X_digits)
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
```


6.2 Face recognition with eigenfaces

The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild”, also known as LFW: http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

  1. """
  2. ===================================================
  3. Faces recognition example using eigenfaces and SVMs
  4. ===================================================
  5. The dataset used in this example is a preprocessed excerpt of the
  6. "Labeled Faces in the Wild", aka LFW_:
  7. http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
  8. .. _LFW: http://vis-www.cs.umass.edu/lfw/
  9. Expected results for the top 5 most represented people in the dataset:
  10. ================== ============ ======= ========== =======
  11. precision recall f1-score support
  12. ================== ============ ======= ========== =======
  13. Ariel Sharon 0.67 0.92 0.77 13
  14. Colin Powell 0.75 0.78 0.76 60
  15. Donald Rumsfeld 0.78 0.67 0.72 27
  16. George W Bush 0.86 0.86 0.86 146
  17. Gerhard Schroeder 0.76 0.76 0.76 25
  18. Hugo Chavez 0.67 0.67 0.67 15
  19. Tony Blair 0.81 0.69 0.75 36
  20. avg / total 0.80 0.80 0.80 322
  21. ================== ============ ======= ========== =======
  22. """
  23. import logging
  24. import matplotlib.pyplot as plt
  25. from time import time
  26. from sklearn.model_selection import train_test_split
  27. from sklearn.model_selection import GridSearchCV
  28. from sklearn.datasets import fetch_lfw_people
  29. from sklearn.metrics import classification_report
  30. from sklearn.metrics import confusion_matrix
  31. from sklearn.decomposition import PCA
  32. from sklearn.svm import SVC
  33. # Display progress logs on stdout
  34. logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
  35. # #############################################################################
  36. # Download the data, if not already on disk and load it as numpy arrays
  37. lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
  38. # introspect the images arrays to find the shapes (for plotting)
  39. n_samples, h, w = lfw_people.images.shape
  40. # for machine learning we use the 2 data directly (as relative pixel
  41. # positions info is ignored by this model)
  42. X = lfw_people.data
  43. n_features = X.shape[1]
  44. # the label to predict is the id of the person
  45. y = lfw_people.target
  46. target_names = lfw_people.target_names
  47. n_classes = target_names.shape[0]
  48. print("Total dataset size:")
  49. print("n_samples: %d" % n_samples)
  50. print("n_features: %d" % n_features)
  51. print("n_classes: %d" % n_classes)
  52. # #############################################################################
  53. # Split into a training set and a test set using a stratified k fold
  54. # split into a training and testing set
  55. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
  56. # #############################################################################
  57. # Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
  58. # dataset): unsupervised feature extraction / dimensionality reduction
  59. n_components = 150
  60. print("Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0]))
  61. t0 = time()
  62. pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
  63. print("done in %0.3fs" % (time() - t0))
  64. eigenfaces = pca.components_.reshape((n_components, h, w))
  65. print("Projecting the input data on the eigenfaces orthonormal basis")
  66. t0 = time()
  67. X_train_pca = pca.transform(X_train)
  68. X_test_pca = pca.transform(X_test)
  69. print("done in %0.3fs" % (time() - t0))
  70. # #############################################################################
  71. # Train a SVM classification model
  72. print("Fitting the classifier to the training set")
  73. t0 = time()
  74. param_grid = {
  75. 'C': [1e3, 5e3, 1e4, 5e4, 1e5],
  76. 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
  77. }
  78. clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
  79. clf = clf.fit(X_train_pca, y_train)
  80. print("done in %0.3fs" % (time() - t0))
  81. print("Best estimator found by grid search:")
  82. print(clf.best_estimator_)
  83. # #############################################################################
  84. # Quantitative evaluation of the model quality on the test set
  85. print("Predicting people's names on the test set")
  86. t0 = time()
  87. y_pred = clf.predict(X_test_pca)
  88. print("done in %0.3fs" % (time() - t0))
  89. print(classification_report(y_test, y_pred, target_names=target_names))
  90. print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
  91. # #############################################################################
  92. # Qualitative evaluation of the predictions using matplotlib
  93. def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
  94. """Helper function to plot a gallery of portraits"""
  95. plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
  96. plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
  97. for i in range(n_row * n_col):
  98. plt.subplot(n_row, n_col, i + 1)
  99. plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
  100. plt.title(titles[i], size=12)
  101. plt.xticks(())
  102. plt.yticks(())
  103. # plot the result of the prediction on a portion of the test set
  104. def title(y_pred, y_test, target_names, i):
  105. pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
  106. true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
  107. return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
  108. prediction_titles = [title(y_pred, y_test, target_names, i)
  109. for i in range(y_pred.shape[0])]
  110. plot_gallery(X_test, prediction_titles, h, w)
  111. # plot the gallery of the most significative eigenfaces
  112. eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
  113. plot_gallery(eigenfaces, eigenface_titles, h, w)
  114. plt.show()