- 1、Statistical learning: the setting and the estimator object in scikit-learn
- 2、Supervised learning: predicting an output variable from high-dimensional observations
- 3、Model selection: choosing estimators and their parameters
- 4、Unsupervised learning: seeking representations of the data
- 5、Decompositions: from a signal to components and loadings
- 6、Putting it all together
🚀 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/index.html
1、Statistical learning: the setting and the estimator object in scikit-learn
📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html
1.1 Datasets
Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.
A simple example shipped with scikit-learn: iris dataset
It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
When the data is not initially in the (n_samples, n_features)
shape, it needs to be preprocessed in order to be used by scikit-learn.
An example of reshaping data is the digits dataset. The digits dataset is made of 1797 8x8 images of hand-written digits:
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import matplotlib.pyplot as plt
>>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage object at ...>
To use this dataset with scikit-learn, we transform each 8x8 image into a feature vector of length 64
>>> data = digits.images.reshape((digits.images.shape[0], -1))
1.2 Estimators objects
Fitting data: the main API implemented by scikit-learn is that of the estimator
. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.
All estimator objects expose a fit
method that takes a dataset (usually a 2-d array):
>>> estimator.fit(data)
Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:
>>> estimator.estimated_param_
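As a concrete illustration (a minimal sketch, not part of the original tutorial; LinearRegression is just an arbitrary example estimator):
```python
from sklearn.linear_model import LinearRegression

X = [[0], [1], [2]]   # 3 samples, 1 feature
y = [0, 1, 2]

estimator = LinearRegression(fit_intercept=True)  # parameter set at instantiation
estimator.fit(X, y)                               # learn from data
print(estimator.fit_intercept)   # constructor parameter -> True
print(estimator.coef_)           # estimated parameter (trailing underscore) -> [1.]
```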
2、Supervised learning: predicting an output variable from high-dimensional observations
📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html
:::info
The problem solved in supervised learning
Supervised learning consists in learning the link between two datasets: the observed data X
and an external variable y
that we are trying to predict, usually called “target” or “labels”. Most often, y
is a 1D array of length n_samples
.
All supervised estimators in scikit-learn implement a fit(X, y)
method to fit the model and a predict(X)
method that, given unlabeled observations X
, returns the predicted labels y
.
:::
:::info
Vocabulary: classification and regression
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.
When doing classification in scikit-learn, y
is a vector of integers or strings.
Note: See the Introduction to machine learning with scikit-learn tutorial for a quick run-through on the basic machine learning vocabulary used within scikit-learn.
:::
2.1 Nearest neighbor and the curse of dimensionality
Classifying irises:
The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris_X, iris_y = datasets.load_iris(return_X_y=True)
>>> np.unique(iris_y)
array([0, 1, 2])
1) k-Nearest neighbors classifier
The simplest possible classifier is the nearest neighbor: given a new observation X_test
, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector. (Please see the Nearest Neighbors section of the online Scikit-learn documentation for more information about this type of classifier.)
Training set and testing set: While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator, as this would not be evaluating the performance of the estimator on new data. This is why datasets are often split into train and test data.
KNN (k nearest neighbors) classification example:
# Split iris data in train and test data
# A random permutation, to split the data randomly
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
knn.predict(iris_X_test) # array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
iris_y_test # array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
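An equivalent split can also be obtained with the train_test_split helper (a sketch; the exact train/test indices will differ from the manual permutation above):
```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=10, random_state=0)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of test labels predicted correctly
```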
2) The curse of dimensionality
For an estimator to be effective, you need the distance between neighboring points to be less than some value $d$, which depends on the problem. In one dimension, this requires on average $n \sim 1/d$ points. In the context of the above $k$-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with $n$ training observations, then new data will be no further away than $1/n$. Therefore, the nearest neighbor decision rule will be efficient as soon as $1/n$ is small compared to the scale of between-class feature variations.
If the number of features is $p$, you now require $n \sim 1/d^p$ points. Let’s say that we require 10 points in one dimension: now $10^p$ points are required to pave the $[0, 1]$ space in $p$ dimensions. As $p$ becomes large, the number of training points required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective $k$-NN estimator in a paltry $p \sim 20$ dimensions would require more training data than the current estimated size of the entire internet (±1000 Exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
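A rough back-of-the-envelope check of the numbers above (a sketch assuming ~10 training points per dimension and 8 bytes per stored value):
```python
p = 20                    # number of features
points_needed = 10 ** p   # ~10 points per dimension to pave [0, 1]^p
bytes_needed = 8 * points_needed
print("%.0e points, about %.0f exabytes" % (points_needed, bytes_needed / 1e18))
# 1e+20 points, about 800 exabytes
```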
2.2 Linear model: from regression to sparsity
:::info
Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and an indication of disease progression after one year:
:::
>>> diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes_y[:-20]
>>> diabetes_y_test = diabetes_y[-20:]
The task at hand is to predict disease progression from physiological variables.
1) Linear regression
LinearRegression
, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.
Linear models: $y = X\beta + \epsilon$
- $X$: data
- $y$: target variable
- $\beta$: coefficients
- $\epsilon$: observation noise
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression()
>>> print(regr.coef_)
[ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937
492.81458798 102.84845219 184.60648906 743.51961675 76.09517222]
# The mean square error
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
2004.56760268...
# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
2) Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:
import matplotlib.pyplot as plt
X = np.c_[ .5, 1].T
y = [.5, 1]
test = np.c_[ 0, 2].T
regr = linear_model.LinearRegression()
plt.figure()
np.random.seed(0)
for _ in range(6):
this_X = .1 * np.random.normal(size=(2, 1)) + X
regr.fit(this_X, y)
plt.plot(test, regr.predict(test))
plt.scatter(this_X, y, s=3)
A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge
regression:
regr = linear_model.Ridge(alpha=.1)
plt.figure()
np.random.seed(0)
for _ in range(6):
this_X = .1 * np.random.normal(size=(2, 1)) + X
regr.fit(this_X, y)
plt.plot(test, regr.predict(test))
plt.scatter(this_X, y, s=3)
This is an example of bias/variance **tradeoff**: the larger the ridge alpha
parameter, the higher the bias and the lower the variance.
We can choose alpha
to minimize left out error, this time using the diabetes dataset rather than our synthetic data:
alphas = np.logspace(-4, -1, 6)
print([regr.set_params(alpha=alpha)
.fit(diabetes_X_train, diabetes_y_train)
.score(diabetes_X_test, diabetes_y_test)
for alpha in alphas])
# [0.5851110683883..., 0.5852073015444..., 0.5854677540698...,
# 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
Note: Capturing in the fitted parameters noise that prevents the model from generalizing to new data is called overfitting. The bias introduced by the ridge regression is called a regularization.
3) Sparsity
Note:
A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition on such a representation, but it may be useful to keep in mind that it would be a fairly empty space.
We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y
when considered with feature 1.
To improve the conditioning of the problem (i.e. mitigating the curse of dimensionality), it would be interesting to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam’s razor: prefer simpler models.
regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha)
.fit(diabetes_X_train, diabetes_y_train)
.score(diabetes_X_test, diabetes_y_test)
for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794)
>>> print(regr.coef_)
[ 0. -212.437... 517.194... 313.779... -160.830...
-0. -187.195... 69.382... 508.660... 71.842...]
:::info
Different algorithms for the same problem
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso
object in scikit-learn solves the lasso regression problem using a coordinate descent method, that is efficient on large datasets. However, scikit-learn also provides the LassoLars
object using the LARS algorithm, which is very efficient for problems in which the weight vector estimated is very sparse (i.e. problems with very few observations).
:::
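As a rough illustration (a sketch, not from the tutorial; the alpha value is arbitrary), LassoLars is used exactly like Lasso:
```python
from sklearn import linear_model

lars = linear_model.LassoLars(alpha=0.01)
lars.fit(diabetes_X_train, diabetes_y_train)
print(lars.coef_)  # like Lasso, some coefficients can be driven exactly to zero
```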
4) Classification
For classification, as in the labeling iris task, linear regression is not the right approach, as it will give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function or logistic function:
>>> log = linear_model.LogisticRegression(C=1e5)
>>> log.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0)
This is known as LogisticRegression
.
:::info
Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic for the final decision.
:::
:::info
Shrinkage and sparsity with logistic regression
The C
parameter controls the amount of regularization in the LogisticRegression
object: a large value for C
results in less regularization. penalty="l2"
gives Shrinkage (i.e. non-sparse coefficients), while penalty="l1"
gives Sparsity.
:::
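A sketch of this behaviour on the iris training data (not from the tutorial; 'saga' is assumed here as one solver that supports the l1 penalty):
```python
from sklearn.linear_model import LogisticRegression

l2_log = LogisticRegression(penalty='l2', C=1.0, max_iter=10000)
l1_log = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=10000)
l2_log.fit(iris_X_train, iris_y_train)
l1_log.fit(iris_X_train, iris_y_train)
print((l2_log.coef_ == 0).sum())  # l2: coefficients shrunk but rarely exactly zero
print((l1_log.coef_ == 0).sum())  # l1: some coefficients may be exactly zero
```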
:::info
Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test prediction performance on these observations. A solution can be downloaded here.
:::
from sklearn import datasets, neighbors, linear_model
X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_digits = X_digits / X_digits.max()
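One possible sketch of a solution (the downloadable solution may differ in detail): hold out the last 10% and compare the two classifiers.
```python
n_samples = len(X_digits)
split = int(0.9 * n_samples)
X_train, y_train = X_digits[:split], y_digits[:split]
X_test, y_test = X_digits[split:], y_digits[split:]

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=1000)
print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))
```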
2.3 Support vector machines (SVMs)
1) Linear SVMs
Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the C
parameter: a small value for C
means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value for C
means the margin is calculated on observations close to the separating line (less regularization).
:::info
Example:
SVMs can be used in regression (SVR, Support Vector Regression) or in classification (SVC, Support Vector Classification).

>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(kernel='linear')
:::
:::info
Warning: Normalizing data
For many estimators, including the SVMs, having datasets with unit standard deviation for each feature is important to get good predictions.
:::
2) Using kernels
Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by positioning kernels on observations:
i) Linear kernel
>>> svc = svm.SVC(kernel='linear')
ii) Polynomial kernel
svc = svm.SVC(kernel='poly', degree=3)
# degree: polynomial degree
iii) RBF kernel (Radial Basis Function)
svc = svm.SVC(kernel='rbf')
# gamma: inverse of size of
# radial kernel
:::info
Interactive example
See the SVM GUI to download svm_gui.py
; add data points of both classes with right and left button, fit the model and change parameters and data.
:::
:::info
Exercise
Try classifying classes 1 and 2 from the iris dataset with SVMs, using the first 2 features. Leave out 10% of each class and test prediction performance on these observations. A solution can be downloaded here.
Warning: the classes are ordered; do not leave out the last 10%, or you would be testing on only one class.
Hint: You can use the decision_function
method on a grid to get intuitions.
:::
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
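One possible sketch of a solution (assumptions: a random shuffle followed by a 90/10 split and a linear kernel; the downloadable solution may differ):
```python
import numpy as np
from sklearn import svm

np.random.seed(0)
order = np.random.permutation(len(X))
X, y = X[order], y[order]

n_train = int(0.9 * len(X))       # keep 90% for training, 10% for testing
clf = svm.SVC(kernel='linear')
clf.fit(X[:n_train], y[:n_train])
print(clf.score(X[n_train:], y[n_train:]))
```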
3、Model selection: choosing estimators and their parameters
📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
3.1 Score, and cross-validated scores
As we have seen, every estimator exposes a score
method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> X_digits, y_digits = datasets.load_digits(return_X_y=True)
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.98
To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data in folds that we use for training and testing:
import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
# We use 'list' to copy, in order to 'pop' later on
X_train = list(X_folds)
X_test = X_train.pop(k)
X_train = np.concatenate(X_train)
y_train = list(y_folds)
y_test = y_train.pop(k)
y_train = np.concatenate(y_train)
scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores) # [0.934..., 0.956..., 0.939...]
3.2 Cross-validation generators
Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.
They expose a split
method which accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen cross-validation strategy.
The following example shows the usage of the split
method.
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
print('Train: %s | test: %s' % (train_indices, test_indices))
# Train: [2 3 4 5 6 7 8 9] | test: [0 1]
# Train: [0 1 4 5 6 7 8 9] | test: [2 3]
# Train: [0 1 2 3 6 7 8 9] | test: [4 5]
# Train: [0 1 2 3 4 5 8 9] | test: [6 7]
# Train: [0 1 2 3 4 5 6 7] | test: [8 9]
The cross-validation can then be performed easily:
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
for train, test in k_fold.split(X_digits)]
# [0.963..., 0.922..., 0.963..., 0.963..., 0.930...]
The cross-validation score can be directly calculated using the cross_val_score
helper. Given an estimator, the cross-validation object and the input dataset, the cross_val_score
splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set for each iteration of cross-validation.
By default the estimator’s score
method is used to compute the individual scores.
Refer to the metrics module to learn more about the available scoring methods.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
n_jobs=-1
means that the computation will be dispatched on all the CPUs of the computer.
Alternatively, the scoring
argument can be provided to specify an alternative scoring method.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')
# array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
1) Cross-validation generators
| Methods | Description |
|---|---|
| KFold(n_splits, shuffle, random_state) | Splits the data into K folds, trains on K-1 and then tests on the left-out fold. |
| StratifiedKFold(n_splits, shuffle, random_state) | Same as KFold but preserves the class distribution within each fold. |
| GroupKFold(n_splits) | Ensures that the same group is not in both testing and training sets. |
| ShuffleSplit(n_splits, test_size, train_size, random_state) | Generates train/test indices based on a random permutation. |
| StratifiedShuffleSplit | Same as ShuffleSplit but preserves the class distribution within each iteration. |
| GroupShuffleSplit | Ensures that the same group is not in both testing and training sets. |
| LeaveOneGroupOut() | Takes a group array to group observations. |
| LeavePGroupsOut(n_groups) | Leaves P groups out. |
| LeaveOneOut() | Leaves one observation out. |
| LeavePOut(p) | Leaves P observations out. |
| PredefinedSplit | Generates train/test indices based on predefined splits. |
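A small sketch (not from the tutorial) of one of these generators: StratifiedKFold keeps both classes represented in every fold of a toy dataset:
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_toy = np.zeros((10, 1))
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
for train, test in StratifiedKFold(n_splits=2).split(X_toy, y_toy):
    print('Train: %s | test: %s' % (train, test))
# each test fold contains samples from both classes
```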
:::info
Exercise
On the digits dataset, plot the cross-validation score of an SVC
estimator with a linear kernel as a function of parameter C
(use a logarithmic grid of points, from 1 to 10).
Solution: Cross-validation on Digits Dataset Exercise
:::
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt
X, y = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores = list()
scores_std = list()
for C in C_s:
svc.C = C
this_scores = cross_val_score(svc, X, y, n_jobs=1)
scores.append(np.mean(this_scores))
scores_std.append(np.std(this_scores))
# Do the plotting
plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
3.3 Grid-search and cross-validated estimators
1) Grid-search
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
clf.best_score_ # 0.925...
clf.best_estimator_.C # 0.0077...
# Prediction performance on test set is not as good as on train set
clf.score(X_digits[1000:], y_digits[1000:]) # 0.943...
By default, the GridSearchCV
uses a 5-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 5-fold.
:::info
Nested cross-validation
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set the hyper-parameter (here C) and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.
Warning:You cannot nest objects with parallel computing (n_jobs
different than 1).
:::
>>> cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])
2) Cross-validated estimators
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes cross-validation estimators (see Cross-validation: evaluating estimator performance) that set their parameter automatically by cross-validation:
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
lasso.fit(X_diabetes, y_diabetes)
# The estimator chose automatically its lambda:
lasso.alpha_ # 0.00375...
These estimators are called similarly to their counterparts, with CV appended to their name.
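Another example of this naming pattern (a sketch; the alpha grid here is arbitrary):
```python
import numpy as np

ridge_cv = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
ridge_cv.fit(X_diabetes, y_diabetes)
print(ridge_cv.alpha_)  # the alpha selected by internal cross-validation
```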
:::info
Exercise
On the diabetes dataset, find the optimal regularization parameter alpha.
Bonus: How much can you trust the selection of alpha?
Solution: Cross-validation on diabetes Dataset Exercise
:::
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]
lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)
tuned_parameters = [{'alpha': alphas}]
n_folds = 5
clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)
# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)
plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)
print("Answer to the bonus question:", "how much can you trust the selection of alpha?")
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
lasso_cv.fit(X[train], y[train])
print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
plt.show()
4、Unsupervised learning: seeking representations of the data
📄 Original link: https://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html
4.1 Clustering: grouping observations together
:::info
The problem solved in clustering
Given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters.
:::
1) K-means clustering
Note that there exist many different clustering criteria and associated algorithms. The simplest clustering algorithm is K-means.
from sklearn import cluster, datasets
X_iris, y_iris = datasets.load_iris(return_X_y=True)
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
print(k_means.labels_[::10])
# [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
print(y_iris[::10])
# [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
:::info
Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue.
:::
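A sketch (not from the tutorial) of the usual mitigation: fixing random_state and running several initializations via n_init makes the result reproducible and less dependent on the starting centroids:
```python
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(X_iris)
print(k_means.inertia_)  # within-cluster sum of squares for the best of the runs
```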
:::info
Application example: vector quantization
Clustering in general, and KMeans in particular, can be seen as a way of choosing a small number of exemplars to compress the information. The problem is sometimes known as vector quantization. For instance, this can be used to posterize an image:
:::
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import cluster
try: # SciPy >= 0.16 has face in misc
from scipy.misc import face
face = face(gray=True)
except ImportError:
face = sp.face(gray=True)
n_clusters = 5
np.random.seed(0)
X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
k_means = cluster.KMeans(n_clusters=n_clusters, n_init=4)
k_means.fit(X)
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
# create an array from labels and values
face_compressed = np.choose(labels, values)
face_compressed.shape = face.shape
vmin = face.min()
vmax = face.max()
# original face
plt.figure(1, figsize=(3, 2.2))
plt.imshow(face, cmap=plt.cm.gray, vmin=vmin, vmax=256)
# compressed face
plt.figure(2, figsize=(3, 2.2))
plt.imshow(face_compressed, cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
# equal bins face
regular_values = np.linspace(0, 256, n_clusters + 1)
regular_labels = np.searchsorted(regular_values, face) - 1
regular_values = .5 * (regular_values[1:] + regular_values[:-1]) # mean
regular_face = np.choose(regular_labels.ravel(), regular_values, mode="clip")
regular_face.shape = face.shape
plt.figure(3, figsize=(3, 2.2))
plt.imshow(regular_face, cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
# histogram
plt.figure(4, figsize=(3, 2.2))
plt.clf()
plt.axes([.01, .01, .98, .98])
plt.hist(X, bins=256, color='.5', edgecolor='.5')
plt.yticks(())
plt.xticks(regular_values)
values = np.sort(values)
for center_1, center_2 in zip(values[:-1], values[1:]):
plt.axvline(.5 * (center_1 + center_2), color='b')
for center_1, center_2 in zip(regular_values[:-1], regular_values[1:]):
plt.axvline(.5 * (center_1 + center_2), color='b', linestyle='--')
plt.show()
2) Hierarchical agglomerative clustering: Ward
A hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the various approaches of this technique are either:
- Agglomerative - bottom-up approaches: each observation starts in its own cluster, and clusters are iteratively merged in such a way as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest are made of only a few observations. When the number of clusters is large, it is much more computationally efficient than k-means.
- Divisive - top-down approaches: all observations start in one cluster, which is iteratively split as one moves down the hierarchy. For estimating large numbers of clusters, this approach is both slow (due to all observations starting as one cluster, which it splits recursively) and statistically ill-posed.
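A minimal sketch (not from the tutorial) of bottom-up agglomerative clustering with Ward linkage on the iris data used earlier:
```python
from sklearn.cluster import AgglomerativeClustering

ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
ward.fit(X_iris)
print(ward.labels_[::10])  # cluster label of every 10th observation
```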
3) Connectivity-constrained clustering
With agglomerative clustering, it is possible to specify which samples can be clustered together by giving a connectivity graph. Graphs in scikit-learn are represented by their adjacency matrix. Often, a sparse matrix is used. This can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when clustering an image.
```python
import time as time

import numpy as np
import matplotlib.pyplot as plt

import skimage
from skimage.data import coins
from skimage.transform import rescale

from scipy.ndimage.filters import gaussian_filter

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils.fixes import parse_version

# these were introduced in skimage-0.14
if parse_version(skimage.__version__) >= parse_version('0.14'):
    rescale_params = {'anti_aliasing': False, 'multichannel': False}
else:
    rescale_params = {}

# #############################################################################
# Generate data
orig_coins = coins()

# Resize it to 20% of the original size to speed up the processing
# Applying a Gaussian filter for smoothing prior to down-scaling
# reduces aliasing artifacts.
smoothened_coins = gaussian_filter(orig_coins, sigma=2)
rescaled_coins = rescale(smoothened_coins, 0.2, mode="reflect",
                         **rescale_params)

X = np.reshape(rescaled_coins, (-1, 1))

# #############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*rescaled_coins.shape)

# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 27  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, rescaled_coins.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)

# #############################################################################
# Plot the results on an image
plt.figure(figsize=(5, 5))
plt.imshow(rescaled_coins, cmap=plt.cm.gray)
for l in range(n_clusters):
    plt.contour(label == l,
                colors=[plt.cm.nipy_spectral(l / float(n_clusters)), ])
plt.xticks(())
plt.yticks(())
plt.show()
```
We need a vectorized version of the image. `rescaled_coins` is a down-scaled version of the coins image to speed up the process:
```python
X = np.reshape(rescaled_coins, (-1, 1))
```
Define the graph structure of the data. Pixels connected to their neighbors:
```python
from sklearn.feature_extraction.image import grid_to_graph

connectivity = grid_to_graph(*rescaled_coins.shape)
```
Compute the clustering with Ward linkage and the connectivity constraint:
```python
from sklearn.cluster import AgglomerativeClustering

n_clusters = 27  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, rescaled_coins.shape)
```
4) Feature agglomeration
We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. an insufficient amount of observations compared to the number of features. Another approach is to merge together similar features: feature agglomeration. This approach can be implemented by clustering in the feature direction, in other words, clustering the transposed data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)
plt.figure(1, figsize=(4, 3.5))
plt.clf()
plt.subplots_adjust(left=.01, right=.99, bottom=.01, top=.91)
for i in range(4):
plt.subplot(3, 4, i + 1)
plt.imshow(images[i], cmap=plt.cm.gray, vmax=16, interpolation='nearest')
plt.xticks(())
plt.yticks(())
if i == 1:
plt.title('Original data')
plt.subplot(3, 4, 4 + i + 1)
plt.imshow(images_restored[i], cmap=plt.cm.gray, vmax=16,
interpolation='nearest')
if i == 1:
plt.title('Agglomerated data')
plt.xticks(())
plt.yticks(())
plt.subplot(3, 4, 10)
plt.imshow(np.reshape(agglo.labels_, images[0].shape),
interpolation='nearest', cmap=plt.cm.nipy_spectral)
plt.xticks(())
plt.yticks(())
plt.title('Labels')
plt.show()
:::info
**transform and inverse_transform methods**
Some estimators expose a transform
method, for instance to reduce the dimensionality of the dataset.
:::
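For the FeatureAgglomeration example above, a quick check of the shapes shows what transform and inverse_transform do:
```python
print(X.shape)           # (1797, 64): original pixel features
print(X_reduced.shape)   # (1797, 32): agglomerated features
print(X_restored.shape)  # (1797, 64): approximation back in the original space
```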
5、Decompositions: from a signal to components and loadings
:::info
Components and loadings
If X is our multivariate data, then the problem that we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose the components.
:::
5.1 Principal component analysis: PCA
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.
The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.
# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)
# [ 2.18565811e+00 1.19346747e+00 8.43026679e-32]
# As we can see, only the 2 first components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape # (100, 2)
5.2 Independent Component Analysis: ICA
Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn.decomposition import FastICA, PCA
# #############################################################################
# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)
s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal
s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal
s3 = signal.sawtooth(2 * np.pi * time) # Signal 3: saw tooth signal
S = np.c_[s1, s2, s3]
S += 0.2 * np.random.normal(size=S.shape) # Add noise
S /= S.std(axis=0) # Standardize data
# Mix data
A = np.array([[1, 1, 1], [0.5, 2, 1.0], [1.5, 1.0, 2.0]]) # Mixing matrix
X = np.dot(S, A.T) # Generate observations
# Compute ICA
ica = FastICA(n_components=3)
S_ = ica.fit_transform(X) # Reconstruct signals
A_ = ica.mixing_ # Get estimated mixing matrix
# We can `prove` that the ICA model applies by reverting the unmixing.
assert np.allclose(X, np.dot(S_, A_.T) + ica.mean_)
# For comparison, compute PCA
pca = PCA(n_components=3)
H = pca.fit_transform(X) # Reconstruct signals based on orthogonal components
# #############################################################################
# Plot results
plt.figure()
models = [X, S, S_, H]
names = ['Observations (mixed signal)',
'True Sources',
'ICA recovered signals',
'PCA recovered signals']
colors = ['red', 'steelblue', 'orange']
for ii, (model, name) in enumerate(zip(models, names), 1):
plt.subplot(4, 1, ii)
plt.title(name)
for sig, color in zip(model.T, colors):
plt.plot(sig, color=color)
plt.tight_layout()
plt.show()
6、Putting it all together
6.1 Pipelining
We have seen that some estimators can transform data and that some estimators can predict variables. We can also create combined estimators:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
X_digits, y_digits = datasets.load_digits(return_X_y=True)
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
'pca__n_components': [5, 15, 30, 45, 64],
'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
# Plot the PCA spectrum
pca.fit(X_digits)
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(np.arange(1, pca.n_components_ + 1),
pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
6.2 Face recognition with eigenfaces
The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild”, also known as LFW: http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
Expected results for the top 5 most represented people in the dataset:
================== ============ ======= ========== =======
precision recall f1-score support
================== ============ ======= ========== =======
Ariel Sharon 0.67 0.92 0.77 13
Colin Powell 0.75 0.78 0.76 60
Donald Rumsfeld 0.78 0.67 0.72 27
George W Bush 0.86 0.86 0.86 146
Gerhard Schroeder 0.76 0.76 0.76 25
Hugo Chavez 0.67 0.67 0.67 15
Tony Blair 0.81 0.69 0.75 36
avg / total 0.80 0.80 0.80 322
================== ============ ======= ========== =======
"""
import logging
import matplotlib.pyplot as plt
from time import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
# #############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
# for machine learning we use the 2D data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
# #############################################################################
# Split into a training set and a test set using a stratified k fold
# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# #############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print("Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0]))
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))
eigenfaces = pca.components_.reshape((n_components, h, w))
print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))
# #############################################################################
# Train a SVM classification model
print("Fitting the classifier to the training set")
t0 = time()
param_grid = {
'C': [1e3, 5e3, 1e4, 5e4, 1e5],
'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)
# #############################################################################
# Quantitative evaluation of the model quality on the test set
print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)
print("done in %0.3fs" % (time() - t0))
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
# #############################################################################
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
"""Helper function to plot a gallery of portraits"""
plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
plt.subplot(n_row, n_col, i + 1)
plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
plt.title(titles[i], size=12)
plt.xticks(())
plt.yticks(())
# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, target_names, i)
for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()