Sklearn Video Tutorial - train_test_split Dataset Splitting - Machine Learning

Function: train_test_split
Package: sklearn.model_selection
Purpose: split data into a training set and a test set
Signature: train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

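arrays: the feature data and the label data (arrays, lists, DataFrames, and similar types). All inputs must have the same length.
test_size / train_size: the size of the test set / training set. A float is read as a proportion of the data, an integer as an absolute number of samples. If neither is given, test_size defaults to 0.25.
random_state: the random seed (an integer) that controls the shuffling. For the same dataset, the same random_state always produces the same split.
shuffle: whether to shuffle the data before splitting. Defaults to True.
stratify: None, or array/Series data. When given, the split is stratified on this column, so the training and test sets keep its class proportions. Stratifying requires shuffle=True.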
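The parameters above can be tried out on a small toy dataset. The sketch below is only an illustration: the arrays X and y and the chosen sizes are assumptions made for the demo, not part of the original tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples, 2 features each.
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Float test_size: a proportion of the samples (0.2 of 10 -> 2 test samples).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
frac_sizes = (len(X_tr), len(X_te))

# Integer test_size: an absolute number of samples.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=4, random_state=0)
int_sizes = (len(X_tr2), len(X_te2))

# Same data and same random_state: the split is reproduced exactly.
_, _, _, y_te_again = train_test_split(X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(y_te, y_te_again)

# shuffle=False: nothing is shuffled, so the test set is the tail of the data.
_, _, _, y_te_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)

print(frac_sizes)     # (8, 2)
print(int_sizes)      # (6, 4)
print(same_split)     # True
print(y_te_ordered)   # [8 9]
```

With shuffle=False the split depends only on the order of the input, so random_state has no effect in that case.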

An example:

Feature data: data
   a  b  c
0  1  2  3
1  1  3  6
2  2  3  8
3  1  5  7
4  2  4  8
5  2  3  6
6  1  4  8
7  2  3  6
Label data: label
[2,3,5,6,8,0,2,3]
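# Split, stratifying on column a
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(data, label, test_size=0.2, stratify=data['a'], random_state=1)
Training features: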
   a  b  c
0  1  2  3
2  2  3  8
3  1  5  7
5  2  3  6
6  1  4  8
4  2  4  8
Test features:
   a  b  c
1  1  3  6
7  2  3  6
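The training and test sets are stratified on column a, and because random_state is fixed, the split is the same no matter how many times the statement above is rerun.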
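The stratified example can be reproduced end to end with a sketch like the following. It rebuilds data as a pandas DataFrame (an assumption, since the original only shows the printed table) and checks the class balance of column a rather than the exact row order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the feature table and labels shown in the example.
data = pd.DataFrame({
    "a": [1, 1, 2, 1, 2, 2, 1, 2],
    "b": [2, 3, 3, 5, 4, 3, 4, 3],
    "c": [3, 6, 8, 7, 8, 6, 8, 6],
})
label = [2, 3, 5, 6, 8, 0, 2, 3]

xtrain, xtest, ytrain, ytest = train_test_split(
    data, label, test_size=0.2, stratify=data["a"], random_state=1
)

# Column a holds four 1s and four 2s, so a stratified split keeps
# the 1:1 ratio in both parts: 3 + 3 for training, 1 + 1 for testing.
train_counts = xtrain["a"].value_counts().to_dict()
test_counts = xtest["a"].value_counts().to_dict()

print(train_counts)
print(test_counts)
```

Without stratify, a test set of two rows could easily contain two rows with the same value of a, which is what stratified sampling is meant to prevent.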

from __future__ import print_function  # only needed on Python 2
import numpy as np
from sklearn.model_selection import train_test_split

# 5 samples with 2 features each, and 5 labels
X, y = np.arange(10).reshape((5, 2)), range(5)
print(X)
#  [[0 1]
#   [2 3]
#   [4 5]
#   [6 7]
#   [8 9]]
print(y)
#  range(0, 5) on Python 3; [0, 1, 2, 3, 4] on Python 2

# Hold out 40% of the samples (2 of 5) as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=22)
print(X_train)
#  [[6 7]
#   [0 1]
#   [8 9]]
print(y_train)
#  [3, 0, 4]
print(X_test)
#  [[2 3]
#   [4 5]]
print(y_test)
#  [1, 2]
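A property worth checking in the listing above is that every array passed to train_test_split is split with the same indices, so features and labels stay aligned. The sketch below is a small sanity check under the same settings; the helper variables sizes, aligned, and complete are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=22
)

# 0.4 * 5 = 2 test samples; the remaining 3 go to the training set.
sizes = (len(X_train), len(X_test))

# Here each label equals its row index, so indexing X with the returned
# labels must give back exactly the returned feature rows.
aligned = (np.array_equal(X[y_train], X_train)
           and np.array_equal(X[y_test], X_test))

# Every sample lands in exactly one of the two sets.
complete = sorted(y_train + y_test) == [0, 1, 2, 3, 4]

print(sizes, aligned, complete)  # (3, 2) True True
```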