Chapter 1 Introduction

Basic Concepts and Terminology

  • Dataset, sample, attribute, attribute value, sample space, feature vector
  • Training, hypothesis, ground truth
  • Classification, regression, clustering
  • Supervised learning (classification and regression), unsupervised learning (clustering)
  • Generalization ability, independent and identically distributed (i.i.d.)
  • Induction (generalization), deduction (specialization)

    Hypothesis Space

    Computing the size of the hypothesis space

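The size calculation can be sketched in code. This is a minimal sketch assuming the book's watermelon toy problem, where three attributes each take 3 concrete values, any attribute may also be the wildcard '*', and one extra "empty" hypothesis matches no melon at all:

```python
# Hypothesis space size for the watermelon toy problem (an assumption
# based on the book's example): each attribute can take one of its
# concrete values or the wildcard '*', plus one "empty set" hypothesis.
attribute_value_counts = [3, 3, 3]

size = 1  # the single empty hypothesis
product = 1
for n in attribute_value_counts:
    product *= n + 1  # concrete values, plus the wildcard '*'
size += product

print(size)  # (3+1) * (3+1) * (3+1) + 1 = 65
```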

    Version Space

  • The set of hypotheses consistent with the training set


  • Example 1

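The idea of "all hypotheses consistent with the training set" can be computed directly by enumeration. A hypothetical toy sketch (attribute values are my own, loosely modeled on the book's watermelon data): a hypothesis is a tuple of attribute values where '*' is a wildcard, and it belongs to the version space if it labels every training sample correctly.

```python
from itertools import product

# Tiny hypothetical training set: (attribute tuple, is-positive label)
samples = [
    (('green', 'curled'), True),
    (('black', 'curled'), True),
    (('white', 'stiff'), False),
]
values_color = ['green', 'black', 'white', '*']
values_root = ['curled', 'stiff', '*']

def matches(h, x):
    # A hypothesis matches a sample if every attribute is either the
    # wildcard or equal to the sample's value.
    return all(hv == '*' or hv == xv for hv, xv in zip(h, x))

# The version space: hypotheses that classify every sample correctly
version_space = [h for h in product(values_color, values_root)
                 if all(matches(h, x) == label for x, label in samples)]
print(version_space)
```

Here only one hypothesis survives: anything with a curled root is positive, regardless of color.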

    Chapter 2 Model Evaluation and Selection

    Empirical Error and Overfitting

  • Error rate: E = a/m, where m is the total number of samples and a is the number of misclassified samples

  • Accuracy: acc = 1 − a/m = 1 − E
  • Empirical error (training error) vs. generalization error
  • Overfitting vs. underfitting


Evaluation Methods

Hold-out

  • Concise single-split hold-out (sklearn)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
# Get features and labels
X, y = iris.data, iris.target
# Randomly split into train and test sets at a 4:1 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```
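The book also recommends stratified sampling (分层采样) for hold-out, so that class proportions in the splits match the full dataset; `train_test_split` supports this through its `stratify` parameter. A quick sketch on iris:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y keeps the class ratio of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Iris has 50 samples per class, so a stratified 20% test split
# holds exactly 10 samples of each of the 3 classes.
print(Counter(y_test))
```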
  • Manual single-split hold-out implementation

```python
import numpy as np

def leave_out(X, y, test_size):
    # Indices of all samples
    data_index = [idx for idx in range(len(X))]
    # Shuffle the index order
    np.random.shuffle(data_index)
    # Position of the train/test boundary
    split_index = int((1 - test_size) * len(X))
    # Index lists for the training and test sets
    train_data_index = data_index[:split_index]
    test_data_index = data_index[split_index:]
    X_train1, X_test1 = X[train_data_index], X[test_data_index]
    y_train1, y_test1 = y[train_data_index], y[test_data_index]
    return X_train1, X_test1, y_train1, y_test1
```

    Cross-Validation

  • Schematic of 10-fold cross-validation


  • Leave-One-Out (LOO)

Pros: the estimate is not affected by how the samples happen to be split, and the model actually evaluated is very close to the model that would be trained on the full dataset, so the evaluation is fairly accurate.
Cons: the computational cost becomes prohibitive when the dataset is large.
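LOO is just m-fold cross-validation with m equal to the number of samples, which sklearn exposes directly as `LeaveOneOut`. A minimal sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()
# One split per sample: 150 splits for the 150 iris samples,
# each holding out exactly one sample for testing.
n_splits = loo.get_n_splits(X)
print(n_splits)
```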

  • Concise 10-fold cross-validation (sklearn)

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X, y):
    print('The shape of train_index %s, the shape of test_index %s'
          % (train_index.shape, test_index.shape))
```
  • Manual 10-fold cross-validation implementation

```python
import numpy as np

def k_fold_split(X, y, n_splits, shuffle=True):
    n_splits = int(n_splits)
    # Index array over all samples
    data_index = np.arange(len(X))
    # Shuffle the order
    if shuffle:
        np.random.shuffle(data_index)
    # Size of each fold: len(X) // n_splits, with the remainder
    # spread one-per-fold over the first folds
    fold_sizes = np.full(n_splits, len(X) // n_splits)
    fold_sizes[: len(X) % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        # Index range of the test fold
        test_start, test_end = current, current + fold_size
        train_index = data_index[test_end:]
        if test_start != 0:
            train_index = np.concatenate((data_index[:test_start], train_index))
        test_index = data_index[test_start:test_end]
        yield (train_index, test_index)
        current = test_end

for train_index1, test_index1 in k_fold_split(X, y, 10):
    print('The shape of train_index1 %s, the shape of test_index1 %s'
          % (train_index1.shape, test_index1.shape))
```

Bootstrapping

  • Schematic

![](https://cdn.nlark.com/yuque/0/2021/jpeg/758052/1626361563150-4b7f5fc4-3042-49e9-a138-14e81d54a33b.jpeg)

  • When the number of samples m is large enough, the probability that a given sample is never picked in m draws approaches

    p = lim_{m→+∞} (1 − 1/m)^m = e^{−1} = 1/e ≈ 0.368
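The limit is easy to check numerically: (1 − 1/m)^m converges to 1/e quickly as m grows.

```python
import math

# Probability that one fixed sample is never drawn in m draws
# with replacement: (1 - 1/m)^m, which tends to 1/e ≈ 0.368.
for m in (10, 100, 10000):
    print(m, (1 - 1 / m) ** m)
print('1/e =', 1 / math.e)
```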
  • Bootstrap implementation with pandas

```python
import pandas as pd

df_X = pd.DataFrame(X, columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'])
df_y = pd.DataFrame(y, columns=['Species'])
# Concatenate the X and y DataFrames column-wise
df = pd.concat([df_X, df_y], axis=1)
# Sample len(df) rows with replacement
train = df.sample(frac=1.0, replace=True)
# Rows never sampled form the test set
test = df.loc[df.index.difference(train.index)].copy()
# Convert back to ndarrays
X_train2, y_train2 = train.iloc[:, [0, 1, 2, 3]].values, train.iloc[:, 4].values
X_test2, y_test2 = test.iloc[:, [0, 1, 2, 3]].values, test.iloc[:, 4].values
X_train2.shape, X_test2.shape
```
  • Manual bootstrap implementation

```python
import numpy as np

# Draw len(X) indices with replacement for the training set
train_index = np.random.randint(len(X), size=len(X))
# Mark which samples were drawn at least once
bool_index = np.zeros(len(X), dtype=bool)
bool_index[train_index] = True
# Samples never drawn form the test set
test_index = np.argwhere(~bool_index).reshape(-1)
X_train3, y_train3, X_test3, y_test3 = X[train_index], y[train_index], X[test_index], y[test_index]
X_train3.shape, X_test3.shape
```

    Performance Measures

    Error Rate and Accuracy

  • Error rate formula

    E(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) ≠ y_i), where I(·) is the indicator function

  • Accuracy formula

    acc(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) = y_i) = 1 − E(f; D)

  • Concise accuracy computation (sklearn)

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Random predictions over 3 classes, just for demonstration
y_pred = np.random.randint(0, 3, size=y.shape)
print(accuracy_score(y, y_pred))
```
  • Manual accuracy implementation

```python
acc_cnt = 0
for i in range(len(y)):
    if y[i] == y_pred[i]:
        acc_cnt += 1
print(acc_cnt / len(y))
```

    Precision, Recall and F1

  • Confusion matrix for binary classification

Tip: in TP, the P means the prediction is positive and the T means the prediction is true (correct). Likewise, TN means the prediction is negative and the prediction is correct.
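sklearn computes this matrix directly via `confusion_matrix`; note its layout puts true labels on rows and predicted labels on columns, so for labels [0, 1] the matrix reads [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small hypothetical label vectors for illustration
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
# ravel() flattens [[TN, FP], [FN, TP]] in that order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```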

  • Precision P and recall R

    P = TP / (TP + FP),   R = TP / (TP + FN)

  • F1 score

    F1 = 2 × P × R / (P + R)

  • Precision and recall for multi-class problems

"Macro" precision, recall and F1 (compute the per-class P_i and R_i, then average over the n classes):
    macro-P = (1/n) Σ_i P_i,  macro-R = (1/n) Σ_i R_i,  macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R)
"Micro" precision, recall and F1 (first average the counts TP, FP, TN and FN across classes, then compute):
    micro-P = avg(TP) / (avg(TP) + avg(FP)),  micro-R = avg(TP) / (avg(TP) + avg(FN)),  micro-F1 = 2 × micro-P × micro-R / (micro-P + micro-R)
The difference between macro and micro: macro computes the metric for each class and then averages, while micro first averages TP, FP, TN and FN and then computes the metric.
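A hypothetical binary example where the two averages actually differ: class 0 is predicted perfectly (P_0 = 1) while class 1 has precision 0.5, so macro (averaging per-class precisions) and micro (pooling all TP/FP counts first) give different numbers.

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# macro: mean of per-class precisions (1.0 + 0.5) / 2 = 0.75
macro = precision_score(y_true, y_pred, average='macro')
# micro: pooled TP / (TP + FP) = 4/6 ≈ 0.667
micro = precision_score(y_true, y_pred, average='micro')
print(macro, micro)
```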

  • Concise precision and recall (sklearn)

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print('macro precision:', precision_score(y, y_pred, average='macro'))
print('macro recall:', recall_score(y, y_pred, average='macro'))
print('macro F1 score:', f1_score(y, y_pred, average='macro'))
```
  • Manual precision and recall (three classes)

```python
def get_score(y, y_pred, label):
    # One-vs-rest counts for the given class label
    TP, TN, FP, FN = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == label and y_pred[i] == label:
            TP += 1
        elif y[i] == label and y_pred[i] != label:
            FN += 1
        elif y[i] != label and y_pred[i] == label:
            FP += 1
        else:
            TN += 1
    return [TP, TN, FP, FN]

score_0 = get_score(y, y_pred, 0)
p_0 = score_0[0] / (score_0[0] + score_0[2])
r_0 = score_0[0] / (score_0[0] + score_0[3])
f1_0 = 2 * p_0 * r_0 / (p_0 + r_0)
score_1 = get_score(y, y_pred, 1)
p_1 = score_1[0] / (score_1[0] + score_1[2])
r_1 = score_1[0] / (score_1[0] + score_1[3])
f1_1 = 2 * p_1 * r_1 / (p_1 + r_1)
score_2 = get_score(y, y_pred, 2)
p_2 = score_2[0] / (score_2[0] + score_2[2])
r_2 = score_2[0] / (score_2[0] + score_2[3])
f1_2 = 2 * p_2 * r_2 / (p_2 + r_2)
macro_p = (p_0 + p_1 + p_2) / 3
macro_r = (r_0 + r_1 + r_2) / 3
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
macro_f1_1 = (f1_0 + f1_1 + f1_2) / 3
print('macro precision score:', macro_p)
print('macro recall score:', macro_r)
print('macro f1 score:', macro_f1)
print('macro f1 score (mean of per-class F1):', macro_f1_1)
```

Notes: there are two ways to compute macro-F1 (and micro-F1) — the formula in the book, and averaging the per-class F1 scores. See https://arxiv.org/pdf/1911.03347.pdf for the difference between them; computing all per-class F1 scores and averaging them (the approach sklearn uses) is the recommended one.
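The two macro-F1 definitions really do disagree in general. A tiny sketch with hypothetical per-class scores (values chosen only to make the gap visible):

```python
p = [1.0, 0.5]  # assumed per-class precisions
r = [0.5, 0.5]  # assumed per-class recalls

# Per-class F1 scores
f1 = [2 * pi * ri / (pi + ri) for pi, ri in zip(p, r)]

# (a) The book's definition: harmonic mean of macro-P and macro-R
macro_p, macro_r = sum(p) / 2, sum(r) / 2
book_macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
# (b) sklearn's definition: arithmetic mean of per-class F1 scores
mean_macro_f1 = sum(f1) / 2

print(book_macro_f1, mean_macro_f1)  # 0.6 vs 7/12 ≈ 0.583
```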