Chapter 1 Introduction

Basic Concepts and Terminology

  • Dataset, sample, attribute, attribute value, sample space, feature vector
  • Training, hypothesis, ground truth
  • Classification, regression, clustering
  • Supervised learning (classification and regression), unsupervised learning (clustering)
  • Generalization ability, independent and identically distributed (i.i.d.)
  • Induction (generalization), deduction (specialization)

    Hypothesis Space

    Computing the size of the hypothesis space

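The size calculation can be sketched in code. This is a minimal sketch assuming the book's watermelon toy problem, where three attributes each take 3 concrete values, any attribute may also be the wildcard '*', and one extra "empty" hypothesis matches no melon at all:

```python
# Hypothesis space size for the watermelon toy problem (an assumption
# based on the book's example): each attribute can take one of its
# concrete values or the wildcard '*', plus one "empty set" hypothesis.
attribute_value_counts = [3, 3, 3]

size = 1  # the single empty hypothesis
product = 1
for n in attribute_value_counts:
    product *= n + 1  # concrete values, plus the wildcard '*'
size += product

print(size)  # (3+1) * (3+1) * (3+1) + 1 = 65
```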

    Version Space

  • The set of hypotheses consistent with the training set


  • Example 1

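The idea of "all hypotheses consistent with the training set" can be computed directly by enumeration. A hypothetical toy sketch (attribute values are my own, loosely modeled on the book's watermelon data): a hypothesis is a tuple of attribute values where '*' is a wildcard, and it belongs to the version space if it labels every training sample correctly.

```python
from itertools import product

# Tiny hypothetical training set: (attribute tuple, is-positive label)
samples = [
    (('green', 'curled'), True),
    (('black', 'curled'), True),
    (('white', 'stiff'), False),
]
values_color = ['green', 'black', 'white', '*']
values_root = ['curled', 'stiff', '*']

def matches(h, x):
    # A hypothesis matches a sample if every attribute is either the
    # wildcard or equal to the sample's value.
    return all(hv == '*' or hv == xv for hv, xv in zip(h, x))

# The version space: hypotheses that classify every sample correctly
version_space = [h for h in product(values_color, values_root)
                 if all(matches(h, x) == label for x, label in samples)]
print(version_space)
```

Here only one hypothesis survives: anything with a curled root is positive, regardless of color.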

    Chapter 2 Model Evaluation and Selection

    Empirical Error and Overfitting

  • Error rate: E = a/m, where m is the total number of samples and a is the number of misclassified samples

  • Accuracy: acc = 1 − a/m = 1 − E
  • Empirical error (training error) vs. generalization error
  • Overfitting vs. underfitting


Evaluation Methods

Hold-out

  • Concise single-split hold-out (sklearn)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
# Get features and labels
X, y = iris.data, iris.target
# Randomly split into train and test sets at a 4:1 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```
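The book also recommends stratified sampling (分层采样) for hold-out, so that class proportions in the splits match the full dataset; `train_test_split` supports this through its `stratify` parameter. A quick sketch on iris:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y keeps the class ratio of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Iris has 50 samples per class, so a stratified 20% test split
# holds exactly 10 samples of each of the 3 classes.
print(Counter(y_test))
```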
  • Manual single-split hold-out implementation

```python
import numpy as np

def leave_out(X, y, test_size):
    # Indices of all samples
    data_index = [idx for idx in range(len(X))]
    # Shuffle the index order
    np.random.shuffle(data_index)
    # Position of the train/test boundary
    split_index = int((1 - test_size) * len(X))
    # Index lists for the training and test sets
    train_data_index = data_index[:split_index]
    test_data_index = data_index[split_index:]
    X_train1, X_test1 = X[train_data_index], X[test_data_index]
    y_train1, y_test1 = y[train_data_index], y[test_data_index]
    return X_train1, X_test1, y_train1, y_test1
```

    Cross-Validation

  • Schematic of 10-fold cross-validation


  • Leave-One-Out (LOO)

Pros: the estimate is not affected by how the samples happen to be split, and the model actually evaluated is very close to the model that would be trained on the full dataset, so the evaluation is fairly accurate.
Cons: the computational cost becomes prohibitive when the dataset is large.
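LOO is just m-fold cross-validation with m equal to the number of samples, which sklearn exposes directly as `LeaveOneOut`. A minimal sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()
# One split per sample: 150 splits for the 150 iris samples,
# each holding out exactly one sample for testing.
n_splits = loo.get_n_splits(X)
print(n_splits)
```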

  • Concise 10-fold cross-validation (sklearn)

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X, y):
    print('The shape of train_index %s, the shape of test_index %s'
          % (train_index.shape, test_index.shape))
```
  • Manual 10-fold cross-validation implementation

```python
import numpy as np

def k_fold_split(X, y, n_splits, shuffle=True):
    n_splits = int(n_splits)
    # Index array over all samples
    data_index = np.arange(len(X))
    # Shuffle the order
    if shuffle:
        np.random.shuffle(data_index)
    # Size of each fold: len(X) // n_splits, with the remainder
    # spread one-per-fold over the first folds
    fold_sizes = np.full(n_splits, len(X) // n_splits)
    fold_sizes[: len(X) % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        # Index range of the test fold
        test_start, test_end = current, current + fold_size
        train_index = data_index[test_end:]
        if test_start != 0:
            train_index = np.concatenate((data_index[:test_start], train_index))
        test_index = data_index[test_start:test_end]
        yield (train_index, test_index)
        current = test_end

for train_index1, test_index1 in k_fold_split(X, y, 10):
    print('The shape of train_index1 %s, the shape of test_index1 %s'
          % (train_index1.shape, test_index1.shape))
```

Bootstrapping

  • Schematic

![](https://cdn.nlark.com/yuque/0/2021/jpeg/758052/1626361563150-4b7f5fc4-3042-49e9-a138-14e81d54a33b.jpeg)

  • When the number of samples m is large enough, the probability that a given sample is never picked in m draws approaches

    p = lim_{m→+∞} (1 − 1/m)^m = e^{−1} = 1/e ≈ 0.368
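The limit is easy to check numerically: (1 − 1/m)^m converges to 1/e quickly as m grows.

```python
import math

# Probability that one fixed sample is never drawn in m draws
# with replacement: (1 - 1/m)^m, which tends to 1/e ≈ 0.368.
for m in (10, 100, 10000):
    print(m, (1 - 1 / m) ** m)
print('1/e =', 1 / math.e)
```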
  • Bootstrap implementation with pandas

```python
import pandas as pd

df_X = pd.DataFrame(X, columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'])
df_y = pd.DataFrame(y, columns=['Species'])
# Concatenate the X and y DataFrames column-wise
df = pd.concat([df_X, df_y], axis=1)
# Sample len(df) rows with replacement
train = df.sample(frac=1.0, replace=True)
# Rows never sampled form the test set
test = df.loc[df.index.difference(train.index)].copy()
# Convert back to ndarrays
X_train2, y_train2 = train.iloc[:, [0, 1, 2, 3]].values, train.iloc[:, 4].values
X_test2, y_test2 = test.iloc[:, [0, 1, 2, 3]].values, test.iloc[:, 4].values
X_train2.shape, X_test2.shape
```
  • Manual bootstrap implementation

```python
import numpy as np

# Draw len(X) indices with replacement for the training set
train_index = np.random.randint(len(X), size=len(X))
# Mark which samples were drawn at least once
bool_index = np.zeros(len(X), dtype=bool)
bool_index[train_index] = True
# Samples never drawn form the test set
test_index = np.argwhere(~bool_index).reshape(-1)
X_train3, y_train3, X_test3, y_test3 = X[train_index], y[train_index], X[test_index], y[test_index]
X_train3.shape, X_test3.shape
```

    Performance Measures

    Error Rate and Accuracy

  • Error rate formula

    E(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) ≠ y_i), where I(·) is the indicator function

  • Accuracy formula

    acc(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) = y_i) = 1 − E(f; D)

  • Concise accuracy computation (sklearn)

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Random predictions over 3 classes, just for demonstration
y_pred = np.random.randint(0, 3, size=y.shape)
print(accuracy_score(y, y_pred))
```
  • Manual accuracy implementation

```python
acc_cnt = 0
for i in range(len(y)):
    if y[i] == y_pred[i]:
        acc_cnt += 1
print(acc_cnt / len(y))
```

    Precision, Recall and F1

  • Confusion matrix for binary classification

Tip: in TP, the P means the prediction is positive and the T means the prediction is true (correct). Likewise, TN means the prediction is negative and the prediction is correct.
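sklearn computes this matrix directly via `confusion_matrix`; note its layout puts true labels on rows and predicted labels on columns, so for labels [0, 1] the matrix reads [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small hypothetical label vectors for illustration
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
# ravel() flattens [[TN, FP], [FN, TP]] in that order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```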

  • Precision P and recall R

    P = TP / (TP + FP),   R = TP / (TP + FN)

  • F1 score

    F1 = 2 × P × R / (P + R)

  • Precision and recall for multi-class problems

"Macro" precision, recall and F1 (compute the per-class P_i and R_i, then average over the n classes):
    macro-P = (1/n) Σ_i P_i,  macro-R = (1/n) Σ_i R_i,  macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R)
"Micro" precision, recall and F1 (first average the counts TP, FP, TN and FN across classes, then compute):
    micro-P = avg(TP) / (avg(TP) + avg(FP)),  micro-R = avg(TP) / (avg(TP) + avg(FN)),  micro-F1 = 2 × micro-P × micro-R / (micro-P + micro-R)
The difference between macro and micro: macro computes the metric for each class and then averages, while micro first averages TP, FP, TN and FN and then computes the metric.
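A hypothetical binary example where the two averages actually differ: class 0 is predicted perfectly (P_0 = 1) while class 1 has precision 0.5, so macro (averaging per-class precisions) and micro (pooling all TP/FP counts first) give different numbers.

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# macro: mean of per-class precisions (1.0 + 0.5) / 2 = 0.75
macro = precision_score(y_true, y_pred, average='macro')
# micro: pooled TP / (TP + FP) = 4/6 ≈ 0.667
micro = precision_score(y_true, y_pred, average='micro')
print(macro, micro)
```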

  • Concise precision and recall (sklearn)

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print('macro precision:', precision_score(y, y_pred, average='macro'))
print('macro recall:', recall_score(y, y_pred, average='macro'))
print('macro F1 score:', f1_score(y, y_pred, average='macro'))
```
  • Manual precision and recall (three classes)

```python
def get_score(y, y_pred, label):
    # One-vs-rest counts for the given class label
    TP, TN, FP, FN = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == label and y_pred[i] == label:
            TP += 1
        elif y[i] == label and y_pred[i] != label:
            FN += 1
        elif y[i] != label and y_pred[i] == label:
            FP += 1
        else:
            TN += 1
    return [TP, TN, FP, FN]

score_0 = get_score(y, y_pred, 0)
p_0 = score_0[0] / (score_0[0] + score_0[2])
r_0 = score_0[0] / (score_0[0] + score_0[3])
f1_0 = 2 * p_0 * r_0 / (p_0 + r_0)
score_1 = get_score(y, y_pred, 1)
p_1 = score_1[0] / (score_1[0] + score_1[2])
r_1 = score_1[0] / (score_1[0] + score_1[3])
f1_1 = 2 * p_1 * r_1 / (p_1 + r_1)
score_2 = get_score(y, y_pred, 2)
p_2 = score_2[0] / (score_2[0] + score_2[2])
r_2 = score_2[0] / (score_2[0] + score_2[3])
f1_2 = 2 * p_2 * r_2 / (p_2 + r_2)
macro_p = (p_0 + p_1 + p_2) / 3
macro_r = (r_0 + r_1 + r_2) / 3
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
macro_f1_1 = (f1_0 + f1_1 + f1_2) / 3
print('macro precision score:', macro_p)
print('macro recall score:', macro_r)
print('macro f1 score:', macro_f1)
print('macro f1 score (mean of per-class F1):', macro_f1_1)
```

Notes: there are two ways to compute macro-F1 (and micro-F1) — the formula in the book, and averaging the per-class F1 scores. See https://arxiv.org/pdf/1911.03347.pdf for the difference between them; computing all per-class F1 scores and averaging them (the approach sklearn uses) is the recommended one.
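The two macro-F1 definitions really do disagree in general. A tiny sketch with hypothetical per-class scores (values chosen only to make the gap visible):

```python
p = [1.0, 0.5]  # assumed per-class precisions
r = [0.5, 0.5]  # assumed per-class recalls

# Per-class F1 scores
f1 = [2 * pi * ri / (pi + ri) for pi, ri in zip(p, r)]

# (a) The book's definition: harmonic mean of macro-P and macro-R
macro_p, macro_r = sum(p) / 2, sum(r) / 2
book_macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
# (b) sklearn's definition: arithmetic mean of per-class F1 scores
mean_macro_f1 = sum(f1) / 2

print(book_macro_f1, mean_macro_f1)  # 0.6 vs 7/12 ≈ 0.583
```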