Introduction
Framework
- Assumptions
- Notation
CL Methods
Theory
- Noiseless Predicted Probabilities
Using
- 交叉验证
  - sklearn—KFold and StratifiedKFold
StratifiedKFold: 抽样后的训练集和验证集的样本分类比例和原有的数据集尽量是一样的
对(X, Y1)进行抽样
Y1中有5个类别，比例为1:1:1:1:1
所以，每个KFold的样本数必须为 1x+1x+1x+1x+1*x=5x个样本
复现Cifar10和imagenet效果

Northcutt, Curtis G., Lu Jiang, and Isaac L. Chuang. “Confident learning: Estimating uncertainty in dataset labels.” arXiv preprint arXiv:1911.00068 (2019).

论文：https://arxiv.org/abs/1911.00068
代码：https://github.com/cgnorthcutt/cleanlab
关键词： “Confident learning” ，“estimate noise” ， “train with confidenc” , “cleanlab”

标注数据存在错误怎么办？MIT&Google提出用置信学习找出错误标注（附开源实现）
Cleanlab中文文档教程

什么是Confident learning (CL)?
pruning noisy data, counting to estimate noise, and ranking examples to train with confidence
评估noise，并进行统计过滤，使用confidence score进行训练。

model-centric or data-centric?

classification noise process (CNP)

Introduction

Confident learning主要分为以下分支：

Prune, to search for label errors,如[1] [2] 通过loss - reweighting避免noise label在训练迭代中负面影响
Count, to train on clean data, avoiding error-propagation in learned model weights from reweighting the loss 训练在clean data上面
Rank which examples to use during training, to allow learning with unnormalized probabilities or decision boundary distances

上图是一个传统的Confident learning。

Three key contributions

证明了CL 可以精确的评估noisy和true lable的联合分布，通过label errors的精确识别
提供了CL在三个tasks上的表现： (a) label noise estimation, (b) label error finding (c) learning with noisy labels，提升resnet在imagenet上的精度，和总结介绍了最近7中sota的方法
我们将cleanlab1作为标准Python包进行开源，以重现所有的结果

Framework

表示the set of Confident Learning - 图7 examples对应noise label

Assumptions

基于a class conditional classification noise process (CNP)[3],
表示对于 Confident Learning - 图12 类被错误的labled为 Confident Learning - 图13 类，对于每一个 Confident Learning - 图14 ,都有对于的 Confident Learning - 图15 ，一共有 Confident Learning - 图16 个概率。

Notation

先贴一个Notation
**
表示错误label为cat的sample,

Sparsity表示在混淆矩阵非对角线概率中是稀疏的，比如在imagenet中 Confident Learning - 图19 有更高的概率被mislabele为 Confident Learning - 图20 ，但是不太会被标为 Confident Learning - 图21
Self-Confidence指的是example Confident Learning - 图22 belongs to its given label Confident Learning - 图23 的概率，用公式表示为
如果low self-confidence过于低可能就是一个label error.

CL Methods

Confident learning 评估在noisy和(true) latent labels之间的联合分布
The main procedure consists of three steps:
（1）、评估m×m joint distribution matrix 去描述class-conditional label noise
（2）、filter out noisy examples
（3）、使用re-weighting examples by class weights去训练网络。

下面基于这三个step对一些方式进行分析，主要用到两个input：

predicted probabilities:
array of noisy labels:

Count: Label Noise Characterization

通过进行Counts。举个例子表示有十个 examples are labeled 3 but should be labeled 1。

Confusion matrix

是一个confusion matrix对于given labels 和预测。在每个class概率分布不同的情况下会失效。即是下面这种情况：

The confident joint

计算计数矩阵 Confident Learning - 图37 （类似于混淆矩阵），如图1中的 Confident Learning - 图38 意味着，人工标记为dog但实际为fox的样本为40个。具体的操作流程如图2所示：

Estimate the joint

给定一个，我们可以计算如下：

Count：估计噪声标签和真实标签的联合分布

我们定义噪声标签为 Confident Learning - 图44 ，即经过初始标注（也许是人工标注）、但可能存在错误的样本；定义真实标签为 Confident Learning - 图45 ，但事实上我们并不会获得真实标签，通常可通过交叉验证对真实标签进行估计。此外，定义样本总数为 Confident Learning - 图46 ，类别总数为 Confident Learning - 图47 。
为了估计联合分布，共需要4步：

step 1 : 交叉验证：
- 首先需要通过对数据集集进行交叉验证计算第样本在第个类别下的概率；
- 然后计算每个人工标定类别下的平均概率作为置信度阈值；
- 最后对于样本，其真实标签为个类别中的最大概率，并且 ;
step 2: 计算计数矩阵（类似于混淆矩阵），如图1中的意味着，人工标记为dog但实际为fox的样本为40个。具体的操作流程如图2所示：
step 3 : 标定计数矩阵：目的就是为了让计数总和与人工标记的样本总数相同。计算公式如下面所示，其中为人工标记标签的样本总个数：

Confident Learning - 图62

step 4 : 估计噪声标签和真实标签的联合分布，可通过下式求得：

Confident Learning - 图66

Rank and Prune: Data Cleaning

Two approaches are:
(1)、use the off-diagonals of,混淆矩阵

Method: .用混淆矩阵的非对角线去过滤noise label

(2)、评估label error的数量使用predicted probability去rank然后remove error。

在得到噪声标签和真实标签的联合分布 Confident Learning - 图70 ，论文共提出了5种方法过滤错误样本。

Method 1: ，选取的样本进行过滤，即选取最大概率对应的下标与人工标签不一致的样本。
Method 2: ，选取构造计数矩阵过程中、进入非对角单元的样本进行过滤。
Method 3: Prune by Class (PBC) ，即对于人工标记的每一个类别 ,选取个样本过滤，并按照最低概率排序。
Method 4: Prune by Noise Rate (PBNR) ，对于计数矩阵的非对角单元，选取个样本进行过滤，并按照最大间隔排序。
Method 5: C+NR，同时采用Method 3和Method 4.

上述这些过滤样本的方法在cleanlab也有提供，我们只要提供2个输入、1行code即可clean错误样本：

import cleanlab
# 输入
# s:噪声标签
# psx: n x m 的预测概率概率，通过交叉验证获得
# Method 3：Prune by Class (PBC)
baseline_cl_pbc = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_class',n_jobs=1)
# Method 4：Prune by Noise Rate (PBNR)
baseline_cl_pbnr = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_noise_rate',n_jobs=1)
# Method 5：C+NR
baseline_cl_both = cleanlab.pruning.get_noise_indices(s, psx, prune_method='both',n_jobs=1)

Theory

Noiseless Predicted Probabilities

Lemma 1 （Ideal Thresholds）

对于i类的阈值取p(i)的期望，

Using

Github Paper Blog

New to cleanlab? Start with:

The main procedure is simple:

Compute cross-validated predicted probabilities.
Use cleanlab to find the label errors in CIFAR-10.
Remove errors and train on cleaned data via Co-Teaching.

import matplotlib.pyplot as plt
#折线图
x = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]#点的横坐标
pre = [0,1,0.99,0.98,0.96,0.94,0.91,0.86,0.81,0.75,0.71]#线1的纵坐标
recal = [0,0.06,0.14,0.24,0.35,0.46,0.58,0.69,0.81,0.90,0.96]#线2的纵坐标
plt.figure(figsize=(10,8))
plt.plot(x,pre,'s-',color = 'r',label="ATT-RLSTM")#s-:方形
plt.plot(x,recal,'o-',color = 'g',label="CNN-RLSTM")#o-:圆形
plt.xlabel("fraction noise(%)")#横坐标名字
# plt.ylabel("noise rate")#纵坐标名字
plt.legend(loc = "best")#图例
plt.show()
import matplotlib.pyplot as plt
#折线图
x = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]#点的横坐标
noiserate = [0.361,0.346,0.326,0.299,0.27,0.2352,0.1963,0.1552,0.1076,0.0613,0.0306]#线1的纵坐标
plt.figure(figsize=(10,8))
plt.plot(x,noiserate,'s-',color = 'r')#s-:方形
plt.xlabel("fraction noise(%)")#横坐标名字
plt.ylabel("noise rate(%)")#纵坐标名字
plt.show()
import matplotlib.pyplot as plt
#折线图
x = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]#点的横坐标
noiserate = [50000,48940,47441,45601,43530,41180,38508,35509,32031,28389,25741]#线1的纵坐标
plt.figure(figsize=(10,8))
plt.plot(x,noiserate,'s-',color = 'r')#s-:方形
plt.xlabel("fraction noise(%)")#横坐标名字
plt.ylabel("training sample")#纵坐标名字
plt.show()

交叉验证

对训练集通过交叉验证来找出一些可能存在错误标注的样本
为什么要进行交叉验证？
将原始数据分为3个数据集合，我们就大大减少了可用于模型训练的样本数量，并且得到的结果依赖于集合对（训练，验证）的随机选择。
在所有CV方法中，最基本的方法被称之为，k-折交叉验证：

将 k−1 份训练集子集作为 training data （训练集）训练模型;
将剩余的 1 份训练集子集作为验证集用于模型验证（也就是利用该数据子集计算模型的性能指标，例如准确率）。
sklearn—KFold and StratifiedKFold
KFold划分数据集的原理：根据n_split直接进行划分
StratifiedKFold划分数据集的原理：划分后的训练集和验证集中类别分布尽量和原数据集一样 ```
StratifiedKFold: 抽样后的训练集和验证集的样本分类比例和原有的数据集尽量是一样的
　对(X, Y1)进行抽样
Y1中有5个类别，比例为1:1:1:1:1
所以，每个KFold的样本数必须为 1x+1x+1x+1x+1*x=5x个样本
stratifiedKFolds = StratifiedKFold(n_splits=2, shuffle=False) for (trn_idx, val_idx) in stratifiedKFolds.split(X, Y2): print((trn_idx, val_idx)) print((len(trn_idx), len(val_idx)))

Out: (array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])) (10, 10) (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])) (10, 10)


<a name="IfrL3"></a>
## Get started with easy, quick examples
<a name="m9Tz9"></a>
### Use `cleanlab` with any model (Tensorflow, caffe2, PyTorch, etc.)
`cleanlab` package可以服务任何model，包括PyTorch, Tensorflow, caffe2, scikit-learn, mxnet等等。<br />如果使用scikit-learn classifier， cleanlab就能直接使用out-of-the-box，如果使用其他的框架例如pytorch、tensorflow，也只需要编写我们的classifer去继承`sklearn.base.BaseEstimator`这个类。
```python
from sklearn.base import BaseEstimator
class YourFavoriteModel(BaseEstimator): # Inherits sklearn base classifier
    def __init__(self, ):
        pass
    def fit(self, X, y, sample_weight=None):
        pass
    def predict(self, X):
        pass
    def predict_proba(self, X):
        pass
    def score(self, X, y, sample_weight=None):
        pass

# Now you can use your model with `cleanlab`. Here's one example:
from cleanlab.classification import LearningWithNoisyLabels
lnl = LearningWithNoisyLabels(clf=YourFavoriteModel())
lnl.fit(train_data, train_labels_with_errors)

这里有一个pytorch MINIST cnn class例子

复现Cifar10和imagenet效果

[1] Chen, P., Liao, B. B., Chen, G., and Zhang, S. Understand- ing and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning (ICML), 2019.
[2] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[3] Angluin, D. and Laird, P. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[4] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

Confident Learning