Northcutt, Curtis G., Lu Jiang, and Isaac L. Chuang. “Confident learning: Estimating uncertainty in dataset labels.” arXiv preprint arXiv:1911.00068 (2019).

Paper: https://arxiv.org/abs/1911.00068
Code: https://github.com/cgnorthcutt/cleanlab
Keywords: "confident learning", "estimate noise", "train with confidence", "cleanlab"

Related reading: "What if labeled data contains errors? MIT & Google propose confident learning to find label errors (with open-source implementation)"
Related reading: Cleanlab Chinese documentation tutorial

What is Confident Learning (CL)?

pruning noisy data, counting to estimate noise, and ranking examples to train with confidence
That is: estimate the noise, filter it out statistically, and train using confidence scores.

Model-centric or data-centric? CL is data-centric: it characterizes and cleans the labels themselves rather than modifying the model or the loss.

Noise model: the class-conditional classification noise process (CNP), described under Assumptions below.

Introduction

Confident learning builds on three main lines of work:

  • Prune, to search for label errors; e.g. [1] [2] use loss re-weighting to avoid the negative influence of noisy labels across training iterations
  • Count, to train on clean data, avoiding error propagation in learned model weights from re-weighting the loss
  • Rank which examples to use during training, to allow learning with unnormalized probabilities or decision boundary distances

(Figure: the standard confident learning workflow.)

Three key contributions

  • Prove that, under realistic sufficient conditions, CL exactly estimates the joint distribution of noisy and true labels and exactly identifies the label errors
  • Demonstrate CL on three tasks: (a) label noise estimation, (b) label error finding, (c) learning with noisy labels, improving ResNet accuracy on ImageNet and comparing against seven recent state-of-the-art methods
  • Open-source cleanlab as a standard Python package so that all results can be reproduced

Framework

Let $X := (x, \tilde{y})^n$ denote the dataset of $n$ examples $x$, each paired with an observed noisy label $\tilde{y}$.

Assumptions

Label noise is assumed to follow a class-conditional classification noise process (CNP) [3]: the probability of a wrong label depends only on the true class, not on the data itself, i.e. $p(\tilde{y}=i \mid y^*=j;\, x) = p(\tilde{y}=i \mid y^*=j)$.
Here $p(\tilde{y}=i \mid y^*=j)$ is the rate at which class $j$ is mislabeled as class $i$; there is one such rate for every pair of classes, giving $m \times m$ probabilities in total.
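To make the assumption concrete, here is a minimal sketch (not from the paper; the transition matrix `T` is made up) that injects class-conditional noise into clean labels:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3  # number of classes

# Hypothetical noise transition matrix: T[j][i] = p(noisy = i | true = j).
# Each row sums to 1; the off-diagonal entries are the flip rates.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

y_true = rng.integers(0, m, size=1000)                       # latent true labels y*
y_noisy = np.array([rng.choice(m, p=T[j]) for j in y_true])  # observed labels

# CNP: the flip probability depends only on the true class j, never on x itself.
```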

Notation

Key notation (the paper contains a full notation table):

  • $\tilde{y}$: the observed, possibly noisy label; $y^*$: the latent true label
  • $X_{\tilde{y}=i}$: the set of samples whose given label is $i$
  • $\hat{p}(\tilde{y}=i;\, x, \theta)$: predicted probability of label $i$ for sample $x$ under model $\theta$
  • $\hat{X}_{\tilde{y}=i,\, y^*=j}$: the estimated set of samples labeled $i$ whose true label is $j$; e.g. $\hat{X}_{\tilde{y}=\text{cat},\, y^*=\text{dog}}$ is the set of samples erroneously labeled cat that are actually dogs.

Sparsity is the fraction of zeros in the off-diagonal of the joint distribution: most class pairs are rarely confused. For example, in ImageNet an image of one class (say, a fox) has a relatively high probability of being mislabeled as a similar class (dog), but essentially never as an unrelated one (bathtub).
Self-Confidence is the probability that an example $x$ belongs to its given label $\tilde{y}$, written $\hat{p}(\tilde{y}=\tilde{y}_x;\, x, \theta)$.
A very low self-confidence is a signal of a possible label error.
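A minimal sketch of computing self-confidence, assuming `psx` holds the (n, m) out-of-sample predicted probabilities and `s` the noisy labels (variable names follow the cleanlab snippet later in this post):

```python
import numpy as np

def self_confidence(psx, s):
    # p-hat(y~ = s_k; x_k): the predicted probability of each sample's own given label
    return psx[np.arange(len(s)), s]

# Ranking by ascending self-confidence puts the likeliest label errors first:
# candidates = np.argsort(self_confidence(psx, s))
```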

CL Methods

Confident learning estimates the joint distribution between the noisy and (true) latent labels.
The main procedure consists of three steps:
(1) estimate the $m \times m$ joint distribution matrix $\hat{Q}_{\tilde{y}, y^*}$ that characterizes the class-conditional label noise;
(2) filter out the noisy examples;
(3) train with the errors removed, re-weighting the examples by the class weights $\hat{Q}_{y^*}[i] \,/\, \hat{Q}_{\tilde{y}=i,\, y^*=i}$ for each class $i$.

Below we analyze methods for these three steps. Two inputs are used throughout (a sketch for obtaining them follows the list):

  • out-of-sample predicted probabilities $\hat{P}_{k,i} = \hat{p}(\tilde{y}=i;\, x_k, \theta)$
  • the array of noisy labels $\tilde{y}_k$
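The predicted probabilities must be out-of-sample, i.e. each sample is scored by a model that never trained on it. A minimal sketch using scikit-learn (the toy data and classifier choice are placeholders, not from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for real features X and noisy labels s
X, s = make_classification(n_samples=200, n_classes=3,
                           n_informative=5, random_state=0)

# psx[k, i] = p-hat(y~ = i; x_k); each row comes from a fold that held x_k out
psx = cross_val_predict(LogisticRegression(max_iter=1000), X, s,
                        cv=5, method='predict_proba')
```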

Count: Label Noise Characterization

Counting is done via the confident joint $C_{\tilde{y}, y^*}$. For example, $C_{\tilde{y}=3,\, y^*=1} = 10$ means ten examples are labeled 3 but should be labeled 1.
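As a toy illustration (made-up counts):

```python
import numpy as np

# Rows index the given noisy label y~, columns the estimated true label y*.
C = np.array([[100,   2,   0,   3],
              [  5,  80,   4,   0],
              [  0,  10,  90,   1],
              [  0,  10,   6,  70]])
assert C[3][1] == 10  # ten examples labeled class 3 whose estimated true label is 1
```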

Confusion matrix $C_{\text{confusion}}$

$C_{\text{confusion}}$ is the confusion matrix between the given labels $\tilde{y}_k$ and the model's argmax predictions $\arg\max_j \hat{p}(\tilde{y}=j;\, x_k, \theta)$. It fails when the predicted probability distributions differ across classes: a bare argmax over-counts a systematically over-confident class and under-counts an under-confident one.
(Figure: illustration of this failure case.)
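A sketch of this baseline count (same `psx`/`s` convention as above; my own illustrative implementation, not the paper's code):

```python
import numpy as np

def confusion_count(psx, s, m):
    # C_confusion[i][j] = number of samples given label i whose argmax prediction is j
    C = np.zeros((m, m), dtype=int)
    for given, pred in zip(s, psx.argmax(axis=1)):
        C[given, pred] += 1
    return C
```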

The confident joint $C_{\tilde{y}, y^*}$

Compute the count matrix $C_{\tilde{y}, y^*}$ (analogous to a confusion matrix, but thresholded). For example, $C_{\tilde{y}=\text{dog},\, y^*=\text{fox}} = 40$ in Figure 1 means 40 samples are human-labeled dog but are actually fox. Formally,

$C_{\tilde{y}, y^*}[i][j] := \big|\hat{X}_{\tilde{y}=i,\, y^*=j}\big|, \qquad \hat{X}_{\tilde{y}=i,\, y^*=j} := \{\, x \in X_{\tilde{y}=i} : \hat{p}(\tilde{y}=j;\, x, \theta) \ge t_j \,\}$

(with ties broken by the largest $\hat{p}$; Figure 2 in the original shows the concrete procedure).

Estimate the joint $\hat{Q}_{\tilde{y}, y^*}$

Given $C_{\tilde{y}, y^*}$, we estimate $\hat{Q}_{\tilde{y}, y^*}$ by calibrating the rows to the observed class counts and normalizing:

$\hat{Q}_{\tilde{y}=i,\, y^*=j} = \dfrac{\frac{C_{\tilde{y}=i,\, y^*=j}}{\sum_{j' \in [m]} C_{\tilde{y}=i,\, y^*=j'}} \cdot |X_{\tilde{y}=i}|}{\sum_{i' \in [m],\, j' \in [m]} \left( \frac{C_{\tilde{y}=i',\, y^*=j'}}{\sum_{j'' \in [m]} C_{\tilde{y}=i',\, y^*=j''}} \cdot |X_{\tilde{y}=i'}| \right)}$

Count: estimating the joint distribution of noisy and true labels

We define the noisy label $\tilde{y}$ as the label from the initial (perhaps human) annotation, which may contain errors, and the true label $y^*$, which in practice we never observe and usually estimate via cross-validation. We also write $n$ for the total number of samples and $m$ for the number of classes.
Estimating the joint distribution takes four steps (a numpy sketch of the whole procedure follows the list):

  • Step 1: cross-validation:
    • First, run cross-validation on the dataset to compute the probability $\hat{p}(\tilde{y}=j;\, x_k, \theta)$ of the $k$-th sample under each class $j$;
    • then, for each human-labeled class $j$, compute the mean probability over its samples as the confidence threshold: $t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y}=j;\, x, \theta)$;
    • finally, for each sample $k$, take as its candidate true label $\hat{y}^*_k$ the class $j$ with the largest probability among the $m$ classes, subject to $\hat{p}(\tilde{y}=j;\, x_k, \theta) \ge t_j$;
  • Step 2: build the count matrix $C_{\tilde{y}, y^*}$ (analogous to a confusion matrix); e.g. $C_{\tilde{y}=\text{dog},\, y^*=\text{fox}} = 40$ in Figure 1 means 40 samples are human-labeled dog but are actually fox. Figure 2 shows the concrete procedure.
  • Step 3: calibrate the count matrix so that its total matches the number of human-labeled samples. The formula is below, where $|X_{\tilde{y}=i}|$ is the total number of samples with human label $i$:

$\tilde{C}_{\tilde{y}=i,\, y^*=j} = \dfrac{C_{\tilde{y}=i,\, y^*=j}}{\sum_{j' \in [m]} C_{\tilde{y}=i,\, y^*=j'}} \cdot |X_{\tilde{y}=i}|$

  • Step 4: estimate the joint distribution $\hat{Q}_{\tilde{y}, y^*}$ of the noisy labels $\tilde{y}$ and the true labels $y^*$:

$\hat{Q}_{\tilde{y}=i,\, y^*=j} = \dfrac{\tilde{C}_{\tilde{y}=i,\, y^*=j}}{\sum_{i' \in [m],\, j' \in [m]} \tilde{C}_{\tilde{y}=i',\, y^*=j'}}$
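Putting the four steps together, a minimal numpy sketch under the notation above (`psx` for $\hat{P}_{k,i}$, `s` for $\tilde{y}$; a simplified version of what cleanlab implements, ignoring ties and assuming every class appears in `s`):

```python
import numpy as np

def estimate_joint(psx, s, m):
    """psx: (n, m) out-of-sample predicted probabilities; s: (n,) noisy labels."""
    # Step 1: per-class thresholds = mean self-confidence of each given class
    t = np.array([psx[s == j, j].mean() for j in range(m)])

    # Step 2: confident joint C[i][j] = |{k : s_k = i, and the argmax over
    # classes passing their threshold is j}|
    C = np.zeros((m, m), dtype=int)
    for k in range(len(s)):
        above = np.flatnonzero(psx[k] >= t)
        if above.size == 0:
            continue                         # sample counted nowhere
        j = above[np.argmax(psx[k, above])]  # estimated true label
        C[s[k], j] += 1

    # Step 3: calibrate rows to the observed class counts |X_{y~=i}|
    row_sums = np.clip(C.sum(axis=1, keepdims=True), 1, None)
    C_cal = C / row_sums * np.bincount(s, minlength=m)[:, None]

    # Step 4: normalize so the entries form a joint distribution summing to 1
    return C_cal / C_cal.sum()
```

The label errors are then the samples counted in the off-diagonal cells of `C`, or those selected by the PBC/PBNR rankings described next.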

Rank and Prune: Data Cleaning

Two approaches are:
(1) use the off-diagonals of $C_{\text{confusion}}$, the confusion matrix:

  • Method: $C_{\text{confusion}}$ — use the off-diagonal of the confusion matrix to filter noisy labels.

(2) use $n \cdot \hat{Q}_{\tilde{y}, y^*}$ to estimate the number of label errors, rank examples by predicted probability, and remove the top-ranked ones.

Given the joint distribution $\hat{Q}_{\tilde{y}, y^*}$ of noisy and true labels, the paper proposes five methods for filtering out erroneous samples (the corresponding cleanlab one-liners appear after the list):

  • Method 1: $C_{\text{confusion}}$ — filter the samples whose argmax prediction disagrees with the given label, i.e. $\arg\max_j \hat{p}(\tilde{y}=j;\, x_k, \theta) \neq \tilde{y}_k$.
  • Method 2: $C_{\tilde{y}, y^*}$ — filter the samples that fall into the off-diagonal cells while constructing the count matrix $C_{\tilde{y}, y^*}$.
  • Method 3: Prune by Class (PBC) — for each human-labeled class $i$, filter the $n \cdot \sum_{j \in [m],\, j \neq i} \hat{Q}_{\tilde{y}=i,\, y^*=j}$ samples with the lowest self-confidence $\hat{p}(\tilde{y}=i;\, x, \theta)$.
  • Method 4: Prune by Noise Rate (PBNR) — for each off-diagonal cell $(i, j)$ of the count matrix, filter the $n \cdot \hat{Q}_{\tilde{y}=i,\, y^*=j}$ samples with the largest margin $\hat{p}_{x,j} - \hat{p}_{x,i}$.
  • Method 5: C+NR — combine Method 3 and Method 4 (cleanlab's `prune_method='both'` flags only the samples marked by both).

These filtering methods are also provided by cleanlab: given the two inputs, one line of code per method flags the erroneous samples:

```python
import cleanlab

# Inputs:
#   s:   the noisy labels
#   psx: (n, m) predicted probabilities, obtained via cross-validation

# Method 3: Prune by Class (PBC)
baseline_cl_pbc = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_class', n_jobs=1)
# Method 4: Prune by Noise Rate (PBNR)
baseline_cl_pbnr = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_noise_rate', n_jobs=1)
# Method 5: C+NR
baseline_cl_both = cleanlab.pruning.get_noise_indices(s, psx, prune_method='both', n_jobs=1)
```

Theory

Noiseless Predicted Probabilities

Lemma 1 (Ideal Thresholds)

If the predicted probabilities are ideal, i.e. $\hat{p}(\tilde{y}=i;\, x, \theta) = p(\tilde{y}=i \mid y^*=y^*_x)$, then for every class $i \in [m]$ the threshold $t_i = \frac{1}{|X_{\tilde{y}=i}|} \sum_{x \in X_{\tilde{y}=i}} \hat{p}(\tilde{y}=i;\, x, \theta)$ satisfies

$t_i = \sum_{j \in [m]} p(\tilde{y}=i \mid y^*=j)\, p(y^*=j \mid \tilde{y}=i)$

In other words, the threshold for class $i$ is the expected probability of class $i$ over the samples labeled $i$.

Usage

GitHub · Paper · Blog

New to cleanlab? Start with:

  1. Visualizing confident learning
  2. A simple example of learning with noisy labels on the multiclass Iris dataset.

The main procedure is simple:

  1. Compute cross-validated predicted probabilities.
  2. Use cleanlab to find the label errors in CIFAR-10.
  3. Remove errors and train on cleaned data via Co-Teaching.
Plotting the results:

```python
import matplotlib.pyplot as plt

# x-axis for all three line charts: the injected noise fraction
x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Precision and recall of label-error finding
pre = [0, 1, 0.99, 0.98, 0.96, 0.94, 0.91, 0.86, 0.81, 0.75, 0.71]
recal = [0, 0.06, 0.14, 0.24, 0.35, 0.46, 0.58, 0.69, 0.81, 0.90, 0.96]
plt.figure(figsize=(10, 8))
plt.plot(x, pre, 's-', color='r', label="precision")  # 's-': squares
plt.plot(x, recal, 'o-', color='g', label="recall")   # 'o-': circles
plt.xlabel("noise fraction")
plt.legend(loc="best")
plt.show()

# Noise rate at each noise fraction
noiserate = [0.361, 0.346, 0.326, 0.299, 0.27, 0.2352, 0.1963, 0.1552, 0.1076, 0.0613, 0.0306]
plt.figure(figsize=(10, 8))
plt.plot(x, noiserate, 's-', color='r')
plt.xlabel("noise fraction")
plt.ylabel("noise rate")
plt.show()

# Training-set size at each noise fraction
nsamples = [50000, 48940, 47441, 45601, 43530, 41180, 38508, 35509, 32031, 28389, 25741]
plt.figure(figsize=(10, 8))
plt.plot(x, nsamples, 's-', color='r')
plt.xlabel("noise fraction")
plt.ylabel("training samples")
plt.show()
```

Cross-validation

We run cross-validation on the training set to find samples that may be mislabeled.
Why cross-validation?
Splitting the original data into three fixed sets greatly reduces the number of samples available for model training, and the result depends on the random choice of the (train, validation) pair.
Among all CV methods, the most basic one is k-fold cross-validation:

  • use k−1 of the k training subsets as the training data to fit the model;
  • use the remaining subset as the validation set (i.e. compute performance metrics such as accuracy on it).

sklearn: KFold and StratifiedKFold

KFold splits the dataset directly according to n_splits. StratifiedKFold splits it so that the class distribution in each training and validation fold matches the original dataset as closely as possible.

For example, when sampling (X, Y1) where Y1 has 5 classes in a 1:1:1:1:1 ratio, every stratified fold must contain 5x samples: x from each class.

A runnable version of the snippet (X and Y2 are illustrative here: 20 samples with two alternating classes, which reproduces the output below):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)   # 20 samples, 2 features (illustrative)
Y2 = np.array([0, 1] * 10)         # two classes, alternating

stratifiedKFolds = StratifiedKFold(n_splits=2, shuffle=False)
for (trn_idx, val_idx) in stratifiedKFolds.split(X, Y2):
    print((trn_idx, val_idx))
    print((len(trn_idx), len(val_idx)))
```

Out:

```
(array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
(10, 10)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]))
(10, 10)
```


<a name="IfrL3"></a>
## Get started with easy, quick examples
<a name="m9Tz9"></a>
### Use `cleanlab` with any model (Tensorflow, caffe2, PyTorch, etc.)
The `cleanlab` package works with any model, including PyTorch, Tensorflow, caffe2, scikit-learn, mxnet, and so on. If you use a scikit-learn classifier, cleanlab works out-of-the-box; for other frameworks such as PyTorch or TensorFlow, simply write a classifier that inherits from `sklearn.base.BaseEstimator`:
```python
from sklearn.base import BaseEstimator
class YourFavoriteModel(BaseEstimator): # Inherits sklearn base classifier
    def __init__(self, ):
        pass
    def fit(self, X, y, sample_weight=None):
        pass
    def predict(self, X):
        pass
    def predict_proba(self, X):
        pass
    def score(self, X, y, sample_weight=None):
        pass

# Now you can use your model with `cleanlab`. Here's one example:
from cleanlab.classification import LearningWithNoisyLabels
lnl = LearningWithNoisyLabels(clf=YourFavoriteModel())
lnl.fit(train_data, train_labels_with_errors)

```

Here is a PyTorch MNIST CNN class example.

Reproducing the CIFAR-10 and ImageNet results


[1] Chen, P., Liao, B. B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning (ICML), 2019.
[2] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[3] Angluin, D. and Laird, P. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[4] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Conference on Neural Information Processing Systems (NeurIPS), 2018.