Problem Definition & Formulation

Design pretext tasks that could improve dense downstream prediction or classification tasks, particularly semantic segmentation.

How to Design Pretext Tasks

  1. Augmentation
  2. Loss function

    Mathematical Formulation

    (possibly not entirely correct)
  • Encoder $f_\theta$, parameters $\theta$
  • Loss for semantic segmentation $\mathcal{L}_{seg}$
  • Loss for the dense-prediction pretext task $\mathcal{L}_{pre}$
  1. Design an augmentation set $\mathcal{A}$ such that for two augmentations $a_1, a_2 \in \mathcal{A}$, the dense representations of the two views of an image $x$ stay consistent, i.e. $f_\theta(a_1(x)) \approx f_\theta(a_2(x))$.
  2. Design a loss function $\mathcal{L}_{pre}$ such that minimizing it also yields a good segmentation model, i.e. $\theta^* = \arg\min_\theta \mathcal{L}_{pre}(\theta)$ gives a small $\mathcal{L}_{seg}(\theta^*)$.

    Goal

  1. Based on existing models pre-trained via self-supervision for classification, further pre-training on the pretext task designed for dense prediction could improve the downstream semantic segmentation task at a small computational cost.
    (Start from an existing self-supervised classification model, then continue pre-training for dense prediction.)

  2. A pretext task for dense prediction that could improve model performance on semantic segmentation, relative to self-supervised pre-training for classification, when pre-training from scratch.
    (Start from random initialization, pre-train for dense prediction, and improve semantic segmentation performance.)

    Related Work

    Classification

  1. Explicitly adding context information during self-supervised pre-training could improve dense downstream tasks, in our case, semantic segmentation.
  2. Removing segmentation-irrelevant information.

    How to Encode Context Information

  • Adjacent Pixel Aggregation
    • Weighted sum, like attention
  • Co-occurrence
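As an illustrative sketch (not from the report), the "weighted sum, like attention" idea for adjacent-pixel aggregation might look as follows; the function name `neighborhood_attention`, the window size, and the temperature are assumptions for the example:

```python
import numpy as np

def neighborhood_attention(feat, size=3, temperature=1.0):
    """Replace each pixel's feature with an attention-weighted sum of its
    (size x size) neighborhood, using dot-product similarity as the weight.

    feat: (H, W, C) feature map. Returns an (H, W, C) context-enhanced map.
    """
    H, W, C = feat.shape
    r = size // 2
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + size, j:j + size].reshape(-1, C)  # neighbors
            logits = window @ feat[i, j] / temperature              # similarity to center
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()                                # softmax over neighbors
            out[i, j] = weights @ window                            # weighted sum
    return out
```

A learned version would add query/key/value projections; this keeps only the aggregation step.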

    Methods

    Patch-level Rotation

    Image -> Divide into patches -> Randomly rotate each patch (i.i.d.) -> Feed into an encoder-decoder network -> Predict the rotation angle of each patch (pixel-level supervision)
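The data side of this pipeline can be sketched in NumPy; restricting rotations to multiples of 90 degrees is an assumption for the example, as is the helper name `patchwise_rotate`:

```python
import numpy as np

def patchwise_rotate(img, patch=8, rng=None):
    """Split an (H, W, C) image into non-overlapping square patches, rotate
    each patch i.i.d. by a random multiple of 90 degrees, and return the
    rotated image plus a dense (H, W) label map of rotation classes
    {0, 1, 2, 3}, so a decoder can be supervised at pixel level.
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = img.shape
    assert H % patch == 0 and W % patch == 0
    out = img.copy()
    labels = np.zeros((H, W), dtype=np.int64)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            k = rng.integers(4)  # number of 90-degree rotations
            out[i:i+patch, j:j+patch] = np.rot90(img[i:i+patch, j:j+patch], k)
            labels[i:i+patch, j:j+patch] = k  # same label for every pixel in the patch
    return out, labels
```

Training then amounts to a 4-way per-pixel classification between the decoder output and the label map.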

    Encoder-Decoder MoCo

    Image -> Encoder-decoder network used as the MoCo encoder: average-pool the decoder output to (1, 1), producing one vector that represents the image -> MoCo
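A minimal sketch of the pooling and contrastive steps, assuming NumPy and an (H, W, C) decoder output; `image_embedding` and `moco_logits` are illustrative names, and MoCo's momentum update and queue maintenance are omitted:

```python
import numpy as np

def image_embedding(decoder_out):
    """Collapse a decoder feature map (H, W, C) to one L2-normalised image
    vector by global average pooling, as fed to the contrastive objective."""
    v = decoder_out.mean(axis=(0, 1))        # average pool to (1, 1, C) -> (C,)
    return v / (np.linalg.norm(v) + 1e-12)   # unit norm for cosine similarity

def moco_logits(q, k, queue, T=0.07):
    """InfoNCE logits: the positive is q.k, negatives come from the queue of
    past key embeddings. The positive's label is index 0."""
    pos = q @ k        # scalar similarity to the matching key
    neg = queue @ q    # (K,) similarities to queued negatives
    return np.concatenate(([pos], neg)) / T
```

Cross-entropy against label 0 on these logits gives the standard MoCo loss.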

    Patch-level MoCo

  1. The feature map (e.g. 1/16 the size of the original image, with a high channel dimension) can be viewed as a grid of image patches.
  2. Perform patch-level matching between two views of the same image.
  3. Apply momentum contrastive learning to the patches (e.g. pull matched patches closer and push the others away).
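The steps above can be sketched as a patch-level InfoNCE loss; for simplicity this assumes patch i in one view matches patch i in the other (a real implementation would first align patches between the two crops), and the name `patch_infonce` is an assumption:

```python
import numpy as np

def patch_infonce(f_q, f_k, T=0.2):
    """Patch-level contrastive loss between two views of one image.

    f_q, f_k: (N, C) L2-normalised patch features from the two views, with
    row i of f_q assumed to match row i of f_k.
    Returns the mean cross-entropy with the matched patch as the positive.
    """
    logits = f_q @ f_k.T / T                     # (N, N) patch similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # pull matched, push the rest
```

Unmatched patches of the same image act as in-image negatives, which is what distinguishes this from the image-level loss.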

    Patch-level Encoder-Decoder MoCo

    Experiments

    Performance in mIoU on the VOCaug dataset

| Pre-training | Epoch | Decoder Head | mIoU | Note |
| :--- | :---: | :---: | ---: | :--- |
| RandomInit | - | MoCoFCN | 68.37% | 160k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9 |
| | | | 36.43% | 32k iters (first 30k in 160k) |
| Supervised-IN | - | MoCoFCN | 74.40% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile, MoCo report |
| | | PSPNet | 76.61% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.38% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 76.20% | 20k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9, mmseg report |
| MoCo | 200 | MoCoFCN | 72.65% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.45% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | DeepLabv3+ | 74.61% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | Non-Local | 73.53% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| SimCLR | 200 | MoCoFCN | 69.92% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile |
| MoCo v2 | 200 | MoCoFCN | 74.04% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.06% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.11% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 74.99% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| BYOL | 200 | MoCoFCN | 73.35% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | PSPNet* | 74.69% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 256 |
| | | PSPNet | 73.32% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 512 (default) |
| MoCo + PLR | 200 + 20(VOC) | MoCoFCN | 70.82% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, average of two runs |
| | | PSPNet | 72.59% | 30k iters, lr from 3e-3 to 3e-5, cos lr decay |
| MoCo + EDMoCo | 200 + 20(VOC) | MoCoFCN | 71.42% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay. Pre-trained with MoCoFCN, same lr for all params |
| | | PSPNet | 73.59% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay. Pre-trained with PSPNet, 1/100 base lr for encoder |
| | | Non-Local | | |
| MoCo + PLMoCo | 200 + 2(IN) | MoCoFCN | 73.27% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 73.41% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | | |
| | 200 + 50(VOC) | MoCoFCN | 71.67% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | 200 + 200(VOC) | MoCoFCN | 70.66% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| MoCo + PLEDMoCo | 200 + 20(VOC) | MoCoFCN | | 30k iters, lr from 3e-3 to 3e-5, linear lr decay. Pre-trained with MoCoFCN, 1/100 base lr for encoder |
| SimSiam | 100 | MoCoFCN | 69.82% | 30k iters, lr from 2e-3 to 2e-4, cos lr decay |

MoCoFCN is two layers of 3x3 convolution with dilation 6. DeepLabv3+ belongs to the same atrous-convolution family. PSPNet encodes context information by pooling at different scales. Non-Local encodes context information with attention (dynamic routing).
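To make the MoCoFCN building block concrete, here is a minimal NumPy sketch of one "same"-padded 3x3 convolution with dilation (the head stacks two of these with dilation 6); `dilated_conv3x3` is an illustrative name and activations/bias are omitted:

```python
import numpy as np

def dilated_conv3x3(x, w, dilation=6):
    """'Same'-padded 3x3 convolution with dilation.

    x: (H, W, Cin) input feature map; w: (3, 3, Cin, Cout) kernel.
    Padding by `dilation` on each side keeps the spatial size, so the
    output is (H, W, Cout); dilation spreads the 3x3 taps `dilation`
    pixels apart, enlarging the receptive field without extra parameters.
    """
    H, W, _ = x.shape
    Cout = w.shape[-1]
    d = dilation
    xp = np.pad(x, ((d, d), (d, d), (0, 0)))  # zero-pad by the dilation rate
    out = np.zeros((H, W, Cout))
    for ki in range(3):
        for kj in range(3):
            # tap offset relative to the center is (ki - 1) * d, (kj - 1) * d
            shifted = xp[ki*d:ki*d + H, kj*d:kj*d + W]
            out += shifted @ w[ki, kj]        # (H, W, Cin) @ (Cin, Cout)
    return out
```

In a framework this is just a conv layer with `dilation=6` and matching padding; the loop form only shows where each tap samples.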