Problem Definition & Formulation

Design pretext tasks that could improve dense downstream prediction or classification tasks, particularly semantic segmentation.

How to Design Pretext Tasks

  1. Augmentation
  2. Loss function

    Mathematical Formulation

    (possibly not entirely correct)
  • Encoder $f_\theta$, parameters $\theta$
  • Loss for semantic segmentation $\mathcal{L}_{seg}$
  • Loss for the dense-prediction pretext task $\mathcal{L}_{pre}$
  1. Design an augmentation set $\mathcal{A}$ such that for two augmentations $a_1, a_2 \in \mathcal{A}$, the dense representations of the two views of an image $x$ stay consistent, i.e. $f_\theta(a_1(x)) \approx f_\theta(a_2(x))$.
  2. Design a loss function $\mathcal{L}_{pre}$ such that minimizing it also yields a good segmentation model, i.e. $\theta^* = \arg\min_\theta \mathcal{L}_{pre}(\theta)$ gives a small $\mathcal{L}_{seg}(\theta^*)$.

    Goal

  1. Based on existing models pre-trained via self-supervision for classification, further pre-training on the pretext task designed for dense prediction could improve the downstream semantic segmentation task at a small computational cost.
    (Start from an existing self-supervised classification model, then continue pre-training for dense prediction.)

  2. A pretext task for dense prediction that could improve model performance on semantic segmentation, relative to self-supervised pre-training for classification, when pre-training from scratch.
    (Start from random initialization, pre-train for dense prediction, and improve semantic segmentation performance.)

    Related Work

    Classification

  1. Explicitly adding context information during self-supervised pre-training could improve dense downstream tasks, in our case, semantic segmentation.
  2. Removing segmentation-irrelevant information.

    How to Encode Context Information

  • Adjacent Pixel Aggregation
    • Weighted sum, like attention
  • Co-occurrence
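As an illustrative sketch (not from the report), the "weighted sum, like attention" idea for adjacent-pixel aggregation might look as follows; the function name `neighborhood_attention`, the window size, and the temperature are assumptions for the example:

```python
import numpy as np

def neighborhood_attention(feat, size=3, temperature=1.0):
    """Replace each pixel's feature with an attention-weighted sum of its
    (size x size) neighborhood, using dot-product similarity as the weight.

    feat: (H, W, C) feature map. Returns an (H, W, C) context-enhanced map.
    """
    H, W, C = feat.shape
    r = size // 2
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + size, j:j + size].reshape(-1, C)  # neighbors
            logits = window @ feat[i, j] / temperature              # similarity to center
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()                                # softmax over neighbors
            out[i, j] = weights @ window                            # weighted sum
    return out
```

A learned version would add query/key/value projections; this keeps only the aggregation step.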

    Methods

    Patch-level Rotation

    Image -> Divide into patches -> Randomly rotate each patch (i.i.d.) -> Feed into an encoder-decoder network -> Predict the rotation angle of each patch (pixel-level supervision)
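The data side of this pipeline can be sketched in NumPy; restricting rotations to multiples of 90 degrees is an assumption for the example, as is the helper name `patchwise_rotate`:

```python
import numpy as np

def patchwise_rotate(img, patch=8, rng=None):
    """Split an (H, W, C) image into non-overlapping square patches, rotate
    each patch i.i.d. by a random multiple of 90 degrees, and return the
    rotated image plus a dense (H, W) label map of rotation classes
    {0, 1, 2, 3}, so a decoder can be supervised at pixel level.
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = img.shape
    assert H % patch == 0 and W % patch == 0
    out = img.copy()
    labels = np.zeros((H, W), dtype=np.int64)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            k = rng.integers(4)  # number of 90-degree rotations
            out[i:i+patch, j:j+patch] = np.rot90(img[i:i+patch, j:j+patch], k)
            labels[i:i+patch, j:j+patch] = k  # same label for every pixel in the patch
    return out, labels
```

Training then amounts to a 4-way per-pixel classification between the decoder output and the label map.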

    Encoder-Decoder MoCo

    Image -> Encoder-decoder network used as the MoCo encoder: average-pool the decoder output to (1, 1), producing one vector that represents the image -> MoCo
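A minimal sketch of the pooling and contrastive steps, assuming NumPy and an (H, W, C) decoder output; `image_embedding` and `moco_logits` are illustrative names, and MoCo's momentum update and queue maintenance are omitted:

```python
import numpy as np

def image_embedding(decoder_out):
    """Collapse a decoder feature map (H, W, C) to one L2-normalised image
    vector by global average pooling, as fed to the contrastive objective."""
    v = decoder_out.mean(axis=(0, 1))        # average pool to (1, 1, C) -> (C,)
    return v / (np.linalg.norm(v) + 1e-12)   # unit norm for cosine similarity

def moco_logits(q, k, queue, T=0.07):
    """InfoNCE logits: the positive is q.k, negatives come from the queue of
    past key embeddings. The positive's label is index 0."""
    pos = q @ k        # scalar similarity to the matching key
    neg = queue @ q    # (K,) similarities to queued negatives
    return np.concatenate(([pos], neg)) / T
```

Cross-entropy against label 0 on these logits gives the standard MoCo loss.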

    Patch-level MoCo

  1. The feature map (e.g. 1/16 the size of the original image, with a high channel dimension) can be viewed as a grid of image patches.
  2. Perform patch-level matching between two views of the same image.
  3. Apply momentum contrastive learning to the patches (e.g. pull matched patches closer and push the others away).
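The steps above can be sketched as a patch-level InfoNCE loss; for simplicity this assumes patch i in one view matches patch i in the other (a real implementation would first align patches between the two crops), and the name `patch_infonce` is an assumption:

```python
import numpy as np

def patch_infonce(f_q, f_k, T=0.2):
    """Patch-level contrastive loss between two views of one image.

    f_q, f_k: (N, C) L2-normalised patch features from the two views, with
    row i of f_q assumed to match row i of f_k.
    Returns the mean cross-entropy with the matched patch as the positive.
    """
    logits = f_q @ f_k.T / T                     # (N, N) patch similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # pull matched, push the rest
```

Unmatched patches of the same image act as in-image negatives, which is what distinguishes this from the image-level loss.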

    Patch-level Encoder-Decoder MoCo

    Experiments

    Performance in mIoU on the VOCaug dataset

| Pre-training | Epoch | Decoder Head | mIoU | Note |
| :--- | :---: | :---: | ---: | :--- |
| RandomInit | - | MoCoFCN | 68.37% | 160k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9 |
| | | | 36.43% | 32k iters (first 30k in 160k) |
| Supervised-IN | - | MoCoFCN | 74.40% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile, MoCo report |
| | | PSPNet | 76.61% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.38% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 76.20% | 20k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9, mmseg report |
| MoCo | 200 | MoCoFCN | 72.65% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.45% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | DeepLabv3+ | 74.61% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | Non-Local | 73.53% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| SimCLR | 200 | MoCoFCN | 69.92% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile |
| MoCo v2 | 200 | MoCoFCN | 74.04% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.06% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.11% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 74.99% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| BYOL | 200 | MoCoFCN | 73.35% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | PSPNet* | 74.69% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 256 |
| | | PSPNet | 73.32% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 512 (default) |
| MoCo + PLR | 200 + 20(VOC) | MoCoFCN | 70.82% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, average of two runs |
| | | PSPNet | 72.59% | 30k iters, lr from 3e-3 to 3e-5, cos lr decay |
| MoCo + EDMoCo | 200 + 20(VOC) | MoCoFCN | 71.42% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay. Pre-trained with MoCoFCN, same lr for all params |
| | | PSPNet | 73.59% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay. Pre-trained with PSPNet, 1/100 base lr for encoder |
| | | Non-Local | | |
| MoCo + PLMoCo | 200 + 2(IN) | MoCoFCN | 73.27% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 73.41% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | | |
| | 200 + 50(VOC) | MoCoFCN | 71.67% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | 200 + 200(VOC) | MoCoFCN | 70.66% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| MoCo + PLEDMoCo | 200 + 20(VOC) | MoCoFCN | | 30k iters, lr from 3e-3 to 3e-5, linear lr decay. Pre-trained with MoCoFCN, 1/100 base lr for encoder |
| SimSiam | 100 | MoCoFCN | 69.82% | 30k iters, lr from 2e-3 to 2e-4, cos lr decay |

MoCoFCN is two layers of 3x3 convolution with dilation 6. DeepLabv3+ belongs to the same atrous-convolution family. PSPNet encodes context information by pooling at different scales. Non-Local encodes context information with attention (dynamic routing).
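To make the MoCoFCN building block concrete, here is a minimal NumPy sketch of one "same"-padded 3x3 convolution with dilation (the head stacks two of these with dilation 6); `dilated_conv3x3` is an illustrative name and activations/bias are omitted:

```python
import numpy as np

def dilated_conv3x3(x, w, dilation=6):
    """'Same'-padded 3x3 convolution with dilation.

    x: (H, W, Cin) input feature map; w: (3, 3, Cin, Cout) kernel.
    Padding by `dilation` on each side keeps the spatial size, so the
    output is (H, W, Cout); dilation spreads the 3x3 taps `dilation`
    pixels apart, enlarging the receptive field without extra parameters.
    """
    H, W, _ = x.shape
    Cout = w.shape[-1]
    d = dilation
    xp = np.pad(x, ((d, d), (d, d), (0, 0)))  # zero-pad by the dilation rate
    out = np.zeros((H, W, Cout))
    for ki in range(3):
        for kj in range(3):
            # tap offset relative to the center is (ki - 1) * d, (kj - 1) * d
            shifted = xp[ki*d:ki*d + H, kj*d:kj*d + W]
            out += shifted @ w[ki, kj]        # (H, W, Cin) @ (Cin, Cout)
    return out
```

In a framework this is just a conv layer with `dilation=6` and matching padding; the loop form only shows where each tap samples.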