Problem Definition & Formulation
Design pretext tasks that could improve downstream dense prediction or classification tasks, particularly semantic segmentation.
How to Design Pretext Tasks
- Encoder $f_\theta$ with parameters $\theta$
- Loss $\mathcal{L}_{seg}$ for semantic segmentation
- Loss $\mathcal{L}_{pre}$ for the dense-prediction pretext task
- Design an augmentation set $\mathcal{A}$, such that for two augmentations $a_1, a_2 \in \mathcal{A}$, we have $f_\theta(a_1(x)) \approx f_\theta(a_2(x))$ up to the spatial correspondence induced by $a_1$ and $a_2$, where $x$ is an input image.
- Design a loss function $\mathcal{L}_{pre}$, such that $\theta^\ast = \arg\min_\theta \mathcal{L}_{pre}(\theta)$ also attains a low $\mathcal{L}_{seg}(\theta^\ast)$, where $\theta^\ast$ denotes the pre-trained parameters.
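One way to read the augmentation-design constraint above: for dense prediction, the encoder's output map should match across augmented views once the spatial transform is undone, i.e. be equivariant to spatial augmentations. A toy numpy check of this property, with a box filter standing in for the dense encoder and a horizontal flip as the augmentation (all names illustrative):

```python
import numpy as np

def encoder(x):
    # Toy "dense encoder": a 3x3 box filter (one convolution layer),
    # which is equivariant to horizontal flips.
    pad = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = pad[i:i + 3, j:j + 3].mean()
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))

a = np.fliplr  # a spatial augmentation

# Dense features agree up to the spatial correspondence induced by
# the augmentation: f(a(x)) == a(f(x)).
lhs = encoder(a(x))
rhs = a(encoder(x))
print(np.allclose(lhs, rhs))  # True
```

A classification-style pretext task only demands invariance of a pooled vector; the point here is that a dense pretext task should constrain the full feature map in this per-position way.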
Goal
- Starting from existing models self-supervised pre-trained for classification, further pre-training on a pretext task designed for dense prediction should improve the downstream semantic segmentation task, at a small computation cost.
- When pre-training from scratch (random init.), a pretext task for dense prediction should improve model performance on semantic segmentation relative to self-supervised pre-training for classification.
Related Work
Classification
- Explicitly adding context information during self-supervised pre-training could improve dense downstream tasks, in our case, semantic segmentation.
- Removing segmentation-irrelevant information.
How to Encode Context Information
- Adjacent Pixel Aggregation
- Weighted sum, like attention
- Co-occurrence
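The "weighted sum, like attention" option above could be realized in the non-local / self-attention style: every position aggregates context as an attention-weighted sum over all positions. A minimal numpy sketch, with keys, queries, and values all set to the raw features (a simplification; function and variable names are illustrative):

```python
import numpy as np

def attention_aggregate(feats):
    """Context aggregation as an attention-weighted sum over positions.

    feats: (N, C) array, one C-dim feature per spatial position
    (a flattened feature map).
    """
    # Similarity between every pair of positions, scaled as in attention.
    logits = feats @ feats.T / np.sqrt(feats.shape[1])
    # Softmax over positions -> attention weights (rows sum to 1).
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of features = context-aware representation.
    return w @ feats

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))  # e.g. a 4x4 feature map, flattened
ctx = attention_aggregate(feats)
print(ctx.shape)  # (16, 8)
```

Adjacent pixel aggregation is the local special case (weights nonzero only in a neighborhood), while attention lets the weights span the whole image.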
Methods
Patch-level Rotation
Image -> Divide into patches -> Randomly (i.i.d.) rotate each patch -> Feed into an encoder-decoder network -> Predict the rotation angle of each patch (pixel-level supervision)
Encoder-Decoder MoCo
Image -> Encoder-decoder as the encoder, average-pool the decoder output to (1, 1) to get one vector representing the image -> MoCo
Patch-level MoCo
- Feature map (e.g. 1/16 size of original image with high channel dim) contains patches of image.
- Perform patch level matching on two views of the same image.
- Momentum contrastive learning on patches (e.g. pull matched patches closer and push others away)
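The patch-level contrastive step above can be sketched as an InfoNCE loss over patch features, where each patch's match in the other view is the positive and all other patches are negatives. A minimal numpy sketch, assuming the two views are already spatially aligned so row i of one feature map matches row i of the other (a simplification: with real crops the matching must be computed from the augmentation geometry; names and the temperature are illustrative):

```python
import numpy as np

def patch_infonce(q, k, tau=0.2):
    """InfoNCE over patches of two views of one image.

    q, k: (P, C) L2-normalized patch features from the query encoder
    and the momentum encoder; row i of q matches row i of k.
    """
    logits = q @ k.T / tau  # (P, P) patch-to-patch similarities
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; all other patches are negatives,
    # so minimizing this pulls matched patches closer and pushes others away.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
f = rng.standard_normal((49, 16))  # e.g. a 7x7 feature map, 16 channels
q = f / np.linalg.norm(f, axis=1, keepdims=True)
k = q + 0.01 * rng.standard_normal(q.shape)  # momentum encoder's slightly different view
k /= np.linalg.norm(k, axis=1, keepdims=True)
loss = patch_infonce(q, k)
```

In the full method the negatives could also include a memory queue of patches from other images, as in MoCo; the sketch keeps only the within-image negatives.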
Patch-level Encoder-Decoder MoCo
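As a concrete example of the patch-level tasks above, the input/target construction for patch-level rotation might look like the following numpy sketch (the patch size and the restriction to multiples of 90 degrees are assumptions):

```python
import numpy as np

def patch_rotation_task(img, patch=4):
    """Build an input/target pair for the patch-level rotation task:
    split the image into patches, rotate each patch by an i.i.d. random
    multiple of 90 degrees, and emit a pixel-level target map holding
    each patch's rotation class (0..3).
    """
    rng = np.random.default_rng()
    h, w = img.shape
    out = img.copy()
    target = np.zeros((h, w), dtype=np.int64)  # pixel-level supervision
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            rot = rng.integers(4)  # 0, 90, 180, or 270 degrees
            out[i:i + patch, j:j + patch] = np.rot90(
                img[i:i + patch, j:j + patch], k=rot)
            target[i:i + patch, j:j + patch] = rot
    # Feed `out` to the encoder-decoder and train it to predict
    # `target` per pixel (e.g. with pixel-wise cross-entropy).
    return out, target

img = np.arange(64, dtype=float).reshape(8, 8)
x, y = patch_rotation_task(img)
print(x.shape, y.shape)  # (8, 8) (8, 8)
```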
Experiments
Performance in mIoU on the VOCaug dataset
| Pre-training | Epochs | Decoder Head | mIoU | Note |
| :--- | :---: | :---: | ---: | :--- |
| RandomInit | - | MoCoFCN | 68.37% | 160k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9 |
| | | | 36.43% | 32k iters (first 30k in 160k) |
| Supervised-IN | - | MoCoFCN | 74.40% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile, MoCo report |
| | | PSPNet | 76.61% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.38% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 76.20% | 20k iters, lr from 1e-2 to 1e-4, poly lr decay, p=0.9, mmseg report |
| MoCo | 200 | MoCoFCN | 72.65% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.45% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | DeepLabv3+ | 74.61% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | Non-Local | 73.53% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| SimCLR | 200 | MoCoFCN | 69.92% | 30k iters, lr from 3e-3 to 3e-5, step lr decay, x0.1 at 70 and 90 percentile |
| MoCo v2 | 200 | MoCoFCN | 74.04% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 75.06% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | DeepLabv3+ | 76.11% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | 74.99% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| BYOL | 200 | MoCoFCN | 73.35% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | PSPNet* | 74.69% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 256 |
| | | PSPNet | 73.32% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, out channel = 512 (default) |
| MoCo + PLR | 200 + 20 (VOC) | MoCoFCN | 70.82% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay, average of two runs |
| | | PSPNet | 72.59% | 30k iters, lr from 3e-3 to 3e-5, cos lr decay |
| MoCo + EDMoCo | 200 + 20 (VOC) | MoCoFCN | 71.42% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay; pre-trained with MoCoFCN, same lr for all params |
| | | PSPNet | 73.59% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay; pre-trained with PSPNet, 1/100 base lr for encoder |
| | | Non-Local | | |
| MoCo + PLMoCo | 200 + 2 (IN) | MoCoFCN | 73.27% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | | PSPNet | 73.41% | 30k iters, lr from 5e-3 to 5e-5, cos lr decay |
| | | Non-Local | | |
| | 200 + 50 (VOC) | MoCoFCN | 71.67% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| | 200 + 200 (VOC) | MoCoFCN | 70.66% | 30k iters, lr from 3e-3 to 3e-5, linear lr decay |
| MoCo + PLEDMoCo | 200 + 20 (VOC) | MoCoFCN | | 30k iters, lr from 3e-3 to 3e-5, linear lr decay; pre-trained with MoCoFCN, 1/100 base lr for encoder |
| SimSiam | 100 | MoCoFCN | 69.82% | 30k iters, lr from 2e-3 to 2e-4, cos lr decay |
MoCoFCN is two layers of 3x3 convolution with dilation 6; DeepLabv3+ belongs to the same atrous-convolution family. PSPNet encodes context information via pooling at different scales. Non-Local encodes context information with attention, a form of dynamic routing.
