Workflow Summary
- Supervised ImageNet pretraining of the backbone (usually ResNet-50 as the encoder), followed by finetuning, reaches higher performance than the traditional end-to-end (from-scratch) method. That is,
$\mathrm{Perf}(\text{sup. pretrain}) > \mathrm{Perf}(\text{from scratch})$.
- Self-supervised approaches, like MoCo and SimCLR, provide somewhat weaker backbones, yet still reach higher performance than end-to-end training. That is,
$\mathrm{Perf}(\text{sup. pretrain}) \ge \mathrm{Perf}(\text{self-sup. pretrain}) > \mathrm{Perf}(\text{from scratch})$.
- Current models are based on the encoder-decoder style, where the pretrained backbone serves as the encoder of the complete model and participates in finetuning together with the decoder head. We believe that when the backbone is fixed, the more capacity the decoder has, the better the performance. In inequalities,
$\mathrm{Perf}(\text{fixed enc., larger dec.}) > \mathrm{Perf}(\text{fixed enc., smaller dec.})$.
If this shows considerable room for improvement (i.e. the performance gap is large), then this is a good start.
- The goal of this research is to improve self-supervised methods on subjects related to dense prediction. Therefore a desirable result would be a better self-supervised pretrained model, and possibly one that surpasses the supervised pretraining baseline. Good:
$\mathrm{Perf}(\text{ours}) > \mathrm{Perf}(\text{self-sup. baseline})$. Optimal:
$\mathrm{Perf}(\text{ours}) > \mathrm{Perf}(\text{sup. pretrain baseline})$.
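The fixed-backbone claim above can be sketched with a toy linear model (the shapes, data, and learning rate here are all made up for illustration, not the actual setup): a frozen "encoder" matrix stands in for the pretrained backbone, and only the decoder head receives gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))               # toy inputs
Y = X @ rng.normal(size=(8, 2))            # toy dense-prediction targets

W_enc = rng.normal(size=(8, 4))            # "pretrained backbone", kept frozen
W_dec = np.zeros((4, 2))                   # decoder head, trainable

H = X @ W_enc                              # frozen features, computed once
mse_before = np.mean((H @ W_dec - Y) ** 2)
for _ in range(200):
    grad = 2 * H.T @ (H @ W_dec - Y) / len(X)  # gradient w.r.t. decoder only
    W_dec -= 0.01 * grad                       # W_enc is never updated
mse_after = np.mean((H @ W_dec - Y) ** 2)
```

In a real framework the same pattern is "freeze the encoder parameters, optimize only the decoder head"; the residual error left after the decoder converges is what the performance gap above measures.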
Encoder-Decoder Pattern
MoCo
Pure encoder training
- Structure: Enc-Dec + (Avg Pooling)
- Loss: MoCo contrastive loss
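A minimal numpy sketch of the MoCo objective, assuming unit-normalized embeddings and the temperature/momentum values reported for MoCo (tau = 0.07, m = 0.999); the encoders themselves are omitted, and all shapes are illustrative.

```python
import numpy as np

def moco_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE: one query vs. one positive key plus a queue of negatives."""
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits = np.concatenate([[q @ k_pos], queue @ q]) / tau  # positive at index 0
    logits -= logits.max()                                   # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))  # cross-entropy

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder weights trail the query encoder by momentum m."""
    return m * theta_k + (1 - m) * theta_q

rng = np.random.default_rng(0)
q = rng.normal(size=128)
loss_easy = moco_loss(q, q, rng.normal(size=(1024, 128)))   # positive = query
loss_hard = moco_loss(q, -q, rng.normal(size=(1024, 128)))  # positive flipped
```

A matching positive key drives the loss toward zero, while a mismatched one is penalized heavily, which is exactly the pressure that shapes the encoder during pure encoder training.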
Dense Prediction — Segmentation
- Structure: Enc-Dec (perhaps pretraining together)
- Loss: pretext task (to be designed)
- Dataset: Segmentation (e.g. VOC)
Pretext Task Proposals
Patch-Level Rotation (+Color Distortions)
Divide an image into multiple regions of the same size to form patches, then apply an independent random rotation (and possibly color distortions) to each individual patch. Train via contrastive classification.
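The augmentation half of this proposal can be sketched as follows (patch size, grid, and image shape are arbitrary choices here, and color distortions are omitted): each patch is rotated independently by a random multiple of 90 degrees, and the rotation index is kept as its classification label.

```python
import numpy as np

def patchwise_rotate(img, patch, rng):
    """Rotate each non-overlapping patch of `img` by a random k * 90 degrees."""
    H, W = img.shape[:2]
    out = img.copy()
    labels = np.zeros((H // patch, W // patch), dtype=int)
    for i in range(H // patch):
        for j in range(W // patch):
            k = int(rng.integers(4))                 # rotation class in {0,1,2,3}
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            out[ys, xs] = np.rot90(img[ys, xs], k)   # rotate this patch only
            labels[i, j] = k
    return out, labels

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                   # toy 32x32 RGB image
aug, labels = patchwise_rotate(img, 8, rng)          # 4x4 grid of 8x8 patches
```

The `labels` grid then serves as a dense per-patch target, which suits dense-prediction pretraining better than a single image-level rotation label.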
Location-Based Instance Discrimination
- Location
  - patch-level instances
  - use convolutional layers and pooling to form a low-resolution image, such that each pixel of the low-res image represents a learnable transformation of a larger region of the original image (not exactly patch-level); each single pixel is then treated as an instance.
- Negative samples
  - other locations of the current image
  - the rest of the current batch, excluding the instance itself
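A single-image numpy sketch of this idea, with plain average pooling standing in for the learnable conv-and-pool stack and the batch dimension omitted: positives are the same location in two augmented views, and every other location of the image serves as a negative.

```python
import numpy as np

def pool(feat, f):
    """Non-overlapping f x f average pooling; feat has shape (H, W, C)."""
    H, W, C = feat.shape
    return feat.reshape(H // f, f, W // f, f, C).mean(axis=(1, 3))

def location_nce(feat1, feat2, f=8, tau=0.2):
    """Contrastive loss over low-res locations; positives on the diagonal."""
    z1 = pool(feat1, f).reshape(-1, feat1.shape[-1])  # (h*w, C) instances
    z2 = pool(feat2, f).reshape(-1, feat2.shape[-1])
    z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
    z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau              # row i: location i vs. all locations
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
view1 = rng.normal(size=(32, 32, 16))                # toy feature map
view2 = view1 + 0.1 * rng.normal(size=(32, 32, 16))  # mildly perturbed view
loss_pos = location_nce(view1, view2)                # aligned locations
loss_rand = location_nce(view1, rng.normal(size=(32, 32, 16)))
```

Extending this to batch-level negatives would concatenate the pooled locations of all images in the batch into the negative set, excluding each instance itself, as noted above.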
