Workflow Summary
- Supervised ImageNet pretraining of the backbone (usually ResNet-50 as the encoder), followed by finetuning, reaches higher performance than the traditional end-to-end (from-scratch) method. That is,
$\mathrm{Perf}(\text{sup. pretrain}) > \mathrm{Perf}(\text{from scratch})$.
- Self-supervised approaches, like MoCo and SimCLR, provide somewhat weaker backbones, yet still reach higher performance than end-to-end training. That is,
$\mathrm{Perf}(\text{sup. pretrain}) \ge \mathrm{Perf}(\text{self-sup. pretrain}) > \mathrm{Perf}(\text{from scratch})$.
- Current models are based on the encoder-decoder style, where the pretrained backbone serves as the encoder of the complete model and participates in finetuning together with the decoder head. We believe that when the backbone is fixed, the more capacity the decoder has, the better the performance. In inequalities,
$\mathrm{Perf}(\text{fixed enc., larger dec.}) > \mathrm{Perf}(\text{fixed enc., smaller dec.})$.
If this shows considerable room for improvement (i.e. the performance gap is large), then this is a good start.
- The goal of this research is to improve self-supervised methods on subjects related to dense prediction. Therefore a desirable result would be a better self-supervised pretrained model, and possibly one that surpasses the supervised pretraining baseline. Good:
$\mathrm{Perf}(\text{ours}) > \mathrm{Perf}(\text{self-sup. baseline})$. Optimal:
$\mathrm{Perf}(\text{ours}) > \mathrm{Perf}(\text{sup. pretrain baseline})$.
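The fixed-backbone claim above can be sketched with a toy linear model (the shapes, data, and learning rate here are all made up for illustration, not the actual setup): a frozen "encoder" matrix stands in for the pretrained backbone, and only the decoder head receives gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))               # toy inputs
Y = X @ rng.normal(size=(8, 2))            # toy dense-prediction targets

W_enc = rng.normal(size=(8, 4))            # "pretrained backbone", kept frozen
W_dec = np.zeros((4, 2))                   # decoder head, trainable

H = X @ W_enc                              # frozen features, computed once
mse_before = np.mean((H @ W_dec - Y) ** 2)
for _ in range(200):
    grad = 2 * H.T @ (H @ W_dec - Y) / len(X)  # gradient w.r.t. decoder only
    W_dec -= 0.01 * grad                       # W_enc is never updated
mse_after = np.mean((H @ W_dec - Y) ** 2)
```

In a real framework the same pattern is "freeze the encoder parameters, optimize only the decoder head"; the residual error left after the decoder converges is what the performance gap above measures.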
Encoder-Decoder Pattern
MoCo
Pure encoder training
- Structure: Enc-Dec + (Avg Pooling)
- Loss: MoCo contrastive loss
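A minimal numpy sketch of the MoCo objective, assuming unit-normalized embeddings and the temperature/momentum values reported for MoCo (tau = 0.07, m = 0.999); the encoders themselves are omitted, and all shapes are illustrative.

```python
import numpy as np

def moco_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE: one query vs. one positive key plus a queue of negatives."""
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits = np.concatenate([[q @ k_pos], queue @ q]) / tau  # positive at index 0
    logits -= logits.max()                                   # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))  # cross-entropy

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder weights trail the query encoder by momentum m."""
    return m * theta_k + (1 - m) * theta_q

rng = np.random.default_rng(0)
q = rng.normal(size=128)
loss_easy = moco_loss(q, q, rng.normal(size=(1024, 128)))   # positive = query
loss_hard = moco_loss(q, -q, rng.normal(size=(1024, 128)))  # positive flipped
```

A matching positive key drives the loss toward zero, while a mismatched one is penalized heavily, which is exactly the pressure that shapes the encoder during pure encoder training.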
Dense Prediction — Segmentation
- Structure: Enc-Dec (perhaps pretraining together)
- Loss: pretext task (to be designed)
- Dataset: Segmentation (e.g. VOC)
Pretext Task Proposals
Patch-Level Rotation (+Color Distortions)
Divide an image into multiple regions of the same size to form patches, then apply an independent random rotation (and possibly color distortions) to each individual patch. Train via contrastive classification.
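The augmentation half of this proposal can be sketched as follows (patch size, grid, and image shape are arbitrary choices here, and color distortions are omitted): each patch is rotated independently by a random multiple of 90 degrees, and the rotation index is kept as its classification label.

```python
import numpy as np

def patchwise_rotate(img, patch, rng):
    """Rotate each non-overlapping patch of `img` by a random k * 90 degrees."""
    H, W = img.shape[:2]
    out = img.copy()
    labels = np.zeros((H // patch, W // patch), dtype=int)
    for i in range(H // patch):
        for j in range(W // patch):
            k = int(rng.integers(4))                 # rotation class in {0,1,2,3}
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            out[ys, xs] = np.rot90(img[ys, xs], k)   # rotate this patch only
            labels[i, j] = k
    return out, labels

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                   # toy 32x32 RGB image
aug, labels = patchwise_rotate(img, 8, rng)          # 4x4 grid of 8x8 patches
```

The `labels` grid then serves as a dense per-patch target, which suits dense-prediction pretraining better than a single image-level rotation label.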
Location-Based Instance Discrimination
- Location
  - patch-level instances
  - use convolutional layers and pooling to form a low-resolution image, such that each pixel of the low-res image represents a learnable transformation of a larger region of the original image (not exactly patch-level); each single pixel is then treated as an instance.
- Negative samples
  - other locations of the current image
  - the rest of the current batch, excluding the instance itself
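A single-image numpy sketch of this idea, with plain average pooling standing in for the learnable conv-and-pool stack and the batch dimension omitted: positives are the same location in two augmented views, and every other location of the image serves as a negative.

```python
import numpy as np

def pool(feat, f):
    """Non-overlapping f x f average pooling; feat has shape (H, W, C)."""
    H, W, C = feat.shape
    return feat.reshape(H // f, f, W // f, f, C).mean(axis=(1, 3))

def location_nce(feat1, feat2, f=8, tau=0.2):
    """Contrastive loss over low-res locations; positives on the diagonal."""
    z1 = pool(feat1, f).reshape(-1, feat1.shape[-1])  # (h*w, C) instances
    z2 = pool(feat2, f).reshape(-1, feat2.shape[-1])
    z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
    z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau              # row i: location i vs. all locations
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
view1 = rng.normal(size=(32, 32, 16))                # toy feature map
view2 = view1 + 0.1 * rng.normal(size=(32, 32, 16))  # mildly perturbed view
loss_pos = location_nce(view1, view2)                # aligned locations
loss_rand = location_nce(view1, rng.normal(size=(32, 32, 16)))
```

Extending this to batch-level negatives would concatenate the pooled locations of all images in the batch into the negative set, excluding each instance itself, as noted above.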
