Key Points
- MoCo baseline verified. The concern is that two extra 3x3 convolutional layers may lack the capacity to encode information useful for segmentation performance, so we should try a more sophisticated decoder head such as U-Net or PSPNet.
- When finetuning, the backbone parameters should use a smaller learning rate than the randomly initialized decoder head, typically a 1:10 (or even 1:100) ratio of backbone lr to decoder head lr.
- Ideas for designing pretext tasks: treat several pixels in a region as a super-pixel and perform region- or patch-level instance discrimination, among others.
- A naive autoencoder is still worth exploring, or should at least be set up as a baseline model.
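The backbone/decoder learning-rate split can be expressed with PyTorch optimizer parameter groups. A minimal sketch; the toy model, 0.01 base lr, and SGD hyperparameters are placeholder assumptions, not settings from our experiments:

```python
import torch
import torch.nn as nn

# Toy stand-in: pretrained backbone + randomly initialized decode head.
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()),
    "decode_head": nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 21, 3, padding=1)),  # 21 = VOC classes
})

base_lr = 0.01  # decoder head learning rate (placeholder value)
optimizer = torch.optim.SGD(
    [
        # backbone at 1/10 of the head lr (the 1:10 ratio above)
        {"params": model["backbone"].parameters(), "lr": base_lr * 0.1},
        {"params": model["decode_head"].parameters(), "lr": base_lr},
    ],
    lr=base_lr, momentum=0.9, weight_decay=1e-4,
)
```

For a 1:100 ratio, change the multiplier to `0.01`; schedulers such as `PolyLR` scale each group's lr proportionally, so the ratio is preserved during training.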
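For the autoencoder baseline, a minimal convolutional autoencoder with a reconstruction objective could look like the following sketch; the channel sizes and 32x32 input are placeholder choices, not tuned settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    """Naive conv autoencoder: downsample to a bottleneck, then reconstruct."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 8 -> 16
            nn.ConvTranspose2d(16, 3, 2, stride=2),                # 16 -> 32
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.randn(4, 3, 32, 32)
recon = model(x)
loss = F.mse_loss(recon, x)  # per-pixel reconstruction loss
```

After pretraining, the encoder would be transferred as the segmentation backbone and evaluated the same way as the MoCo checkpoint, making the comparison direct.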
To-do’s
- Implement a U-Net head in mmseg, then use it as the decoder with ResNet-50 as the backbone encoder; run standard VOC training and validation.
- Try finetuning the backbone with a small learning rate, e.g. 1/10 of the base learning rate.
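The U-Net-head to-do could start from an mmseg-style config along these lines. This is a sketch under assumptions: `UNetHead` is a hypothetical custom head that would need to be implemented and registered in mmseg's `HEADS` registry first (upstream mmseg provides U-Net only as a backbone), and the `_base_` files and hyperparameters are the standard VOC defaults, not verified settings:

```python
# mmseg-style config sketch (hypothetical: 'UNetHead' must be implemented
# and registered before this config would run).
_base_ = [
    '../_base_/datasets/pascal_voc12.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_20k.py',
]
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='ResNet',
        depth=50,
        out_indices=(0, 1, 2, 3),  # expose all four stages for skip connections
        norm_cfg=norm_cfg),
    decode_head=dict(
        type='UNetHead',  # hypothetical custom head
        in_channels=[256, 512, 1024, 2048],  # ResNet-50 stage channels
        channels=64,
        num_classes=21,  # PASCAL VOC classes incl. background
        norm_cfg=norm_cfg,
        loss_decode=dict(type='CrossEntropyLoss', use_sigmoid=False)))
```

The MoCo-pretrained weights would then be loaded into the backbone via `init_cfg` or the `load_from` field, with the head left randomly initialized.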
