Weekly Report

Confirmed that the failure of patch-level rotation with 16 patches and 4 directions is caused by insufficient image resolution. In particular, the ResNet50 encoder downsamples the input image by a total stride of 32, which makes it hard for the decoder to classify the rotation directions correctly.
Patch-level rotation with 4 patches and 2 directions (0°, 180°) achieves the highest performance of all pretext tasks tried on the VOC dataset.
The better the ResNet50 encoder performs on the patch-level rotation task, the worse it performs on the segmentation task.
Using an encoder-decoder as the base encoder for MoCo performs poorly on segmentation.
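The resolution argument in the first point can be made concrete with a quick calculation. The sketch below assumes a square 224×224 input and that the encoder sees the whole rotated image at once; neither the input size nor that detail is stated in this report, and `cells_per_patch` is a hypothetical helper, not part of the pipeline.

```python
import math


def cells_per_patch(image_side: int, n_patches: int, stride: int = 32) -> float:
    """Nominal side length, in feature-map cells, that one patch covers at the encoder output.

    Hypothetical helper. Assumes a square image split into a square grid of patches
    (4 patches -> 2x2 grid, 16 patches -> 4x4 grid) and a total encoder stride of 32,
    as for the last stage of ResNet50.
    """
    grid = math.isqrt(n_patches)          # patches per side
    if grid * grid != n_patches:
        raise ValueError("n_patches must be a perfect square")
    patch_side = image_side / grid        # patch side in pixels
    return patch_side / stride            # patch side in feature-map cells


# Assumed 224x224 input; the real pretext-task resolution is not given in the report.
for n in (4, 16):
    print(n, cells_per_patch(224, n))
```

Under these assumptions, a 4×4 grid leaves each patch covering less than 2×2 cells of the final feature map, while a 2×2 grid leaves 3.5×3.5 cells per patch.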
Test Report
Patch Level Rotation Pipeline
Good News

Two tests were run under different settings, by accident. PLR-1p-2d denotes patch-level rotation (PLR) with 1 patch (the whole image) and 2 directions (0°, 180°). PLR-4p-2d denotes PLR with 4 patches and 2 directions. Performance seems OK.
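To illustrate what the PLR-Np-Kd naming refers to, here is a minimal NumPy sketch of how one PLR training pair could be built: the image is split into a grid of patches, each patch is rotated by a randomly chosen direction, and a per-pixel label map records which direction was applied. This is an assumption about the setup, not the pipeline's actual code; `make_plr_sample` and its arguments are hypothetical.

```python
import numpy as np


def make_plr_sample(image, grid, directions=(0, 180), rng=None):
    """Build one hypothetical patch-level rotation (PLR) training pair.

    image: H x W x C array with H and W divisible by `grid`.
    grid: patches per side (1 -> whole image as in PLR-1p-2d, 2 -> 4 patches as in PLR-4p-2d).
    directions: allowed angles in degrees; multiples of 90. Angles of 90/270 need square patches.
    Returns (rotated image, per-pixel label map of direction indices).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    out = image.copy()
    labels = np.zeros((h, w), dtype=np.int64)
    for i in range(grid):
        for j in range(grid):
            k = int(rng.integers(len(directions)))      # index of the sampled direction
            rows = slice(i * ph, (i + 1) * ph)
            cols = slice(j * pw, (j + 1) * pw)
            # np.rot90 rotates the first two axes counter-clockwise, 90 degrees per step
            out[rows, cols] = np.rot90(image[rows, cols], k=directions[k] // 90)
            labels[rows, cols] = k                       # every pixel of the patch gets its label
    return out, labels


img = np.arange(16 * 16 * 3, dtype=np.int64).reshape(16, 16, 3)
rotated, label_map = make_plr_sample(img, grid=2, rng=np.random.default_rng(0))
```

With `directions=(0, 180)` the task is binary per patch, and a dense label map like this one lets an encoder-decoder predict the direction at every pixel.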
Bad News
Higher resolution + longer training on the pretext task => lower performance
Encoder-Decoder MoCo Pipeline

Segmentation on the VOC dataset
Encoder-Decoder model: ResNet50 + FCN8s
Optimizer: SGD, lr = 0.01, min_lr = 0.0001, poly decay with power = 0.9, 20k iterations.
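The schedule above can be sketched as a function of the iteration count. The report does not give the exact formula, so this assumes a common poly form that decays from the base learning rate to `min_lr` over the 20k iterations; `poly_lr` is a hypothetical name.

```python
def poly_lr(iteration: int, max_iters: int = 20_000, base_lr: float = 0.01,
            min_lr: float = 1e-4, power: float = 0.9) -> float:
    """Poly learning-rate decay with a floor at min_lr (assumed form):

    lr = (base_lr - min_lr) * (1 - iteration / max_iters) ** power + min_lr
    """
    coeff = (1 - iteration / max_iters) ** power   # goes from 1 at the start to 0 at max_iters
    return (base_lr - min_lr) * coeff + min_lr


for it in (0, 5_000, 10_000, 15_000, 20_000):
    print(it, round(poly_lr(it), 6))
```

In this form the learning rate starts at 0.01, decays smoothly, and reaches exactly `min_lr` at the final iteration rather than zero.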

Self-supervised Learning for Dense Prediction