Workflow Summary

  1. Supervised ImageNet pretraining of the backbone (usually ResNet-50 as the encoder) reaches higher performance than traditional end-to-end training. That is 20200912 - Figure 1.
  2. Self-supervised approaches, like MoCo and SimCLR, provide weaker backbones yet still reach higher performance than end-to-end training. That is 20200912 - Figure 2.
  3. Current models follow an encoder-decoder style, where the pretrained backbone serves as the encoder of the complete model and participates in fine-tuning together with the decoder head. We believe that when the backbone is fixed, the more capacity the decoder can learn, the better the performance. In inequalities, 20200912 - Figure 3. If there is considerable room for improvement (i.e., the performance gap is large), this is a promising starting point.
  4. The goal of this research is to improve self-supervised methods for dense prediction tasks. A desirable result would therefore be a better self-supervised pretrained model, and ideally one that surpasses the supervised pretraining baseline. Good: 20200912 - Figure 4. Optimal: 20200912 - Figure 5.
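
The contrastive objective behind MoCo and SimCLR mentioned above can be sketched as an InfoNCE loss for a single query against one positive and K negatives; this is a minimal numpy sketch with illustrative dimensions and temperature, not either method's actual implementation.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query embedding.

    query:     (C,)  embedding of one view
    positive:  (C,)  embedding of the other view of the same instance
    negatives: (K, C) embeddings of other instances
    All embeddings are L2-normalized before computing cosine logits.
    """
    q = query / np.linalg.norm(query)
    k_pos = positive / np.linalg.norm(positive)
    k_neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

Minimizing this loss pulls the two views of the same instance together and pushes all negatives away, which is the shared core of both MoCo and SimCLR; they differ mainly in where the negatives come from (a queue vs. the current batch).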

    Encoder-Decoder Pattern

    MoCo

    Pure encoder training
  • Structure: Enc-Dec + (Avg Pooling)
  • Loss: MoCo contrastive loss
  • Dataset: ImageNet-1M
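
Two mechanisms distinguish MoCo's pure encoder training: the key encoder is updated only by a momentum moving average of the query encoder, and negatives come from a fixed-size queue. A minimal numpy sketch, with toy parameter shapes and queue size as assumptions:

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    """MoCo key-encoder update: theta_k <- m * theta_k + (1 - m) * theta_q.
    The key encoder receives no gradients, only this moving average."""
    return [m * pk + (1.0 - m) * pq for pq, pk in zip(q_params, k_params)]

def update_queue(queue, new_keys, max_size=65536):
    """Enqueue the newest keys and drop the oldest, so the dictionary of
    negatives keeps a fixed size decoupled from the batch size."""
    queue = np.concatenate([queue, new_keys], axis=0)
    return queue[-max_size:]
```

The large decoupled queue is what lets MoCo use many more negatives than fit in one batch, at the cost of the keys being slightly stale (hence the slow momentum update).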

    Dense Prediction — Segmentation

  • Structure: Enc-Dec (possibly pretrained jointly)
  • Loss: pretext task (to be designed)
  • Dataset: Segmentation (e.g. VOC)
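
The Enc-Dec pattern for dense prediction can be illustrated with a toy numpy sketch: average pooling stands in for the pretrained backbone, and a 1x1 channel-mixing "conv" plus nearest-neighbor upsampling stands in for the decoder head. All shapes and the weight matrix are illustrative assumptions, not a real segmentation model.

```python
import numpy as np

def encode(img, stride=4):
    """Toy encoder: stride-s average pooling over (H, W, C),
    standing in for a pretrained backbone that downsamples the input."""
    h, w, c = img.shape
    return img[:h - h % stride, :w - w % stride].reshape(
        h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))

def decode(feat, weight, out_hw):
    """Toy decoder: 1x1 'conv' (channel mixing with `weight`) followed by
    nearest-neighbor upsampling back to per-pixel class logits."""
    logits = feat @ weight                      # (h', w', n_classes)
    rep_h = out_hw[0] // feat.shape[0]
    rep_w = out_hw[1] // feat.shape[1]
    return np.repeat(np.repeat(logits, rep_h, axis=0), rep_w, axis=1)
```

The point of the sketch is the interface: the encoder produces a low-resolution feature map from the image, and the decoder maps it back to a full-resolution dense prediction, which is exactly where a pretrained backbone plugs in.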

    Pretext Task Proposals

    Patch-Level Rotation (+Color Distortions)

    Divide an image into multiple equally sized regions to form patches, then apply an independent random rotation (and possibly color distortion) to each individual patch.
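
A minimal numpy sketch of this pretext task (patch size and the restriction to 90-degree rotations are assumptions; color distortion is omitted): each tile gets an independent random rotation, and the per-patch rotation labels would serve as the prediction targets.

```python
import numpy as np

def patch_rotate(img, patch=8, rng=None):
    """Split img (H, W, C) into patch x patch tiles and rotate each tile by
    an independent random multiple of 90 degrees.
    Returns the augmented image and the per-patch rotation labels (0-3)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = img.shape
    out = img.copy()
    labels = np.zeros((h // patch, w // patch), dtype=int)
    for i in range(h // patch):
        for j in range(w // patch):
            k = rng.integers(4)  # 0 -> 0, 1 -> 90, 2 -> 180, 3 -> 270 degrees
            labels[i, j] = k
            tile = out[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            out[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = np.rot90(tile, k)
    return out, labels
```

Because the labels are per-patch rather than per-image, the pretext loss supervises every spatial location, which is the property that makes it a candidate for dense-prediction pretraining.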

    Contrastive Classification

    Location-Based Instance Discrimination

  1. Location
    1. patch-level instances
    2. use convolutional layers and pooling to build a low-resolution feature map, such that each pixel of the low-res map is a learnable transformation of a larger region of the original image (not exactly patch-level); each such pixel is then treated as an instance.
  2. Negative sample
    1. other locations in the current image
    2. other instances in the current batch, excluding the location itself
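
The location-based instance discrimination above can be sketched as a per-location InfoNCE over two views of the same low-resolution feature map: location i in view A is positive with location i in view B, and every other location is a negative (item 1 of the negative samples; extending the logit matrix with other batch instances, item 2, is a straightforward concatenation). Shapes and temperature are illustrative assumptions.

```python
import numpy as np

def dense_info_nce(feats_a, feats_b, temperature=0.1):
    """Per-location contrastive loss over two feature maps of shape (H, W, C).

    Each spatial location is treated as an instance: after flattening to
    (H*W, C) and L2-normalizing, the diagonal of the similarity matrix holds
    the positive pairs and all off-diagonal entries are within-image negatives.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a = norm(feats_a.reshape(-1, feats_a.shape[-1]))
    b = norm(feats_b.reshape(-1, feats_b.shape[-1]))
    logits = a @ b.T / temperature                    # (N, N)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # cross-entropy on diagonal
```

Unlike image-level instance discrimination, this loss produces a gradient at every spatial position of the feature map, matching the dense-prediction goal stated above.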