Background and Significance
Why self-supervised representation learning
The ability to learn with little or no supervision is a key challenge in utilizing the vast amounts of unlabeled data in the real world. First, high-quality labeled datasets are expensive to obtain, while unlabeled data can be collected easily. Second, labeled datasets are usually small and may not reflect the true distribution of real-world data due to their limited sample space. A main purpose of unsupervised learning is to pre-train representations, i.e. features, that can be transferred to downstream tasks by fine-tuning. A good image representation is desirable in computer vision because it enables efficient training and performance gains on downstream tasks. Recent research shows that the gap between fully unsupervised and supervised learning is closing. Moreover, with proper fine-tuning on downstream tasks, a model pre-trained on a larger set of uncurated unlabeled data often outperforms its supervised counterpart on vision tasks such as classification, detection, and segmentation. It is now generally believed that forming good representations of visual signals helps extract useful information when building classifiers or other predictors later on.
Related Work
Contrastive Method
MoCo
MoCo (He et al., 2019) follows a simple instance discrimination task: a query matches a key if they are encoded views (e.g. different augmentations) of the same image. From a different perspective, this is a dictionary look-up problem: an encoded query should be similar to its matching key and dissimilar to others (i.e. negative samples). Learning is formulated as minimizing a contrastive loss, the InfoNCE loss

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$

where $\tau$ is a temperature hyper-parameter. The sum is over one positive and $K$ negative samples; the loss is the log loss of a $(K+1)$-way softmax-based classifier that tries to classify $q$ as $k_+$. MoCo argues that it is desirable to build dictionaries that are 1) large and 2) consistent as they evolve during training. So MoCo proposes 1) a FIFO queue that enables the dictionary to be large and keeps keys consistent by dequeuing the oldest key (the least consistent key), and 2) a momentum update $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$ for the key encoder to maintain consistency of keys.
https://www.notion.so/ztang/MoCo-b9817c1bcb0f4904bd93521cf6a1580b
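As a concrete illustration, the InfoNCE loss and the FIFO dictionary above can be sketched in plain Python. This is a toy sketch, not MoCo's implementation: vectors are assumed to be L2-normalized, and the queue size and example vectors are made up.

```python
import math
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(q, k_pos, negatives, temperature=0.07):
    """InfoNCE loss for one query: the log loss of a (K+1)-way softmax
    classifier whose correct class is the positive key."""
    logits = [dot(q, k_pos) / temperature]                   # positive logit first
    logits += [dot(q, k) / temperature for k in negatives]   # K negative logits
    m = max(logits)                                          # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder parameters move slowly toward the query encoder's."""
    return [m * pk + (1 - m) * pq for pk, pq in zip(theta_k, theta_q)]

# FIFO dictionary: a full deque silently drops its oldest (least consistent) key.
queue = deque(maxlen=3)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.6, -0.8], [0.0, -1.0]):
    queue.append(key)

loss = info_nce(q=[1.0, 0.0], k_pos=[0.995, 0.0998], negatives=queue)
```

A query whose positive key really is a nearby view gets a lower loss than one paired with an unrelated key, which is exactly the dictionary look-up intuition.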
SimCLR
SimCLR (Chen et al., 2020) learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. They randomly sample a minibatch of $N$ examples and apply two different augmentations to each image, resulting in $2N$ data points; each augmented image has one positive pair and $2(N-1)$ negative pairs. They define their contrastive loss (NT-Xent, similar to MoCo's) for a positive pair of examples $(i, j)$ as

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

with the cosine similarity $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$. SimCLR mainly shows that 1) the composition of data augmentations plays a critical role in defining effective predictive tasks, 2) introducing a learnable nonlinear transformation (an MLP with one hidden layer as projection head) between the representation and the contrastive loss substantially improves the quality of the learned representations, and 3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
https://www.notion.so/ztang/SimCLR-e0dbe9c4e64f4a60b306bf6663e1cc98
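A minimal plain-Python sketch of NT-Xent for one positive pair follows. The vectors and temperature are toy values for illustration; real implementations compute this over a whole batch on GPU.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nt_xent(z, i, j, temperature=0.5):
    """NT-Xent loss for the positive pair (z[i], z[j]) among 2N augmented views.
    The denominator sums over all 2N - 1 pairs involving i, excluding (i, i)."""
    denom = sum(math.exp(cosine(z[i], z[k]) / temperature)
                for k in range(len(z)) if k != i)
    pos = math.exp(cosine(z[i], z[j]) / temperature)
    return -math.log(pos / denom)

# 2N = 4 views from N = 2 images: (z[0], z[1]) and (z[2], z[3]) are positive pairs.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pair_loss = nt_xent(z, 0, 1) + nt_xent(z, 1, 0)  # the pair loss is symmetrised
```

Treating the true augmented sibling as the positive yields a lower loss than treating a view of a different image as the positive, which is what the objective rewards.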
BYOL
BYOL (Grill et al., 2020) introduces an RL-like method that achieves high performance without negative pairs. It relies on an online network and a target network. BYOL defines $f_\theta$ as the online representation encoder, $g_\theta$ as the online projection head, $q_\theta$ as the online prediction layer, $f_\xi$ as the target representation encoder (updated via the moving average $\xi \leftarrow \tau\xi + (1 - \tau)\theta$), $g_\xi$ as the target projection head, and $\mathrm{sg}$ as the stop-gradient operator. $q_\theta$ serves as a prediction function that tries to match the output of the online network to the (stop-gradient) output of the target network, using an MSE loss on the L2-normalized projections. One interesting finding is that even with a fixed, randomly initialized target network, the online network is still able to learn a representation. The performance gain compared to a randomly initialized network under the linear evaluation protocol is non-trivial. The ablation study shows that BYOL is more resilient to changes in hyper-parameters and choices of image augmentations.
https://www.notion.so/ztang/BYOL-0f31c1f997a147d19c37365656a1952b
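The two ingredients, the EMA target update and the MSE between normalized vectors, can be sketched as follows. The decay rate 0.996 matches the paper's base value; the vectors and iteration count are illustrative.

```python
import math

def ema_update(xi, theta, tau=0.996):
    """Target parameters xi slowly track the online parameters theta."""
    return [tau * x + (1 - tau) * t for x, t in zip(xi, theta)]

def byol_loss(prediction, target):
    """MSE between the L2-normalized online prediction and the
    (stop-gradient) target projection; equals 2 - 2 * cosine similarity."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    p, t = normalize(prediction), normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

xi = [0.0, 0.0]
for _ in range(100):                 # the target lags behind the online network
    xi = ema_update(xi, theta=[1.0, -1.0])

loss = byol_loss([1.0, 0.2], [1.0, 0.0])
```

Because both vectors are normalized, the loss only cares about direction: a prediction pointing the same way as the target gives zero loss regardless of magnitude.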
SwAV
SwAV (Caron et al., 2020) proposes a “swapped” prediction mechanism that predicts the cluster assignment of one view from the representation of another view. They also propose a new data augmentation strategy, multi-crop, which uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. They argue that their method is more memory efficient and can scale to unlimited amounts of data. The ablation study shows that multi-crop is a beneficial image augmentation technique across self-supervised contrastive methods.
https://www.notion.so/ztang/SwAV-15d34581df764dc8b1b526fed72b54cf
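A toy sketch of the swapped prediction loss: each view's embedding is scored against a set of prototypes, and each view must predict the other view's cluster code. The codes here are hand-picked; SwAV actually computes them online with the Sinkhorn-Knopp algorithm, which is omitted.

```python
import math

def softmax(scores, temperature=0.1):
    m = max(scores)                        # max-shift for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(q, p):
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

def swapped_loss(scores_1, scores_2, q1, q2):
    """Predict view 2's cluster code from view 1's prototype scores, and vice versa."""
    p1, p2 = softmax(scores_1), softmax(scores_2)
    return cross_entropy(q2, p1) + cross_entropy(q1, p2)

# scores_i: similarity of view i's embedding to each of 3 prototypes.
s1, s2 = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]
matched    = swapped_loss(s1, s2, q1=[1.0, 0.0, 0.0], q2=[1.0, 0.0, 0.0])
mismatched = swapped_loss(s1, s2, q1=[0.0, 1.0, 0.0], q2=[0.0, 1.0, 0.0])
```

Two views of the same image score highest on the same prototype, so consistent codes give a much lower swapped loss than inconsistent ones.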
Preliminary Suppositions and Implications
What makes representations good
Learning latent representations is to find a function $f: \mathcal{X} \rightarrow \mathcal{Z}$ that maps original data points $x \in \mathcal{X}$ to a hidden vector space $\mathcal{Z}$ with certain properties.
- Smoothness:
The main idea behind contrastive learning: if $x \approx y$, then $f(x) \approx f(y)$. Generalization is mostly achieved by a form of local interpolation between neighboring training samples.
- Natural Clustering / Linear Separability (?)
Without focusing on specific features or parts of the image, this generally means that representations of the same class cluster together and that different classes are (close to) linearly separable. Recent results suggest this property improves classification performance.
- Compactness
The representation vector should be low dimensional: high enough to encode intra-class differences, but low enough to require fewer resources.
- Independent Columns and High Rank
Orthogonal (or close to orthogonal) features and high rank maximize the information stored in limited dimensions. Representations with this property can be memory efficient. Projecting representations into this kind of vector space is like distillation.
- Shared Factors Across Tasks (main motivation)
Recent success in pre-training then fine-tuning empirically shows this, but this is hard to explain.
How to learn good representations
It could be argued that, in theory, labels are not needed to learn most vision tasks. Objects differ from each other by nature; this argument alone could explain how humans distinguish different objects. For example, humans developed a systematic way of classifying biological species without supervision: people learned that some animals are so different from each other that reproductive isolation occurred, and called them two different species. Contrastive learning is based on a similar idea of learning the differences between objects.
BYOL introduces a fascinating finding that a fixed, randomly initialized network can still serve as a target for the online network and provide considerable improvement. It is still unclear why this happens, and the paper lacks an explanation of this phenomenon. My explanation is that since random initializations differ by nature, the network could converge to a different representation function under each initialization. Put in a human context, the different representations could be various understandings of the same object, perhaps like the diverse languages that evolved in different regions: the forms differ, but the functions are similar.
It is hard to define when two images are roughly the same and when they are different. In contrastive learning, two augmented views of the same image are considered approximations of each other. The choice of image augmentations therefore mainly defines what “similar” images mean. It is not surprising that the quality of the learned representations is heavily affected by the choice of image augmentations.
For natural clustering, recent works like SimCLR and BYOL empirically show that representations that tend to be linearly separable generally perform well in semi-supervised learning and transfer learning.
For compactness, independent columns, and high rank, we want the learned representation to encode as many actual features as possible without taking up too much space. If the representation vectors span as large a subspace as possible, information density is high; a polarized spectrum of eigenvalues/singular values (a few dominant directions) indicates redundancy and is undesirable.
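To make the independent-columns point concrete, here is a toy check (a hypothetical helper, plain Python) that measures how correlated the feature columns of a small representation matrix are; a value near zero means near-orthogonal columns and hence a higher effective rank.

```python
import math

def gram_offdiag(rows):
    """Mean absolute cosine similarity between the feature columns of a
    representation matrix (rows = representation vectors)."""
    cols = list(zip(*rows))                    # transpose: one tuple per column
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    n = len(cols)
    off = [abs(cos(cols[i], cols[j])) for i in range(n) for j in range(i + 1, n)]
    return sum(off) / len(off)

# Decorrelated columns use the available dimensions; collapsed ones waste them.
decorrelated = gram_offdiag([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
collapsed    = gram_offdiag([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

The collapsed matrix stores the same feature twice, so its columns are perfectly correlated and one dimension is wasted.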
Generative pre-training and BERT-style pre-training have proven helpful in many language tasks. Image GPT shows that Transformers can also do well in visual representation learning, but reaching the same (or worse) level of performance as contrastive methods requires an order of magnitude more computing power.
Method
- Could a good representation already exist in the middle layers of a ResNet? Take ResNet-50 as an example: currently we use all 50 convolutional layers to get a representation. We could experiment with taking only the first half of the convolutional layers, attaching fully connected layers on top, and then using BYOL-style training with it as the target network.
- Generative method. We can view segmentation as a generative task, where we get an input image and we are asked to generate an alpha map of different instances.
- Use an energy-based model in an auto-encoder way. We input an image x and an attention map a, and let it output a representation h. Then we input x and h to recover the original attention map. (More info in the energy-based method section.)
- Apply the RL-like training style introduced in BYOL, with the target network initialized to output a specific representation (e.g. the GloVe embedding of the class name) instead of a random vector; the network should be able to converge to a representation that encodes the semantic meanings of classes. (This does not seem to be unsupervised.)
Energy-based Method
Contrastive
In short: pick points to push up.
- Push down on the energy of data points, push up everywhere else: maximum likelihood (needs a tractable partition function or a variational approximation).
- Push down on the energy of data points, push up on chosen locations: maximum likelihood with MC/MCMC/HMC, contrastive divergence, metric learning (e.g. MoCo, SimCLR, BYOL, SwAV), ratio matching, noise contrastive estimation, minimum probability flow, adversarial generators/GANs.
- Train a function that maps points off the data manifold to points on the data manifold: denoising auto-encoders, masked auto-encoders (e.g. BERT).
Architectural
In short: limit the information capacity of the representation.
- Build the machine so that the volume of low-energy stuff is bounded: PCA, K-means, Gaussian mixture models, square ICA, etc.
- Use a regularization term that measures the volume of space that has low energy: sparse coding, sparse auto-encoders, LISTA, variational auto-encoders.
- F(x, y) = C(y, G(x, y)); make G(x, y) as “constant” as possible with respect to y: contracting auto-encoders, saturating auto-encoders.
- Minimize the gradient and maximize the curvature around data points: score matching.
