Momentum Contrast (MoCo) for Unsupervised Representation Learning

Nov 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for Momentum Contrast for Unsupervised Visual Representation Learning [1].

While unsupervised representation learning is highly successful in NLP, as shown by GPT and BERT, supervised pre-training is still dominant in computer vision. A conjecture is that language tasks have discrete signal spaces for building tokenized dictionaries, while computer vision concerns dictionary building in a continuous, high-dimensional space. In this work, the authors presented Momentum Contrast (MoCo) for unsupervised visual representation learning. They built a dynamic dictionary with a queue and a moving-averaged encoder. Experimental results showed that MoCo can largely close the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet pre-training in several applications.

MoCo

Contrastive learning as dictionary look-up. Consider an encoded query $q$ and a set of encoded samples $\{k_0, k_1, k_2, \dots\}$ that are the keys of a dictionary. Assume there is a single key $k_+$ in the dictionary that $q$ matches. A form of contrastive loss function [2], called InfoNCE [3], is considered in this paper: \[ \mathcal{L}_q = - \log \frac{\exp (q \cdot k_+ / \tau)}{\sum_{i=0}^K \exp(q \cdot k_i / \tau)} \] where $\tau$ is a temperature hyper-parameter and the sum runs over one positive and $K$ negative samples. The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys.
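As a rough sketch of how this loss can be computed (assuming PyTorch, L2-normalized features, a batch of $N$ queries q, their positive keys k_pos, and a $C \times K$ matrix queue of cached negative keys; all names here are illustrative, not the authors' code):

import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, tau=0.07):
    # Positive logits q . k_+ for each query, shape N x 1.
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)
    # Negative logits against the K keys in the queue, shape N x K.
    l_neg = torch.einsum('nc,ck->nk', q, queue)
    # Concatenate (positive first) and scale by the temperature tau.
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key sits at index 0, so InfoNCE reduces to a
    # (K+1)-way cross-entropy with label 0 for every query.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)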


Momentum Contrast. The hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.

  • Dictionary as a queue. By maintaining the dictionary as a queue of data samples, the encoded keys from the immediately preceding mini-batches can be reused. The dictionary size can then be much larger than a typical mini-batch size, and the extra computation of maintaining this dictionary is manageable.
  • Momentum update. Using a queue makes it intractable to update the key encoder by back-propagation. A naive solution is to copy the key encoder $f_k$ from the query encoder $f_q$ at every step, ignoring the gradient. However, this rapidly changes the key encoder, which reduces the consistency of the key representations and yields poor results in experiments. The authors therefore proposed a momentum update to address this issue. Denote the parameters of $f_k$ as $\theta_k$ and those of $f_q$ as $\theta_q$. We update $\theta_k$ by \[ \theta_k \leftarrow m\theta_k + (1-m)\theta_q \] where $m \in [0, 1)$ is a momentum coefficient. In experiments, a large momentum of $m=0.999$ works much better than $m=0.9$, suggesting that a slowly evolving key encoder is core to making use of a queue (see the sketch after this list).
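A minimal sketch of these two components, assuming PyTorch; f_q and f_k are hypothetical encoder modules with identical architectures, queue is a $C \times K$ tensor of cached keys, and ptr tracks the oldest position in the queue:

import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    # Replace the oldest keys in the C x K queue with the newest
    # mini-batch of keys (shape N x C).
    batch_size = keys.size(0)
    K = queue.size(1)
    queue[:, ptr:ptr + batch_size] = keys.t()
    # Advance the pointer; assumes K is divisible by the batch size.
    return (ptr + batch_size) % K

Only the query encoder receives gradients; the key encoder and the queue are updated with these two functions after each training step.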

Experimental Results

Comparison of three contrastive loss mechanisms.


Ablation study on momentum.

References

[1] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 2020.

[2] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR, 2006.

[3] A. van den Oord, Y. Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
