Masked Autoencoders Are Scalable Vision Learners

Nov 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for Masked Autoencoders Are Scalable Vision Learners [1].

In this work, the authors present a simple, effective, and scalable form of masked autoencoder (MAE) for visual representation learning. The MAE masks random patches of the input image and reconstructs the missing patches in pixel space. By masking a high proportion of the input image, e.g., 75%, the authors accelerate training by 3x or more and improve accuracy.

Introduction

What makes masked autoencoding different between vision and language?

  • Until recently, the architectures were different. CNNs were dominant in vision over the last decade. This architectural gap has since been addressed with the introduction of ViT [2].
  • Information density is different. Images are natural signals with heavy spatial redundancy: a missing patch can be recovered from neighboring patches with little high-level understanding. To overcome this difference and encourage learning useful features, the authors propose to mask a very high proportion of random patches.
  • The autoencoder's decoder plays a different role. While in BERT the decoder can be trivial (an MLP), the decoder design for vision tasks plays a key role in determining the semantic level of the learned latent representations.

[Figure: Example results on COCO validation images, using an MAE trained on ImageNet.]

Approach

The proposed masked autoencoder (MAE) is a simple autoencoding approach that reconstructs the original signal given its partial observation: the encoder maps the observed signal to a latent representation, and a decoder reconstructs the original signal from that latent representation.

[Figure: Overview of the MAE architecture.]

Masking. Following ViT, the image is divided into non-overlapping patches. A random subset of patches is sampled (without replacement) to remain visible, and the remaining patches are masked out.
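
To make the masking step concrete, below is a minimal PyTorch sketch of per-sample random masking via argsort over uniform noise; the function name `random_masking` and the shuffle-and-restore bookkeeping are illustrative, not the authors' released code.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Randomly keep a subset of patches per sample.

    patches: (B, N, D) tensor of patch embeddings.
    Returns the kept (visible) patches, a binary mask (1 = masked),
    and the indices needed to restore the original patch order.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)  # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]             # first num_keep stay visible
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original order
    return visible, mask, ids_restore
```

Sorting random noise is a standard trick to draw a per-sample permutation (i.e., sampling without replacement) in a single batched operation.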

MAE encoder. The encoder is a ViT, but applied only to the visible patches. Pre-training is efficient since the encoder operates on only a small subset (e.g., 25% with a 75% masking ratio) of all patches.
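
A sketch of the encoder step under the same assumptions: the visible tokens (with positional embeddings already added, so each token keeps its original location information) are fed to a standard Transformer stack, with `nn.TransformerEncoder` standing in for the ViT blocks.

```python
import torch.nn as nn

class MAEEncoder(nn.Module):
    """ViT-style encoder applied only to the visible tokens."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, visible_tokens):
        # visible_tokens: (B, ~0.25 * N, dim); since self-attention cost
        # grows with the token count, dropping 75% of the patches is where
        # the large pre-training speedup comes from.
        return self.blocks(visible_tokens)
```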

MAE decoder. The decoder operates on the full set of tokens: the encoded visible patches plus learned mask tokens. It is designed to be much narrower and shallower than the encoder, which significantly reduces pre-training time.
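
A sketch of how the decoder might be wired, assuming the asymmetric design described above: the encoder output is projected to a narrower width, padded back to the full token set with a shared learned mask token, unshuffled to the original patch order, and decoded to per-patch pixels. All dimensions are illustrative defaults, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Narrow, shallow decoder that reconstructs all patches."""
    def __init__(self, enc_dim=768, dim=512, depth=8, heads=16,
                 num_patches=196, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dim)        # encoder width -> decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_pixels)   # predict pixels per patch

    def forward(self, enc_tokens, ids_restore):
        x = self.proj(enc_tokens)
        B, n_vis, D = x.shape
        n_masked = ids_restore.shape[1] - n_vis
        mask_tokens = self.mask_token.expand(B, n_masked, D)
        x = torch.cat([x, mask_tokens], dim=1)
        # Unshuffle so every token sits at its original patch position.
        x = torch.gather(x, 1,
                         ids_restore.unsqueeze(-1).expand(-1, -1, D))
        x = self.blocks(x + self.pos_embed)
        return self.head(x)                        # (B, N, patch_pixels)
```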

Reconstruction target. The loss is the mean squared error (MSE) between the reconstructed and original images in pixel space, computed on the masked patches only. Using the normalized pixel values of each patch as the reconstruction target improves representation quality.
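
A sketch of the reconstruction loss under these choices: targets are per-patch normalized pixel values, and the MSE is averaged over masked patches only (`mask` is the 1-for-masked map returned by the masking sketch above).

```python
import torch

def mae_loss(pred, target, mask, eps=1e-6):
    """MSE over masked patches with per-patch normalized targets.

    pred, target: (B, N, patch_pixels); mask: (B, N), 1 = masked.
    """
    # Normalize each target patch by its own mean and std.
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()

    loss = (pred - target).pow(2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()      # masked patches only
```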

Experimental Results

[Figure: Validation accuracy on ImageNet-1K using different masking ratios.]

[Figure: Comparison with supervised pre-training.]

References

[1] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked Autoencoders Are Scalable Vision Learners. arXiv preprint arXiv:2111.06377, 2021.

[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
