MaskFeat

Dec 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for Masked Feature Prediction for Self-Supervised Visual Pre-Training [1].

In this work, the authors present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. The approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. MaskFeat achieves competitive results on ImageNet and state-of-the-art results on the Kinetics datasets, AVA, and SSv2.

Introduction

One essential difference between vision and language is that vision has no pre-existing vocabulary to shape the prediction task into a well-defined classification problem. One immediate solution is to build a visual vocabulary that discretizes frame patches into tokens, as explored by BEiT [2]. However, this requires an external tokenizer, which can be limiting in compute-intensive video understanding scenarios.

MaskFeat ingests the masked space-time input with a vision Transformer backbone and predicts a certain feature representation of the masked content. The authors study a broad spectrum of feature types and find that:

  • A simple histogram of oriented gradients is a particularly effective target for MaskFeat in terms of both performance and efficiency.
  • The discretization of visual signals is not necessary for masked visual prediction.
  • Semantic knowledge from human annotations is not always helpful for MaskFeat, but characterizing local patterns seems important.

MaskFeat

Masked space-time cubes are replaced with a [MASK] token, a learnable embedding that indicates masked patches. The Transformer is trained to predict the features of the masked content, and the loss is computed only on the masked cubes.
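Below is a minimal PyTorch-style sketch of how these pieces fit together; the module and variable names (and the plain linear prediction head) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskFeatSketch(nn.Module):
    """Sketch: swap masked cube embeddings for a learnable [MASK] token and
    regress target features (e.g. HOG) only at the masked positions."""

    def __init__(self, backbone, embed_dim, target_dim):
        super().__init__()
        self.backbone = backbone                                      # any vision Transformer over token sequences
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [MASK] embedding
        self.predictor = nn.Linear(embed_dim, target_dim)             # head that predicts the target feature

    def forward(self, tokens, mask, targets):
        # tokens:  (B, N, embed_dim) patch/cube embeddings of the input video
        # mask:    (B, N) boolean, True where a space-time cube is masked out
        # targets: (B, N, target_dim) features of the original (unmasked) content
        B, N, _ = tokens.shape
        mask_tokens = self.mask_token.expand(B, N, -1)
        x = torch.where(mask.unsqueeze(-1), mask_tokens, tokens)  # replace masked cubes with [MASK]
        pred = self.predictor(self.backbone(x))
        per_token_loss = ((pred - targets) ** 2).mean(dim=-1)     # L2 distance per token
        mask = mask.float()
        return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)  # loss only on masked cubes
```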

Choice of features:

  • Pixel colors. The RGB values are normalized by the mean and the standard deviation of the dataset, and the authors minimized the L2 distance between the prediction and the ground-truth values.
  • HOG. Histogram of Oriented Gradients (HOG) is a feature descriptor that describes the distribution of gradient orientations, or edge directions, within a local subregion. Properties of HOG, such as invariance to geometric and photometric changes, are vital for good results. In addition, local-contrast normalization in HOG is also essential for MaskFeat pre-training. The HOG features are computed on each RGB channel separately, and the loss minimizes the L2 distance (a sketch of per-channel HOG extraction follows this list).
  • Discrete variational autoencoder (dVAE). DALL-E proposed to compress an image with a dVAE codebook; the task is to predict the categorical distribution of each masked token over the codebook by optimizing a cross-entropy loss.
  • Deep features. Deep features are produced by a pre-trained model, either a CNN or a ViT, and the loss minimizes the cosine distance.
  • Pseudo-label. To explore an even higher-level semantic prediction target, the authors considered predicting the class labels of masked patches from Token Labeling [3].
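For concreteness, here is a rough sketch of per-channel HOG target extraction using scikit-image's hog as a stand-in; the bin count, cell size, and normalization settings are assumptions for illustration rather than the paper's exact configuration.

```python
import numpy as np
from skimage.feature import hog

def hog_target(image, orientations=9, cell=(8, 8)):
    """Compute a histogram of oriented gradients on each RGB channel
    independently and concatenate the per-channel features."""
    channels = []
    for c in range(image.shape[-1]):  # loop over R, G, B
        feat = hog(
            image[..., c],
            orientations=orientations,
            pixels_per_cell=cell,
            cells_per_block=(1, 1),  # per-cell histograms
            block_norm="L2",         # local-contrast normalization
            feature_vector=True,
        )
        channels.append(feat)
    return np.concatenate(channels)

# Example: the histograms that fall inside a masked patch would then be
# flattened into that patch's prediction target.
target = hog_target(np.random.rand(224, 224, 3).astype(np.float32))
```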

Comparing target features for MaskFeat on video.

Comparing target features for MaskFeat on image.

Results

Comparison with previous work on K600 & K700.

Comparison with previous work on IN-1K.

Ablation study on HOG implementation.

On IN-1K validation images, pixel targets can incur large errors on ambiguous cases, whereas HOG is more robust to such ambiguity.

References

[1] C. Wei, H. Fan, S. Xie, C. Wu, A. Yuille, C. Feichtenhofer. Masked Feature Prediction for Self-Supervised Visual Pre-Training. In arXiv, 2021.

[2] H. Bao, L. Dong, F. Wei. BEiT: BERT Pre-Training of Image Transformers. In arXiv, 2021.

[3] Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, J. Feng. All Tokens Matter: Token Labeling for Training Better Vision Transformers. In NeurIPS, 2021.
