MaskFeat

Dec 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for Masked Feature Prediction for Self-Supervised Visual Pre-Training [1].

In this work, the authors present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. The approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. MaskFeat achieves competitive results on ImageNet and state-of-the-art results on the Kinetics datasets, AVA, and SSv2.

Introduction

One essential difference between vision and language is that vision has no pre-existing vocabulary to shape the prediction task into a well-defined classification problem. One immediate solution is to build a visual vocabulary that discretizes frame patches into tokens, as explored by BEiT [2]. However, this requires an external tokenizer, which can be limiting in compute-intensive video understanding scenarios.

MaskFeat ingests the masked space-time input with a vision Transformer backbone and predicts a certain feature representation of the masked content. The authors study a broad spectrum of feature types and find that:

  • A simple histogram of oriented gradients is a particularly effective target for MaskFeat in terms of both performance and efficiency.
  • The discretization of visual signals is not necessary for masked visual prediction.
  • Semantic knowledge from human annotations is not always helpful for MaskFeat, but characterizing local patterns seems important.

MaskFeat

Masked space-time cubes are replaced with a [MASK] token, a learnable embedding that indicates masked patches. The Transformer is trained to predict the features of the masked content, and the loss is computed only on the masked cubes.
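Below is a minimal PyTorch-style sketch of how these pieces fit together; the module and variable names (and the plain linear prediction head) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskFeatSketch(nn.Module):
    """Sketch: swap masked cube embeddings for a learnable [MASK] token and
    regress target features (e.g. HOG) only at the masked positions."""

    def __init__(self, backbone, embed_dim, target_dim):
        super().__init__()
        self.backbone = backbone                                      # any vision Transformer over token sequences
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [MASK] embedding
        self.predictor = nn.Linear(embed_dim, target_dim)             # head that predicts the target feature

    def forward(self, tokens, mask, targets):
        # tokens:  (B, N, embed_dim) patch/cube embeddings of the input video
        # mask:    (B, N) boolean, True where a space-time cube is masked out
        # targets: (B, N, target_dim) features of the original (unmasked) content
        B, N, _ = tokens.shape
        mask_tokens = self.mask_token.expand(B, N, -1)
        x = torch.where(mask.unsqueeze(-1), mask_tokens, tokens)  # replace masked cubes with [MASK]
        pred = self.predictor(self.backbone(x))
        per_token_loss = ((pred - targets) ** 2).mean(dim=-1)     # L2 distance per token
        mask = mask.float()
        return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)  # loss only on masked cubes
```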

Choice of features:

  • Pixel colors. The RGB values are normalized by the mean and the standard deviation of the dataset, and the authors minimized the L2 distance between the prediction and the ground-truth values.
  • HOG. Histogram of Oriented Gradients (HOG) is a feature descriptor that describes the distribution of gradient orientations, or edge directions, within a local subregion. Properties of HOG, such as invariance to geometric and photometric changes, are vital for good results. In addition, local-contrast normalization in HOG is also essential for MaskFeat pre-training. The HOG features are computed on each RGB channel separately, and the loss minimizes the L2 distance (a sketch of per-channel HOG extraction follows this list).
  • Discrete variational autoencoder (dVAE). DALL-E proposed to compress an image with a dVAE codebook; the task is to predict the categorical distribution of each masked token over the codebook by optimizing a cross-entropy loss.
  • Deep features. Deep features are produced by a pre-trained model, either a CNN or a ViT, and the loss minimizes the cosine distance.
  • Pseudo-label. To explore an even higher-level semantic prediction target, the authors considered predicting the class labels of masked patches from Token Labeling [3].
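For concreteness, here is a rough sketch of per-channel HOG target extraction using scikit-image's hog as a stand-in; the bin count, cell size, and normalization settings are assumptions for illustration rather than the paper's exact configuration.

```python
import numpy as np
from skimage.feature import hog

def hog_target(image, orientations=9, cell=(8, 8)):
    """Compute a histogram of oriented gradients on each RGB channel
    independently and concatenate the per-channel features."""
    channels = []
    for c in range(image.shape[-1]):  # loop over R, G, B
        feat = hog(
            image[..., c],
            orientations=orientations,
            pixels_per_cell=cell,
            cells_per_block=(1, 1),  # per-cell histograms
            block_norm="L2",         # local-contrast normalization
            feature_vector=True,
        )
        channels.append(feat)
    return np.concatenate(channels)

# Example: the histograms that fall inside a masked patch would then be
# flattened into that patch's prediction target.
target = hog_target(np.random.rand(224, 224, 3).astype(np.float32))
```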

Comparing target features for MaskFeat on video.

Comparing target features for MaskFeat on image.

Results

Comparison with previous work on K600 & K700.

Comparison with previous work on IN-1K.

Ablation study on HOG implementation.

On IN-1K validation images, pixel targets can incur large errors on ambiguous cases, whereas HOG is more robust to such ambiguity.

References

[1] C. Wei, H. Fan, S. Xie, C. Wu, A. Yuille, C. Feichtenhofer. Masked Feature Prediction for Self-Supervised Visual Pre-Training. In arXiv, 2021.

[2] H. Bao, L. Dong, F. Wei. BEiT: BERT Pre-Training of Image Transformers. In arXiv, 2021.

[3] Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, J. Feng. All Tokens Matter: Token Labeling for Training Better Vision Transformers. In NeurIPS, 2021.
