VOLO

December 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for VOLO: Vision Outlooker for Visual Recognition [1].

In this work, the authors try to close the performance gap between CNNs and vision transformers (ViTs). They find that a major factor limiting the performance of ViTs on ImageNet classification is their low efficacy in encoding fine-level features and contexts into token representations. To address this, they introduce a novel outlook attention and present the Vision Outlooker (VOLO). Experiments show that VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification without using any extra training data, and transfers well to downstream tasks.

Introduction

The authors identify one major factor limiting ViTs from outperforming CNNs: their low efficacy in encoding fine-level features and contexts into token representations. Fine-level information can be encoded into tokens through finer-grained image tokenization, which, however, yields a much longer token sequence and thus a quadratic increase in the cost of self-attention.

The authors propose a new, simple, and lightweight attention mechanism, termed Outlooker. Outlooker innovates in the way attention is generated for token aggregation and enables the model to encode fine-level information efficiently. It generates the weights for aggregating surrounding tokens directly from the anchor token feature via efficient linear projections, thus getting rid of the expensive dot-product attention computation.

VOLO

VOLO can be regarded as an architecture with two stages. The first stage consists of a stack of Outlookers that generates fine-level token representations, and the second stage deploys a sequence of transformer blocks to aggregate global information.
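To make the two-stage layout concrete, below is a minimal PyTorch sketch of the overall pipeline. The depths, widths, and the convolutional downsampling between stages are illustrative assumptions, not the paper's exact configurations; Outlooker refers to the block sketched in the next paragraph, and the patch embedding and classification head are omitted.

```python
import torch.nn as nn

class VOLO(nn.Module):
    """Schematic two-stage VOLO; depths/widths here are placeholders."""
    def __init__(self, dim=192, num_outlookers=4, num_transformers=14):
        super().__init__()
        # Stage 1: Outlookers over a fine-grained token grid.
        # (Outlooker is sketched in the next section.)
        self.stage1 = nn.Sequential(
            *[Outlooker(dim) for _ in range(num_outlookers)])
        # Downsample the token grid 2x before the transformer stage.
        self.downsample = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)
        # Stage 2: standard transformer blocks aggregate global information.
        self.stage2 = nn.Sequential(*[
            nn.TransformerEncoderLayer(
                d_model=dim * 2, nhead=6, dim_feedforward=dim * 6,
                batch_first=True, norm_first=True)
            for _ in range(num_transformers)])

    def forward(self, x):                  # x: (B, H, W, C) token grid
        x = self.stage1(x)                 # fine-level token encoding
        x = self.downsample(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        B, H, W, C = x.shape
        return self.stage2(x.reshape(B, H * W, C))  # global self-attention
```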

Outlooker. Outlooker consists of an outlook attention layer and a multi-layer perceptron (MLP). Given a sequence of input $C$-dim token representations $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, Outlooker can be written as follows: \[ \begin{align*} \tilde{\mathbf{X}} & = \text{OutlookAtt}(\text{LN}(\mathbf{X})) + \mathbf{X} \\ \mathbf{Z} & = \text{MLP}(\text{LN}(\tilde{\mathbf{X}})) + \tilde{\mathbf{X}} \end{align*} \]
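A direct PyTorch transcription of these two equations might look like the following sketch; OutlookAttention is the module sketched after the outlook attention equations below, and the MLP expansion ratio of 3 is an assumed value.

```python
import torch.nn as nn

class Outlooker(nn.Module):
    """One Outlooker block: pre-LayerNorm outlook attention and MLP,
    each with a residual connection (the two equations above)."""
    def __init__(self, dim, kernel_size=3, mlp_ratio=3.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = OutlookAttention(dim, kernel_size)  # sketched below
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                  # x: (B, H, W, C)
        x = x + self.attn(self.norm1(x))   # X~ = OutlookAtt(LN(X)) + X
        x = x + self.mlp(self.norm2(x))    # Z  = MLP(LN(X~)) + X~
        return x
```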

Outlook attention. For each spatial location $(i, j)$, outlook attention generates the weights for aggregating all the neighbors within a local window of size $K \times K$ centered at $(i, j)$ directly from the token at $(i, j)$, without computing query-key similarities. Given input $\mathbf{X}$, each $C$-dim token is first projected, using two linear layers of weights $\mathbf{W}_A \in \mathbb{R}^{C \times K^4}$ and $\mathbf{W}_V \in \mathbb{R}^{C \times C}$, into outlook weights $\mathbf{A} \in \mathbb{R}^{H \times W \times K^4}$ and value representations $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$. Let $\mathbf{V}_{\Delta i, j} \in \mathbb{R}^{C \times K^2}$ denote all the values within the local window centered at $(i, j)$, and let $\hat{\mathbf{A}}_{i, j} \in \mathbb{R}^{K^2 \times K^2}$ denote the outlook weight $\mathbf{A}_{i, j}$ reshaped into a matrix. The outlook attention can be written as \[ \mathbf{Y}_{\Delta i, j} = \text{MatMul}(\text{Softmax}(\hat{\mathbf{A}}_{i, j}), \mathbf{V}_{\Delta i, j}) \] The weighted values from all overlapping windows are then aggregated densely: \[ \tilde{\mathbf{Y}}_{i, j} = \sum_{0 \leq m, n < K} \mathbf{Y}_{\Delta i+m-\lfloor K/2 \rfloor , j+n-\lfloor K/2 \rfloor}^{i, j} \]
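Both the windowed matrix multiplication and the dense aggregation map naturally onto PyTorch's unfold/fold operations. Below is a minimal single-head, stride-1 sketch of outlook attention following these equations; the released implementation additionally uses multiple heads, strided windows with average pooling, and dropout, so treat the names and defaults here as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head, stride-1 sketch of outlook attention."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.K = kernel_size
        self.scale = dim ** -0.5                      # softmax temperature
        self.v = nn.Linear(dim, dim, bias=False)      # W_V: C -> C
        self.attn = nn.Linear(dim, kernel_size ** 4)  # W_A: C -> K^4
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                             # x: (B, H, W, C)
        B, H, W, C = x.shape
        K2 = self.K * self.K
        # Values: gather the K*K neighbors of every location (V_{Delta i,j}).
        v = self.v(x).permute(0, 3, 1, 2)             # (B, C, H, W)
        v = self.unfold(v).reshape(B, C, K2, H * W)
        v = v.permute(0, 3, 2, 1)                     # (B, HW, K^2, C)
        # Outlook weights: generated directly from the center token,
        # then reshaped into a K^2 x K^2 matrix (A-hat in the equations).
        a = (self.attn(x) * self.scale).reshape(B, H * W, K2, K2)
        a = a.softmax(dim=-1)
        y = a @ v                                     # (B, HW, K^2, C)
        # Dense aggregation: fold sums the contributions each token
        # receives from all K*K windows that cover it.
        y = y.permute(0, 3, 2, 1).reshape(B, C * K2, H * W)
        y = F.fold(y, output_size=(H, W),
                   kernel_size=self.K, padding=self.K // 2)
        return self.proj(y.permute(0, 2, 3, 1))       # (B, H, W, C)
```

With $K = 3$, each token thus produces $81$ weights that are reshaped into a $9 \times 9$ matrix and applied to the $9$ values in its window, so no query-key dot products are ever computed.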

Results

Top-1 accuracy comparison of VOLO with previous state-of-the-art methods on ImageNet, ImageNet Real, and ImageNet-V2.

References

[1] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan. VOLO: Vision Outlooker for Visual Recognition. arXiv preprint, 2021.
