VOLO
VOLO can be regarded as an architecture with two stages. The first stage consists of a stack of Outlookers that generates fine-level token representations, and the second stage deploys a sequence of transformer blocks to aggregate global information.
Outlooker. Outlooker consists of an outlook attention layer and a multi-layer perceptron. Given a sequence of input $C$-dim token representations $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, Outlooker can be written as follows:
\[ \begin{align*} \tilde{\mathbf{X}} & = \text{OutlookAtt}(\text{LN}(\mathbf{X})) + \mathbf{X} \\
\mathbf{Z} & = \text{MLP}(\text{LN}(\tilde{\mathbf{X}})) + \tilde{\mathbf{X}} \end{align*} \]
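As a rough illustration of the two residual steps above, the following NumPy sketch applies a pre-norm attention step followed by a pre-norm MLP step. The names `layer_norm`, `mlp`, and `outlooker_block` are illustrative and not taken from the original. The sketch also assumes a layer norm without learnable scale and shift, a two-layer MLP with a ReLU activation (the text does not specify one), and an attention function passed in as a callable.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel (last) axis; learnable scale/shift omitted (assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    """Per-token two-layer perceptron; ReLU is a stand-in activation (assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

def outlooker_block(x, attn_fn, w1, w2):
    """X~ = attn(LN(X)) + X, then Z = MLP(LN(X~)) + X~."""
    x_tilde = attn_fn(layer_norm(x)) + x
    return mlp(layer_norm(x_tilde), w1, w2) + x_tilde

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
x = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, 4 * C)) * 0.1   # hidden width 4C is an arbitrary choice
w2 = rng.standard_normal((4 * C, C)) * 0.1

# Identity function used as a placeholder for outlook attention.
z = outlooker_block(x, attn_fn=lambda t: t, w1=w1, w2=w2)
print(z.shape)  # (4, 4, 8)
```

Both residual branches preserve the $H \times W \times C$ shape, so Outlookers can be stacked without any reshaping between them.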
Outlook attention. For each spatial location $(i, j)$, outlook attention computes its similarity to all the neighbors within a local window of size $K \times K$ centered at $(i, j)$. Given input $\mathbf{X}$, each $C$-dim token is first projected, using two linear layers with weights $\mathbf{W}_A \in \mathbb{R}^{C \times K^4}$ and $\mathbf{W}_V \in \mathbb{R}^{C \times C}$, into outlook weights $\mathbf{A} \in \mathbb{R}^{H \times W \times K^4}$ and value representations $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$. Let $\mathbf{V}_{\Delta_{i,j}} \in \mathbb{R}^{C \times K^2}$ denote all the values within the local window centered at $(i, j)$, and let $\hat{\mathbf{A}}_{i,j} \in \mathbb{R}^{K^2 \times K^2}$ denote the outlook weight at $(i, j)$ reshaped into a matrix. The outlook attention can be written as
\[ \mathbf{Y}_{\Delta_{i,j}} = \text{MatMul}(\text{Softmax}(\hat{\mathbf{A}}_{i,j}), \mathbf{V}_{\Delta_{i,j}}) \]
The weighted values are then aggregated by summing, at each location, the contributions from every local window that contains it:
\[ \tilde{\mathbf{Y}}_{i,j} = \sum_{0 \leq m, n < K} \mathbf{Y}_{\Delta_{i+m-\lfloor K/2 \rfloor, j+n-\lfloor K/2 \rfloor}}^{i,j} \]
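The projection, windowed weighting, and aggregation steps can be sketched in NumPy as follows. This is a minimal single-head illustration under stated assumptions, not a reference implementation: it assumes an odd $K$, stride 1, zero padding at the image border, no bias terms, and no scaling of the outlook weights. The function name `outlook_attention` and the explicit loops are chosen for readability only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention(x, w_a, w_v, K=3):
    """x: (H, W, C), w_a: (C, K^4), w_v: (C, C). Returns an (H, W, C) array."""
    H, W, C = x.shape
    pad = K // 2
    # Outlook weights, reshaped so that a[i, j] is the K^2 x K^2 matrix A^_{i,j}.
    a = (x @ w_a).reshape(H, W, K * K, K * K)
    attn = softmax(a, axis=-1)
    v = x @ w_v                                            # values, (H, W, C)
    # Zero-pad so every window is full-sized at the border (assumption).
    v_pad = np.pad(v, ((pad, pad), (pad, pad), (0, 0)))
    out_pad = np.zeros_like(v_pad)
    for i in range(H):
        for j in range(W):
            # The K x K window centered at (i, j), flattened to (K^2, C).
            v_win = v_pad[i:i + K, j:j + K].reshape(K * K, C)
            y = attn[i, j] @ v_win                         # weighted values, (K^2, C)
            # Aggregation: add each weighted value back at its own location.
            out_pad[i:i + K, j:j + K] += y.reshape(K, K, C)
    # Discard contributions that landed in the padded border.
    return out_pad[pad:pad + H, pad:pad + W]

rng = np.random.default_rng(0)
H, W, C, K = 5, 5, 4, 3
x = rng.standard_normal((H, W, C))
w_a = rng.standard_normal((C, K ** 4)) * 0.1
w_v = rng.standard_normal((C, C)) * 0.1
y = outlook_attention(x, w_a, w_v, K)
print(y.shape)  # (5, 5, 4)
```

Note that the sketch stores each window as a $K^2 \times C$ array, the transpose of the $C \times K^2$ layout used in the text, so that the matrix product has compatible shapes. Because each location lies in up to $K^2$ overlapping windows, its output is a sum of up to $K^2$ weighted values rather than a single weighted average.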