FiLM: Visual Reasoning with a General Conditioning Layer

Mar 2022

Wufei Ma
Purdue University

Abstract

Paper reading notes for FiLM: Visual Reasoning with a General Conditioning Layer [1].

In this work, the authors introduce a general-purpose conditioning method for neural networks called Feature-wise Linear Modulation (FiLM). A FiLM layer carries out a simple, feature-wise affine transformation on a neural network's intermediate features, conditioned on an arbitrary input. In the case of visual reasoning, FiLM layers enable an RNN over an input question to influence CNN computation over an image. This process adaptively and radically alters the CNN's behaviour as a function of the input question, allowing the overall model to carry out a variety of reasoning tasks, ranging from counting to comparing.

Feature-Wise Linear Modulation (FiLM)

FiLM learns functions $f$ and $h$ which output $\gamma_{i, c}$ and $\beta_{i, c}$ as a function of input $x_i$: \[ \gamma_{i, c} = f_c(x_i), \;\;\; \beta_{i, c} = h_c(x_i) \] where $\gamma_{i, c}$ and $\beta_{i, c}$ modulate a neural network's activations via a feature-wise affine transformation: \[ \text{FiLM}(\mathbf{F}_{i, c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c}\mathbf{F}_{i,c} + \beta_{i,c} \] FiLM only requires two parameters per modulated feature map and has a computational cost that does not scale with the image resolution.
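As a concrete illustration, here is a minimal PyTorch sketch of the FiLM operation itself; the function name and tensor shapes are my own choices, not from the paper:

```python
import torch

def film(features, gamma, beta):
    """Feature-wise linear modulation: FiLM(F | gamma, beta) = gamma * F + beta.

    features: (batch, C, H, W) activations of the network being modulated.
    gamma, beta: (batch, C) parameters predicted from the conditioning input x_i.
    """
    # Broadcast the per-feature-map parameters over the spatial dimensions.
    gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (batch, C, 1, 1)
    beta = beta.unsqueeze(-1).unsqueeze(-1)
    return gamma * features + beta
```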

Model

The FiLM model consists of a FiLM-generating linguistic pipeline and a FiLM-ed visual pipeline.

FiLM generator. The FiLM generator processes a question $x_i$ using a GRU network with 4096 hidden units that takes in learned, 200-dimensional word embeddings. The final GRU hidden state is a question embedding, from which the model predicts $(\gamma_i^n, \beta_i^n)$ for the $n$-th residual block via an affine projection.
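A hedged PyTorch sketch of this generator is given below; the embedding size (200) and GRU size (4096) follow the description above, while the number of ResBlocks and feature maps per block are assumptions:

```python
import torch.nn as nn

class FiLMGenerator(nn.Module):
    """Sketch of the linguistic pipeline: word embeddings -> GRU -> affine head.

    The head predicts (gamma, beta) for every FiLM-ed ResBlock; n_blocks and
    n_features are assumed values, not taken from the notes above.
    """
    def __init__(self, vocab_size, n_blocks=4, n_features=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 200)       # learned 200-d word embeddings
        self.gru = nn.GRU(200, 4096, batch_first=True)   # GRU with 4096 hidden units
        self.head = nn.Linear(4096, 2 * n_blocks * n_features)
        self.n_blocks, self.n_features = n_blocks, n_features

    def forward(self, question_tokens):
        # question_tokens: (batch, seq_len) integer token ids
        _, h = self.gru(self.embed(question_tokens))      # h: (1, batch, 4096)
        params = self.head(h.squeeze(0))                  # (batch, 2 * n_blocks * n_features)
        params = params.view(-1, self.n_blocks, 2, self.n_features)
        gammas, betas = params[:, :, 0], params[:, :, 1]  # each (batch, n_blocks, n_features)
        return gammas, betas
```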

FiLM-ed network. The FiLM-ed ResBlock starts with a 1x1 convolution followed by one 3x3 convolution. The affine (scaling and shifting) parameters of the batch normalization layers that immediately precede FiLM layers are turned off, since FiLM itself supplies the scaling and shifting. Following prior work on CLEVR, two coordinate feature maps are concatenated to the image features, each ResBlock's input, and the classifier's input to facilitate spatial reasoning.
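Below is a hedged PyTorch sketch of one FiLM-ed ResBlock together with the coordinate-map concatenation; the exact placement of the residual connection and the channel count (128) are assumptions rather than details stated above:

```python
import torch
import torch.nn as nn

def append_coords(features):
    """Concatenate two coordinate feature maps (x and y positions scaled to [-1, 1])."""
    b, _, h, w = features.shape
    ys = torch.linspace(-1, 1, h, device=features.device).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1, 1, w, device=features.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([features, xs, ys], dim=1)

class FiLMedResBlock(nn.Module):
    """One FiLM-ed ResBlock: 1x1 conv -> ReLU -> 3x3 conv -> BN (affine off) -> FiLM -> ReLU,
    with a residual connection around the second branch (assumed placement)."""
    def __init__(self, in_channels, channels=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # BN's own scale/shift are disabled; the FiLM layer supplies them instead.
        self.bn = nn.BatchNorm2d(channels, affine=False)

    def forward(self, x, gamma, beta):
        # x: (batch, in_channels, H, W); gamma, beta: (batch, channels)
        x = torch.relu(self.conv1(x))
        out = self.bn(self.conv2(x))
        out = gamma.unsqueeze(-1).unsqueeze(-1) * out + beta.unsqueeze(-1).unsqueeze(-1)
        return x + torch.relu(out)
```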

Experiments

Quantitative performance on CLEVR.


Quantitative performance on CLEVR-Humans.


Few-shot and zero-shot generalization using the CLEVR Compositional Generalization Test (CLEVR-CoGenT).

References

[1] E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In AAAI, 2018.
