pixelNeRF

December 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for pixelNeRF: Neural Radiance Fields From One or Few Images [1].

While NeRF can render photorealistic novel views, it is often impractical as it requires a large number of posed images and a lengthy per-scene optimization. In this work, the authors propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. Unlike NeRF, which optimizes a representation for each scene independently, pixelNeRF conditions a NeRF on image inputs, which allows the network to be trained across multiple scenes to learn a scene prior. Experiments on the DTU dataset show that pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single-image 3D reconstruction.

Image-Conditioned NeRF

PixelNeRF consists of two components: a fully-convolutional image encoder $E$, which encodes the input image into a pixel-aligned feature grid, and a NeRF network $f$, which outputs color and density given a spatial location and the corresponding encoded feature.

Single-image pixelNeRF. Given an image $\mathbf{I}$ of the scene, we first extract a feature volume $\mathbf{W} = E(\mathbf{I})$. Then, for a point $\mathbf{x}$ on a camera ray, we retrieve the corresponding image feature by projecting $\mathbf{x}$ onto the image plane at image coordinates $\pi(\mathbf{x})$ using the known intrinsics, then bilinearly interpolating between the pixelwise features to extract the feature vector $\mathbf{W}(\pi(\mathbf{x}))$. The image feature is then passed into the NeRF network as \[ f(\gamma(\mathbf{x}), \mathbf{d}; \mathbf{W}(\pi(\mathbf{x}))) = (\sigma, \mathbf{c}) \] where $\gamma(\cdot)$ is the positional encoding.
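
To make the feature lookup $\mathbf{W}(\pi(\mathbf{x}))$ concrete, here is a minimal PyTorch sketch, assuming query points already in the camera coordinate frame and a pinhole intrinsics matrix. The function name and tensor layout are my own for illustration, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def sample_image_features(feat_vol, x_cam, K):
    """Retrieve W(pi(x)): project camera-space points onto the image
    plane with intrinsics K, then bilinearly sample the feature volume.

    feat_vol: (1, C, H, W) pixel-aligned features from the encoder E
    x_cam:    (N, 3) query points in the camera coordinate frame
    K:        (3, 3) camera intrinsics
    """
    uv = (K @ x_cam.T).T             # (N, 3) homogeneous pixel coords
    uv = uv[:, :2] / uv[:, 2:3]      # perspective divide -> pi(x)

    _, _, H, W = feat_vol.shape
    # Normalize pixel coords to [-1, 1] for grid_sample
    # (x indexes width, y indexes height).
    grid = torch.stack(
        [2 * uv[:, 0] / (W - 1) - 1,
         2 * uv[:, 1] / (H - 1) - 1],
        dim=-1,
    ).view(1, 1, -1, 2)

    # Bilinear interpolation between pixelwise features.
    feat = F.grid_sample(feat_vol, grid, mode="bilinear",
                         align_corners=True)
    return feat.view(feat_vol.shape[1], -1).T    # (N, C)
```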

Incorporating multiple views. The model is extended to allow an arbitrary number of views at test time, assuming only the relative camera poses are known. Let the $i$th input image be $\mathbf{I}^{(i)}$ and the associated camera transform from world space to view space be $\mathbf{P}^{(i)} = \left[ \mathbf{R}^{(i)} \; \mathbf{t}^{(i)} \right]$. Given a query point $\mathbf{x}$ and view direction $\mathbf{d}$, we transform them into input view $i$ using \[ \mathbf{x}^{(i)} = \mathbf{P}^{(i)}\mathbf{x}, \quad \mathbf{d}^{(i)} = \mathbf{R}^{(i)}\mathbf{d} \] The initial layers $f_1$ of the NeRF network process each input view separately, and the final layers $f_2$ process the aggregated views, where the intermediate representations are aggregated with the average pooling operator $\psi$: \[\begin{align*} \mathbf{V}^{(i)} & = f_1\left( \gamma(\mathbf{x}^{(i)}), \mathbf{d}^{(i)}; \mathbf{W}^{(i)}(\pi (\mathbf{x}^{(i)})) \right) \\ (\sigma, \mathbf{c}) & = f_2\left( \psi(\mathbf{V}^{(1)}, \dots, \mathbf{V}^{(n)}) \right) \end{align*}\]
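
A sketch of this $f_1$ / average-pool / $f_2$ split is below. It reuses `sample_image_features` from the previous snippet; `f1`, `f2`, `gamma`, and the `views` dictionary layout are placeholders I introduce for illustration, not the paper's actual interfaces.

```python
import torch

def multi_view_query(f1, f2, gamma, x_world, d_world, views):
    """Evaluate (sigma, c) at a query point given n input views.

    f1, f2:  the initial / final MLP stages (hypothetical callables)
    gamma:   positional encoding function
    views:   list of dicts with keys R, t (world-to-view pose),
             K (intrinsics), and feat_vol (encoded features W^(i))
    """
    intermediates = []
    for v in views:
        # x^(i) = P^(i) x,  d^(i) = R^(i) d
        x_i = x_world @ v["R"].T + v["t"]
        d_i = d_world @ v["R"].T
        # W^(i)(pi(x^(i))), using the sampler sketched above
        w_i = sample_image_features(v["feat_vol"], x_i, v["K"])
        intermediates.append(f1(gamma(x_i), d_i, w_i))   # V^(i)
    # psi: average pooling across views, then decode density and color.
    sigma, c = f2(torch.stack(intermediates, dim=0).mean(dim=0))
    return sigma, c
```

Because $\psi$ is a symmetric mean over views, the same network handles any number of input images without retraining.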

Implementation. The encoder $E$ uses a ResNet34 backbone; the features prior to the first four pooling layers are upsampled with bilinear interpolation and concatenated to form latent vectors of size 512. The NeRF network $f$ is a fully-connected ResNet architecture with 5 ResNet blocks and width 512. In practice, hierarchical volume sampling with a coarse and a fine NeRF, together with positional encoding, is adopted.
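
A sketch of the encoder, assuming the feature maps are taken after the stem and the first three residual stages of ResNet34 (my reading of "prior to the first four pooling layers"; the channel counts 64 + 64 + 128 + 256 do add up to the stated 512):

```python
import torch
import torch.nn.functional as F
import torchvision

class PixelAlignedEncoder(torch.nn.Module):
    """ResNet34 backbone whose early feature maps are bilinearly
    upsampled to a common resolution and concatenated. The exact
    cut points here are an assumption, not the paper's code."""

    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet34(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu)
        self.pool = net.maxpool
        self.stages = torch.nn.ModuleList(
            [net.layer1, net.layer2, net.layer3])

    def forward(self, img):
        x = self.stem(img)
        feats = [x]                       # (B, 64, H/2, W/2)
        x = self.pool(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)               # 64, 128, 256 channels
        size = feats[0].shape[-2:]
        # Upsample every map to the stem resolution and concatenate.
        feats = [F.interpolate(f, size=size, mode="bilinear",
                               align_corners=True) for f in feats]
        return torch.cat(feats, dim=1)    # (B, 512, H/2, W/2)
```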

Experiments

Category-specific 1- and 2-view reconstruction on ShapeNet.

Category-agnostic single-view reconstruction on ShapeNet.

Generalization to unseen categories.

360° view prediction with multiple objects.

Novel-view synthesis on the DTU MVS dataset.

References

[1] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. In CVPR, 2021.
