Neural Radiance Fields (NeRF)

July 2021

Wufei Ma
Purdue University

Abstract

Paper reading notes for NeRF: Representing scenes as neural radiance fields for view synthesis [1].

In this work, the authors presented a method that achieves SOTA results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The algorithm represents a scene using a fully-connected deep network, whose input is a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. They synthesized views by querying 5D coordinates along camera rays and using classic volume rendering techniques to project the output colors and densities into an image.

[Figure: overview]
Neural Radiance Field Scene Representation

We represent a continuous scene as a 5D vector-valued function whose input is a 3D location $\mathbf{x} = (x, y, z)$ and 2D viewing direction $(\theta, \phi)$, and whose output is an emitted color $\mathbf{c} = (r, g, b)$ and volume density $\sigma$. We use a 3D Cartesian unit vector $\mathbf{d}$ to express direction and approximate this continuous 5D scene representation with an MLP network $F_\Theta: (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$.

In order for the representation to be multiview consistent, the network predicts the volume density $\sigma$ as a function of the location $\mathbf{x}$ only, while the RGB color $\mathbf{c}$ is predicted as a function of both location and viewing direction.
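A minimal PyTorch sketch of such a two-branch MLP is below. The widths, depths, and names (`NeRFMLP`, `x_enc`, `d_enc`) are illustrative rather than the paper's exact architecture; the inputs stand for the (positionally encoded) location and direction described later. The property the sketch shows is that $\sigma$ depends on the position branch alone, while the color head also sees the viewing direction.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the 5D scene function F_Theta: (x, d) -> (c, sigma)."""

    def __init__(self, x_dim=60, d_dim=24, width=256):
        # x_dim=60 and d_dim=24 match the positional encodings with
        # L=10 and L=4 described later (2 * L * 3 components each).
        super().__init__()
        # Density branch: depends on the encoded position only.
        self.trunk = nn.Sequential(
            nn.Linear(x_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)
        self.feature = nn.Linear(width, width)
        # Color branch: depends on the trunk feature and the encoded direction.
        self.color_head = nn.Sequential(
            nn.Linear(width + d_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))      # volume density, constrained >= 0
        feat = self.feature(h)
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d_enc], dim=-1)))
        return rgb, sigma                           # (.., 3) color and (.., 1) density
```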

Volume Rendering with Radiance Fields

The volume density $\sigma(\mathbf{x})$ can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{x}$. The expected color $C(\mathbf{r})$ of camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with near and far bounds $t_n$ and $t_f$ is: \[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt, \; \text{where} \; T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds \right) \] where $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$.

This continuous integral is numerically estimated using quadrature. A stratified sampling approach is adopted by partitioning $[t_n, t_f]$ into $N$ evenly-spaced bins and then drawing one sample uniformly from within each bin. We can then estimate $C(\mathbf{r})$ with the quadrature rule \[ \hat{C}(\mathbf{r}) = \sum_{i=1}^N T_i(1 - \exp(-\sigma_i \delta_i))\mathbf{c}_i, \; \text{where} \; T_i = \exp\left( -\sum_{j=1}^{i-1} \sigma_j\delta_j \right) \] where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples.
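This quadrature reduces to standard alpha compositing. Below is a small PyTorch sketch for a single ray; the tensor layout and the choice to drop the final sample (which has no $\delta_i$) are simplifications of mine, not the authors' implementation.

```python
import torch

def render_ray(sigmas, rgbs, t_vals):
    """Estimate C_hat(r) for one ray with the quadrature rule above.

    sigmas: (N,) densities sigma_i; rgbs: (N, 3) colors c_i; t_vals: (N,) ascending depths t_i.
    """
    deltas = t_vals[1:] - t_vals[:-1]                       # delta_i = t_{i+1} - t_i
    sigmas, rgbs = sigmas[:-1], rgbs[:-1]                   # last sample has no delta; drop it
    alphas = 1.0 - torch.exp(-sigmas * deltas)              # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = prod_{j<i} (1 - alpha_j)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = trans * alphas                                # w_i, reused later for hierarchical sampling
    return (weights.unsqueeze(-1) * rgbs).sum(dim=0)        # C_hat(r), shape (3,)
```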

Optimizing a Neural Radiance Field

The authors introduced two improvements to enable representing high-resolution complex scenes.

Positional encoding. The authors introduced a positional encoding of the input coordinates to assist the MLP in representing high-frequency functions. They showed that reformulating $F_\Theta$ as a composition of two functions $F_\Theta = F_\Theta' \circ \gamma$, one learned and one fixed, significantly improved performance. The encoding function $\gamma: \mathbb{R} \to \mathbb{R}^{2L}$, applied separately to each of the three coordinates of $\mathbf{x}$ and to the three components of the unit direction $\mathbf{d}$, is given by: \[ \gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \cdots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p) \right) \] In the experiments, they set $L=10$ for $\gamma(\mathbf{x})$ and $L=4$ for $\gamma(\mathbf{d})$.
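A sketch of this encoding in PyTorch (the function name and tensor layout are illustrative): with $L=10$ a 3D position maps to $2 \times 10 \times 3 = 60$ dimensions, and with $L=4$ a 3D direction maps to 24.

```python
import math
import torch

def positional_encoding(p, L):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)),
    applied separately to every coordinate of p.

    p: (..., D) tensor of coordinates; output shape is (..., 2 * L * D).
    """
    freqs = (2.0 ** torch.arange(L)) * math.pi              # 2^k * pi for k = 0..L-1
    angles = p.unsqueeze(-1) * freqs                        # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.flatten(-2)                                  # (..., 2 * L * D)

# e.g. positional_encoding(torch.tensor([0.1, 0.2, 0.3]), L=10) -> 60-dim vector
```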

Hierarchical volume sampling. Densely evaluating the neural radiance field network at $N$ query points along each camera ray is inefficient: free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly. Instead of using a single network to represent the scene, we simultaneously optimize two networks: one "coarse" and one "fine". We first sample $N_c$ locations using stratified sampling and evaluate the "coarse" network at these locations. The per-sample weights $w_i = T_i(1 - \exp(-\sigma_i \delta_i))$ from this coarse pass, normalized along the ray, define a piecewise-constant PDF that is used to draw a more informed set of $N_f$ points along each ray, biasing samples towards the relevant parts of the volume.
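One way to implement this resampling is inverse-transform sampling from that piecewise-constant PDF. The sketch below operates on a single ray; the names (`sample_fine`, `bin_edges`) and the exact handling of bin edges are my assumptions rather than the authors' code.

```python
import torch

def sample_fine(bin_edges, weights, n_fine):
    """Draw N_f depths along a ray by inverting the CDF of the coarse weights.

    bin_edges: (N_c + 1,) edges of the coarse bins along the ray
    weights:   (N_c,) coarse weights w_i = T_i * (1 - exp(-sigma_i * delta_i))
    """
    pdf = weights / (weights.sum() + 1e-10)                        # piecewise-constant PDF over bins
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])    # (N_c + 1,)
    u = torch.rand(n_fine)                                         # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(weights))
    cdf_lo, cdf_hi = cdf[idx - 1], cdf[idx]
    lo, hi = bin_edges[idx - 1], bin_edges[idx]
    frac = (u - cdf_lo) / torch.clamp(cdf_hi - cdf_lo, min=1e-10)  # position inside the chosen bin
    return lo + frac * (hi - lo)                                   # (N_f,) fine sample depths
```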

Implementation. A separate neural continuous volume representation network is optimized for each scene. At each iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset. With $N_c$ samples from the coarse network and $N_c + N_f$ samples from the fine network, we use volume rendering to render the color of each ray. The loss is simply \[ \mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \rVert_2^2 + \lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \rVert_2^2 \right] \] where $\mathcal{R}$ is the set of rays in each batch, and $C(\mathbf{r})$, $\hat{C}_c(\mathbf{r})$, and $\hat{C}_f(\mathbf{r})$ are the ground-truth, coarse-predicted, and fine-predicted RGB colors for ray $\mathbf{r}$, respectively.
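In code the loss is a direct transcription of the formula above; this sketch assumes the rendered colors for a batch of $R$ rays are stacked into $(R, 3)$ tensors (implementations often average over the batch rather than summing).

```python
import torch

def nerf_loss(rgb_coarse, rgb_fine, rgb_gt):
    """Sum of squared errors between rendered and ground-truth ray colors.

    rgb_coarse, rgb_fine, rgb_gt: (R, 3) tensors for a batch of R rays.
    """
    coarse_term = ((rgb_coarse - rgb_gt) ** 2).sum(dim=-1)  # ||C_hat_c(r) - C(r)||^2 per ray
    fine_term = ((rgb_fine - rgb_gt) ** 2).sum(dim=-1)      # ||C_hat_f(r) - C(r)||^2 per ray
    return (coarse_term + fine_term).sum()                  # sum over the batch of rays
```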

Results

Quantitative results.

[Figure: quantitative results]

Qualitative comparisons. Also see matthewtancik.com.

[Figure: qualitative comparisons]

Ablation study.

[Figure: ablation study]

References

[1] B. Mildenhall, P.P. Srinivasan, M. Tancik, J.T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
