NeRF
Mildenhall et al. [1] proposed NeRF for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Images are then synthesized by querying 5D coordinates (spatial location and viewing direction) along camera rays and using classic volume rendering techniques to project the output colors and densities into an image. To generate high-resolution representations, the 5D coordinates are transformed with a positional encoding, and a hierarchical sampling procedure is adopted to reduce the number of queries.
Neural radiance field scene representation. The scene is represented continuously as a 5D vector-valued function whose input is a 3D location $\mathbf{x} = (x, y, z)$ and 2D viewing direction $(\theta, \phi)$, and whose output is an emitted color $\mathbf{c} = (r, g, b)$ and volume density $\sigma$. The 5D scene representation is approximated with an MLP $F_\Theta: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$. The representation is made multiview consistent by restricting the network to predict the volume density $\sigma$ as a function of the location $\mathbf{x}$ only, while the color $\mathbf{c}$ depends on both location and viewing direction.
Volume rendering with radiance fields. We render the color of any ray passing through the scene using principles from classical volume rendering. The volume density $\sigma(\mathbf{x})$ can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{x}$. The expected color $C(\mathbf{r})$ of camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with near and far bounds $t_n$ and $t_f$ is:
\[C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t), \mathbf{d})dt, \; T(t) = \exp \left( - \int_{t_n}^t \sigma(\mathbf{r}(s))ds \right)\]
The function $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any other particle. We numerically estimate the continuous integral using quadrature, partitioning $[t_n, t_f]$ into $N$ evenly spaced bins and drawing one sample uniformly at random from within each bin:
\[t_i \sim \mathcal{U}\left[ t_n + \frac{i-1}{N}(t_f - t_n), t_n + \frac{i}{N}(t_f - t_n) \right]\]
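This stratified sampling can be sketched in a few lines of pure Python (the function name and fixed seed are illustrative, not from the paper):

```python
import random

def stratified_samples(t_n, t_f, n_bins, rng=random.Random(0)):
    """Draw one sample uniformly at random from each of n_bins
    evenly spaced bins covering [t_n, t_f]."""
    width = (t_f - t_n) / n_bins
    return [t_n + (i + rng.random()) * width for i in range(n_bins)]
```

Because each bin contributes exactly one sample, the returned depths are sorted and cover the whole interval, while the exact locations vary between training iterations.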
$C(\mathbf{r})$ is then estimated with the quadrature rule given by
\[\hat{C}(\mathbf{r}) = \sum_{i=1}^N T_i(1 - \exp(-\sigma_i\delta_i))\mathbf{c}_i, \; T_i = \exp\left( - \sum_{j=1}^{i-1} \sigma_j\delta_j \right)\]
where $\delta_i = t_{i+1}-t_i$ is the distance between adjacent samples.
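The quadrature rule amounts to standard alpha compositing, since $T_i = \prod_{j<i}(1-\alpha_j)$ with $\alpha_i = 1 - \exp(-\sigma_i\delta_i)$. A minimal pure-Python sketch (the helper name is illustrative):

```python
import math

def composite(sigmas, colors, ts, t_far):
    """Estimate the ray color via the quadrature rule:
    alpha_i = 1 - exp(-sigma_i * delta_i) and
    T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)."""
    deltas = [ts[i + 1] - ts[i] for i in range(len(ts) - 1)] + [t_far - ts[-1]]
    T = 1.0                # accumulated transmittance
    out = [0.0, 0.0, 0.0]  # composited RGB color
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        out = [o + T * alpha * ci for o, ci in zip(out, c)]
        T *= 1.0 - alpha
    return out
```

An effectively opaque sample absorbs all remaining transmittance, so samples behind it contribute nothing to the rendered color.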
Positional encoding. The authors found that vanilla NeRF performed poorly at representing high-frequency variations in color and geometry. Following [2], the authors adopt an encoding function $\gamma$ that maps from $\mathbb{R}$ into $\mathbb{R}^{2L}$:
\[\gamma (p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \cdots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p) \right)\]
The function $\gamma(\cdot)$ is applied separately to each of the three coordinate values in $\mathbf{x}$ and to each of the three components of the Cartesian viewing direction unit vector $\mathbf{d}$ (the paper uses $L = 10$ for $\gamma(\mathbf{x})$ and $L = 4$ for $\gamma(\mathbf{d})$).
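The encoding is straightforward to implement; a minimal sketch in pure Python (the function name is illustrative):

```python
import math

def positional_encoding(p, L):
    """Map a scalar p to (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^{L-1} pi p), cos(2^{L-1} pi p)) in R^{2L}."""
    out = []
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        out.extend((math.sin(freq * p), math.cos(freq * p)))
    return out
```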
Hierarchical volume sampling. It is inefficient to query free space and occluded regions repeatedly. Inspired by early work on volume rendering [3], the authors proposed a hierarchical representation that allocates samples proportionally to their expected effect on the final rendering. We first sample $N_c$ locations with stratified sampling and evaluate the ``coarse'' network, rewriting its rendered color as a weighted sum:
\[\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i\mathbf{c}_i, \; w_i = T_i(1- \exp(-\sigma_i\delta_i))\]
Normalizing the weights $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ produces a piecewise-constant PDF along the ray. A second set of $N_f$ locations is then sampled from this distribution using inverse transform sampling, and the ``fine'' network is evaluated at the union of both sets to render the final color from all $N_c + N_f$ samples.
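Inverse transform sampling from a piecewise-constant PDF can be sketched as follows, assuming the bin edges and nonnegative weights are given (the function name and fixed seed are illustrative):

```python
import bisect
import random

def sample_pdf(edges, weights, n_samples, rng=random.Random(0)):
    """Inverse transform sampling from the piecewise-constant PDF
    defined by `weights` over the bins delimited by `edges`
    (len(edges) == len(weights) + 1)."""
    total = sum(weights)
    # Build the piecewise-linear CDF at the bin edges.
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(n_samples):
        u = rng.random()
        # Locate the bin whose CDF interval contains u.
        i = min(bisect.bisect_right(cdf, u) - 1, len(weights) - 1)
        span = cdf[i + 1] - cdf[i]
        frac = (u - cdf[i]) / span if span > 0 else 0.0
        # Density is constant inside a bin, so interpolate linearly.
        samples.append(edges[i] + frac * (edges[i + 1] - edges[i]))
    return sorted(samples)
```

Samples concentrate in high-weight bins, which is how the fine pass allocates queries toward the parts of the ray that actually contribute to the rendering.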
Implementation details and results can be found in the original paper [1].