NeRF
Mildenhall et al. [1] proposed NeRF for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Images are then synthesized by querying 5D coordinates (spatial location and viewing direction) along camera rays and using classic volume rendering techniques to project the output colors and densities into an image. To generate high-resolution representations, the 5D coordinates are transformed with a positional encoding, and a hierarchical sampling procedure is adopted to reduce the number of queries.
Neural radiance field scene representation. The scene is represented continuously as a 5D vector-valued function whose input is a 3D location $\mathbf{x} = (x, y, z)$ and 2D viewing direction $(\theta, \phi)$, and whose output is an emitted color $\mathbf{c} = (r, g, b)$ and volume density $\sigma$. The 5D scene representation is approximated with an MLP $F_\Theta: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$. The representation is made multiview consistent by restricting the network to predict the volume density $\sigma$ as a function of the location $\mathbf{x}$ only, while the color $\mathbf{c}$ depends on both location and viewing direction.
Volume rendering with radiance fields. We render the color of any ray passing through the scene using principles from classical volume rendering. The volume density $\sigma(\mathbf{x})$ can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{x}$. The expected color $C(\mathbf{r})$ of camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with near and far bounds $t_n$ and $t_f$ is:
\[C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t), \mathbf{d})dt, \; T(t) = \exp \left( - \int_{t_n}^t \sigma(\mathbf{r}(s))ds \right)\]
The function $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any other particle. We numerically estimate the continuous integral using quadrature, partitioning $[t_n, t_f]$ into $N$ evenly spaced bins and drawing one sample uniformly at random from within each bin:
\[t_i \sim \mathcal{U}\left[ t_n + \frac{i-1}{N}(t_f - t_n), t_n + \frac{i}{N}(t_f - t_n) \right]\]
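This stratified sampling can be sketched in a few lines of pure Python (the function name and fixed seed are illustrative, not from the paper):

```python
import random

def stratified_samples(t_n, t_f, n_bins, rng=random.Random(0)):
    """Draw one sample uniformly at random from each of n_bins
    evenly spaced bins covering [t_n, t_f]."""
    width = (t_f - t_n) / n_bins
    return [t_n + (i + rng.random()) * width for i in range(n_bins)]
```

Because each bin contributes exactly one sample, the returned depths are sorted and cover the whole interval, while the exact locations vary between training iterations.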
$C(\mathbf{r})$ is then estimated with the quadrature rule given by
\[\hat{C}(\mathbf{r}) = \sum_{i=1}^N T_i(1 - \exp(-\sigma_i\delta_i))\mathbf{c}_i, \; T_i = \exp\left( - \sum_{j=1}^{i-1} \sigma_j\delta_j \right)\]
where $\delta_i = t_{i+1}-t_i$ is the distance between adjacent samples.
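The quadrature rule amounts to standard alpha compositing, since $T_i = \prod_{j<i}(1-\alpha_j)$ with $\alpha_i = 1 - \exp(-\sigma_i\delta_i)$. A minimal pure-Python sketch (the helper name is illustrative):

```python
import math

def composite(sigmas, colors, ts, t_far):
    """Estimate the ray color via the quadrature rule:
    alpha_i = 1 - exp(-sigma_i * delta_i) and
    T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)."""
    deltas = [ts[i + 1] - ts[i] for i in range(len(ts) - 1)] + [t_far - ts[-1]]
    T = 1.0                # accumulated transmittance
    out = [0.0, 0.0, 0.0]  # composited RGB color
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        out = [o + T * alpha * ci for o, ci in zip(out, c)]
        T *= 1.0 - alpha
    return out
```

An effectively opaque sample absorbs all remaining transmittance, so samples behind it contribute nothing to the rendered color.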
Positional encoding. The authors found that vanilla NeRF performed poorly at representing high-frequency variations in color and geometry. Following [2], the authors adopt an encoding function $\gamma$ that maps from $\mathbb{R}$ into $\mathbb{R}^{2L}$:
\[\gamma (p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \cdots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p) \right)\]
The function $\gamma(\cdot)$ is applied separately to each of the three coordinate values in $\mathbf{x}$ and to each of the three components of the Cartesian viewing direction unit vector $\mathbf{d}$ (the paper uses $L = 10$ for $\gamma(\mathbf{x})$ and $L = 4$ for $\gamma(\mathbf{d})$).
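The encoding is straightforward to implement; a minimal sketch in pure Python (the function name is illustrative):

```python
import math

def positional_encoding(p, L):
    """Map a scalar p to (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^{L-1} pi p), cos(2^{L-1} pi p)) in R^{2L}."""
    out = []
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        out.extend((math.sin(freq * p), math.cos(freq * p)))
    return out
```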
Hierarchical volume sampling. It is inefficient to query free space and occluded regions repeatedly. Inspired by early work on volume rendering [3], the authors proposed a hierarchical representation that allocates samples proportionally to their expected effect on the final rendering. We first sample $N_c$ locations with stratified sampling and evaluate the ``coarse'' network, rewriting its rendered color as a weighted sum:
\[\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i\mathbf{c}_i, \; w_i = T_i(1- \exp(-\sigma_i\delta_i))\]
Normalizing the weights $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ produces a piecewise-constant PDF along the ray. A second set of $N_f$ locations is then sampled from this distribution using inverse transform sampling, and the ``fine'' network is evaluated at the union of both sets to render the final color from all $N_c + N_f$ samples.
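Inverse transform sampling from a piecewise-constant PDF can be sketched as follows, assuming the bin edges and nonnegative weights are given (the function name and fixed seed are illustrative):

```python
import bisect
import random

def sample_pdf(edges, weights, n_samples, rng=random.Random(0)):
    """Inverse transform sampling from the piecewise-constant PDF
    defined by `weights` over the bins delimited by `edges`
    (len(edges) == len(weights) + 1)."""
    total = sum(weights)
    # Build the piecewise-linear CDF at the bin edges.
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(n_samples):
        u = rng.random()
        # Locate the bin whose CDF interval contains u.
        i = min(bisect.bisect_right(cdf, u) - 1, len(weights) - 1)
        span = cdf[i + 1] - cdf[i]
        frac = (u - cdf[i]) / span if span > 0 else 0.0
        # Density is constant inside a bin, so interpolate linearly.
        samples.append(edges[i] + frac * (edges[i + 1] - edges[i]))
    return sorted(samples)
```

Samples concentrate in high-weight bins, which is how the fine pass allocates queries toward the parts of the ray that actually contribute to the rendering.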
Implementation details and results can be found in the original paper [1].