RePOSE
Given an input image $\mathbf{I}$, RePOSE extracts a feature map $\mathbf{F}_\text{inp}$ from $\mathbf{I}$ using a CNN $\Phi$. It then refines an initial pose estimate $\mathbf{P}_\text{ini} = \Omega(\mathbf{I})$, where $\Omega$ is any pose estimation method such as PVNet or PoseCNN, in real time using differentiable Levenberg-Marquardt (LM) optimization. To do so, RePOSE renders the template 3D model with learnable deep textures in pose $\mathbf{P}$ to obtain features $\mathbf{F}_\text{rend}$, and refines the pose by minimizing the distance between $\mathbf{F}_\text{inp}$ and $\mathbf{F}_\text{rend}$.
Feature extraction. A U-Net architecture is used for the CNN $\Phi$, and its decoder outputs a deep feature map $\mathbf{F}_\text{inp} \in \mathbb{R}^{w \times h \times d}$, i.e. a $d$-dimensional feature for every pixel in $\mathbf{I}$. The authors found $d=3$ to be optimal.
Template 3D Model Rendering. The template 3D model $\mathcal{M}$ with pose $\mathbf{P} = \{ \mathbf{R}, \mathbf{t} \}$ is projected to 2D to render the feature $\mathbf{F}_\text{rend}$. Let the template 3D model $\mathcal{M} = \{ \mathcal{V}, \mathcal{C}, \mathcal{F}\}$ be represented by a watertight triangular mesh consisting of $N$ vertices $\mathcal{V} = \{V_n\}_{n=1}^N$ where $V_n \in \mathbb{R}^3$, faces $\mathcal{F}$, and per-vertex deep features $\mathcal{C} = \{C_n\}$ with $C_n \in \mathbb{R}^d$. The vertices $V_n$ are projected to image points $v_n \in \mathbb{R}^2$ using
\[ v_n = \pi(V_n \mathbf{R}^\top + \mathbf{t}^\top) \]
and the rendered feature at pixel location $(x, y)$ is a linear combination of the deep features at the three vertices ($i = 1, 2, 3$) of the triangular face covering that pixel
\[ \mathbf{F}_\text{rend}(x, y) = \sum_{i=1}^3 w_n^i C_n^i \]
The authors' custom implementation of the renderer takes less than 1ms to render $\mathbf{F}_\text{rend}$.
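The projection and per-pixel interpolation above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' renderer: it assumes that $\pi$ is a pinhole projection with a hypothetical intrinsic matrix `K`, that the interpolation weights are the barycentric coordinates of the pixel inside the projected triangle, and it handles a single triangle without rasterization or visibility tests. All function names are made up for this example.

```python
import numpy as np

def project_vertices(V, R, t, K):
    """Map N x 3 model vertices to N x 2 pixel coordinates.

    Follows the row-vector convention of the text: V R^T + t^T.
    K is a 3 x 3 pinhole intrinsic matrix (an assumption of this sketch).
    """
    cam = V @ R.T + t.reshape(1, 3)      # vertices in the camera frame
    uvw = cam @ K.T                      # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p in triangle (a, b, c)."""
    T = np.column_stack((b - a, c - a))  # 2 x 2 matrix of triangle edges
    w1, w2 = np.linalg.solve(T, p - a)
    return np.array([1.0 - w1 - w2, w1, w2])

def interpolate_feature(p, tri_2d, tri_feats):
    """Feature at pixel p as a weighted sum of the three vertex features."""
    w = barycentric_weights(p, *tri_2d)  # tri_2d is 3 x 2, one row per vertex
    return w @ tri_feats                 # (3,) @ (3, d) -> (d,)

# Toy example: one triangle with d = 3 features, identity rotation.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
V = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
C = np.eye(3)                            # one feature vector per vertex
v = project_vertices(V, np.eye(3), np.array([0.0, 0.0, 2.0]), K)
f = interpolate_feature(v.mean(axis=0), v, C)  # feature at the centroid
```

At the centroid of the projected triangle all three weights are equal, so the interpolated feature is the average of the three vertex features.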
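LM optimization. The optimal pose $\hat{\mathbf{P}}$ is calculated by minimizing the L2 distance between $\mathbf{F}_\text{inp}$ and $\mathbf{F}_\text{rend}$ with iterative Levenberg-Marquardt updates, where $\mathbf{e}$ is the feature residual, $\mathbf{J}$ the Jacobian with respect to the pose, $\lambda$ the damping factor, and $\mathbf{I}$ the identity matrix (not the input image):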
\[ \begin{cases}
\Delta \mathbf{P} & = (\mathbf{J}^\top(\mathbf{e})\mathbf{J} + \lambda \mathbf{I})^{-1}\mathbf{J}^\top(\mathbf{e})\mathbf{e} \\
\mathbf{P}_{i+1} & = \mathbf{P}_i + \Delta\mathbf{P}
\end{cases} \]
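The damped update above can be illustrated with a short NumPy sketch. It is not the RePOSE implementation: the pose is treated as a generic parameter vector rather than a 6-DoF rotation and translation, and the sketch assumes the sign convention $\mathbf{e} = \text{target} - \text{prediction}$ with $\mathbf{J}$ the Jacobian of the prediction, so that adding $\Delta\mathbf{P}$ reduces the residual. The names `lm_step`, `refine`, `residual_fn`, `jacobian_fn`, and `n_iters` are hypothetical.

```python
import numpy as np

def lm_step(J, e, lam):
    """One damped update: (J^T J + lam * I)^{-1} J^T e."""
    H = J.T @ J + lam * np.eye(J.shape[1])   # damped Gauss-Newton matrix
    return np.linalg.solve(H, J.T @ e)       # solve instead of explicit inverse

def refine(residual_fn, jacobian_fn, p0, lam=1e-3, n_iters=5):
    """Apply P_{i+1} = P_i + delta_P for a fixed number of iterations."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iters):
        e = residual_fn(p)                   # assumed: target - prediction(p)
        J = jacobian_fn(p)                   # assumed: d prediction / d p
        p = p + lm_step(J, e, lam)
    return p

# Toy problem: recover p_true from a linear "renderer" prediction(p) = A p.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
p_true = np.array([1.0, -2.0])
y = A @ p_true
p_hat = refine(lambda p: y - A @ p, lambda p: A, np.zeros(2), n_iters=20)
```

Because every operation in `lm_step` is a differentiable linear-algebra primitive, the same structure can be written in an autodiff framework so that gradients flow through the refinement, which is what makes the optimization trainable end to end.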
The loss function includes the ADD(-S) score
\[ \mathcal{L}_\text{ADD(-S)} = S_\text{ADD(-S)}(\mathbf{P}, \mathbf{P}_\text{gt}) \]
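and a term that ensures the objective is minimized when the pose $\mathbf{P}$ is equal to $\mathbf{P}_\text{gt}$, where $\Psi(\mathbf{P}_\text{gt}, \mathcal{M})$ denotes the features rendered from $\mathcal{M}$ at the ground-truth pose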
\[ \mathcal{L}_\text{diff} = \sum_k \lVert \mathbf{F}_\text{inp} - \Psi(\mathbf{P}_\text{gt}, \mathcal{M}) \rVert^2 \]
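For reference, the ADD and ADD-S scores used in the first loss term can be computed as in the sketch below. It follows the commonly used definitions (mean distance between corresponding model points for ADD, mean closest-point distance for ADD-S on symmetric objects); how the source weights or combines the score with $\mathcal{L}_\text{diff}$ is not stated here, so that part is left out. Function names are illustrative.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid pose to N x 3 points (row-vector convention)."""
    return points @ R.T + t.reshape(1, 3)

def add_score(points, R, t, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    diff = transform(points, R, t) - transform(points, R_gt, t_gt)
    return np.linalg.norm(diff, axis=1).mean()

def add_s_score(points, R, t, R_gt, t_gt):
    """ADD-S: for each ground-truth point, distance to the closest predicted point."""
    pred = transform(points, R, t)
    gt = transform(points, R_gt, t_gt)
    # Pairwise distances, shape (N_gt, N_pred); fine for small point sets.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Toy example: a 4-fold symmetric point set rotated by 90 degrees about z.
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)
add_val = add_score(pts, Rz90, zero, np.eye(3), zero)
add_s_val = add_s_score(pts, Rz90, zero, np.eye(3), zero)
```

In the toy example the rotation maps the point set onto itself, so ADD-S is zero while ADD still penalizes the pose, which is why the symmetric variant is used for symmetric objects.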