RePOSE

Feb 2022

Wufei Ma
Purdue University

Abstract

Paper reading notes for RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering [1].

In this work, the authors present RePOSE, a fast iterative refinement method for 6D object pose estimation. RePOSE uses image rendering for fast feature extraction, rendering a 3D model with a learnable texture. The deep texture rendering uses a shallow multi-layer perceptron to directly regress a view-invariant image representation of an object. The pose is refined with differentiable Levenberg-Marquardt (LM) optimization, which minimizes the distance between the input and rendered image representations, and the authors show that the LM algorithm converges within a few iterations. RePOSE runs at 92 FPS and achieves state-of-the-art performance on the Occlusion LineMOD dataset.

RePOSE

Given an input image $\mathbf{I}$, RePOSE extracts a feature map $\mathbf{F}_\text{inp}$ from $\mathbf{I}$ using a CNN $\Phi$. The initial pose estimate $\mathbf{P}_\text{ini} = \Omega(\mathbf{I})$, where $\Omega$ is any pose estimation method such as PVNet or PoseCNN, is then refined in real time using differentiable LM optimization. At each iteration, RePOSE renders the template 3D model with learnable deep textures in the current pose $\mathbf{P}$ to obtain features $\mathbf{F}_\text{rend}$. The pose is refined by minimizing the distance between $\mathbf{F}_\text{inp}$ and $\mathbf{F}_\text{rend}$.

Feature extraction. A U-Net architecture is used for the CNN $\Phi$, and its decoder outputs a deep feature map $\mathbf{F}_\text{inp} \in \mathbb{R}^{w \times h \times d}$, that is, a $d$-dimensional feature for every pixel in $\mathbf{I}$. The authors found $d=3$ to be optimal.

Template 3D Model Rendering. The template 3D model $\mathcal{M}$ with pose $\mathbf{P} = \{ \mathbf{R}, \mathbf{t} \}$ is projected to 2D to render the feature $\mathbf{F}_\text{rend}$. Let the template 3D model $\mathcal{M} = \{ \mathcal{V}, \mathcal{C}, \mathcal{F}\}$ be represented by a watertight triangular mesh consisting of $N$ vertices $\mathcal{V} = \{V_n\}_{n=1}^N$ where $V_n \in \mathbb{R}^3$, faces $\mathcal{F}$, and per-vertex deep features $\mathcal{C} = \{C_n\}_{n=1}^N$ where $C_n \in \mathbb{R}^d$. Each vertex $V_n$ is mapped to $v_n \in \mathbb{R}^2$ using \[ v_n = \pi(V_n \mathbf{R}^\top + \mathbf{t}^\top) \] where $\pi$ is the projection onto the image plane. The rendered feature at pixel location $(x, y)$ is a linear combination of the deep features of the three vertices of the triangular face $n$ that covers the pixel \[ \mathbf{F}_\text{rend}(x, y) = \sum_{i=1}^3 w_n^i C_n^i \] where $C_n^i$ is the deep feature of the $i$-th vertex of that face and $w_n^i$ is its interpolation weight. The authors' custom implementation of the renderer takes less than 1 ms to render $\mathbf{F}_\text{rend}$.
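The two formulas above can be sketched for a single triangle in plain Python. This is only an illustration under stated assumptions, not the authors' renderer: $\pi$ is taken to be a pinhole projection with hypothetical intrinsics `K = (fx, fy, cx, cy)`, the weights $w_n^i$ are taken to be screen-space barycentric coordinates, and the helper names `project`, `barycentric`, and `render_pixel` are made up for this note.

```python
def project(V, R, t, K):
    """Pinhole projection of one 3D vertex: v = pi(V R^T + t^T).

    K = (fx, fy, cx, cy) is an assumed set of camera intrinsics.
    """
    # Rotate and translate the vertex into the camera frame.
    Xc = [sum(R[i][j] * V[j] for j in range(3)) + t[i] for i in range(3)]
    fx, fy, cx, cy = K
    # Perspective divide followed by the intrinsics.
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def barycentric(p, a, b, c):
    """Barycentric weights of the 2D point p in the triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return (w0, w1, 1.0 - w0 - w1)


def render_pixel(p, tri_2d, tri_feats):
    """F_rend(x, y) = sum_i w^i C^i for one face, or None if p lies outside it."""
    w = barycentric(p, *tri_2d)
    if min(w) < 0.0:
        return None  # the pixel is not covered by this face
    d = len(tri_feats[0])
    return [sum(w[i] * tri_feats[i][k] for i in range(3)) for k in range(d)]


# One face with d = 3 deep features per vertex, seen from 2 units away.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
K = (100.0, 100.0, 50.0, 50.0)
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
feats = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

tri_2d = [project(V, R, t, K) for V in verts]
feat = render_pixel((60.0, 70.0), tri_2d, feats)
```

A full renderer would also rasterize every face and resolve visibility with a depth test; both are omitted here to keep the interpolation step visible.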

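LM optimization. The optimal pose $\hat{\mathbf{P}}$ is calculated by minimizing the L2 distance between $\mathbf{F}_\text{inp}$ and $\mathbf{F}_\text{rend}$ with iterative LM updates \[ \begin{cases} \Delta \mathbf{P} & = (\mathbf{J}^\top\mathbf{J} + \lambda \mathbf{I})^{-1}\mathbf{J}^\top\mathbf{e} \\ \mathbf{P}_{i+1} & = \mathbf{P}_i + \Delta\mathbf{P} \end{cases} \] where $\mathbf{e}$ is the residual between $\mathbf{F}_\text{inp}$ and $\mathbf{F}_\text{rend}$, $\mathbf{J}$ is the Jacobian with respect to the pose, $\lambda$ is the damping factor, and $\mathbf{I}$ here denotes the identity matrix rather than the input image. The training loss includes the ADD(-S) score \[ \mathcal{L}_\text{ADD(-S)} = S_\text{ADD(-S)}(\mathbf{P}, \mathbf{P}_\text{gt}) \] and a term that ensures the objective is minimized when the pose $\mathbf{P}$ is equal to $\mathbf{P}_\text{gt}$ \[ \mathcal{L}_\text{diff} = \sum_k \lVert \mathbf{F}_\text{inp} - \Psi(\mathbf{P}_\text{gt}, \mathcal{M}) \rVert^2 \] where $\Psi(\mathbf{P}_\text{gt}, \mathcal{M})$ denotes the feature rendered from $\mathcal{M}$ at the ground-truth pose.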
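To make the update rule concrete, here is a minimal sketch of the damped step on a toy problem, not the RePOSE pipeline itself. The assumptions are loud ones: the "pose" is reduced to a 2D translation, the renderer is replaced by a hypothetical Gaussian blob function `render`, the residual is taken as $\mathbf{e} = \mathbf{F}_\text{inp} - \mathbf{F}_\text{rend}$, and $\mathbf{J}$ is taken as the Jacobian of the rendered feature with respect to the pose, which is the convention under which the plus sign in $\mathbf{P}_{i+1} = \mathbf{P}_i + \Delta\mathbf{P}$ decreases the error. The damping factor is also kept fixed, whereas a full LM implementation would adapt it.

```python
import math


def render(P, pixels, sigma=3.0):
    """Toy stand-in for the renderer: a Gaussian blob centred at translation P."""
    tx, ty = P
    return [math.exp(-((x - tx) ** 2 + (y - ty) ** 2) / (2.0 * sigma ** 2))
            for x, y in pixels]


def jacobian(P, pixels, sigma=3.0):
    """Analytic d(render)/dP with one row [d/dtx, d/dty] per pixel."""
    tx, ty = P
    F = render(P, pixels, sigma)
    return [[f * (x - tx) / sigma ** 2, f * (y - ty) / sigma ** 2]
            for f, (x, y) in zip(F, pixels)]


def lm_step(P, F_inp, pixels, lam=1e-3):
    """One update P + dP with dP = (J^T J + lam * I)^-1 J^T e."""
    F_rend = render(P, pixels)
    J = jacobian(P, pixels)
    e = [fi - fr for fi, fr in zip(F_inp, F_rend)]  # e = F_inp - F_rend
    # Entries of the damped 2x2 normal matrix J^T J + lam * I.
    a = sum(j[0] * j[0] for j in J) + lam
    b = sum(j[0] * j[1] for j in J)
    d = sum(j[1] * j[1] for j in J) + lam
    # Right-hand side J^T e.
    g0 = sum(j[0] * r for j, r in zip(J, e))
    g1 = sum(j[1] * r for j, r in zip(J, e))
    # Closed-form solve of the 2x2 system.
    det = a * d - b * b
    dP = ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)
    return (P[0] + dP[0], P[1] + dP[1])


pixels = [(x, y) for x in range(16) for y in range(16)]
P_gt = (8.0, 7.0)
F_inp = render(P_gt, pixels)  # plays the role of the CNN feature map

P = (6.5, 8.0)  # plays the role of the initial estimate P_ini
for _ in range(5):
    P = lm_step(P, F_inp, pixels)
```

In this toy setting the estimate lands close to `P_gt` after a handful of steps, which mirrors the few-iteration convergence reported in the notes. A real 6D pose would use a six-parameter update and a general linear solver instead of the closed-form 2x2 solve.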

Results

Qualitative results of RePOSE on the Occlusion LineMOD dataset.

Quantitative results of RePOSE on the LineMOD dataset.

References

[1] S. Iwase, X. Liu, R. Khirodkar, R. Yokota, K. Kitani. RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering. In ICCV, 2021.