Normalized Object Coordinate Space (NOCS)
NOCS. Category-level 6D object pose and size estimation predicts a tight oriented bounding box around an object. Since no exact CAD models are available for category-level tasks, the first challenge is to find a representation that defines 6D pose and size consistently across different object instances. NOCS is defined as a 3D space contained within a unit cube, $(x, y, z) \in [0, 1]^3$. The known CAD models of each category are normalized such that the diagonal of each model's tight bounding box has length 1 and the model is centered in the cube. Object centers and orientations are also aligned consistently within a category. The CNN then predicts the 2D perspective projection of the color-coded NOCS coordinates, i.e., a NOCS map.
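The normalization above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code; the function name `to_nocs` and the toy box model are invented for the example.

```python
import numpy as np

def to_nocs(points):
    """Map a CAD model point cloud into NOCS: center it, scale so the
    tight bounding-box diagonal has length 1, then shift into [0, 1]^3."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    center = (mins + maxs) / 2.0
    diagonal = np.linalg.norm(maxs - mins)   # tight bbox diagonal
    normalized = (points - center) / diagonal  # diagonal length becomes 1
    return normalized + 0.5                  # [-0.5, 0.5]^3 -> [0, 1]^3

# a toy "model": the 8 corners of a 2 x 1 x 1 box
pts = np.array([[x, y, z] for x in (0.0, 2.0)
                          for y in (0.0, 1.0)
                          for z in (0.0, 1.0)])
nocs = to_nocs(pts)
```

After normalization every coordinate lies in the unit cube and the bounding-box diagonal is exactly 1, so objects of any metric size share the same canonical space.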
Method. The proposed method combines a Mask R-CNN that estimates the class label, instance mask, and NOCS map from the RGB image with a pose-fitting algorithm. Three heads are added to the Mask R-CNN architecture to predict the $x$, $y$, and $z$ components of the NOCS map. For the NOCS heads, the coordinate prediction can be treated as classification over discretized values with a standard softmax loss, or as direct regression, where a soft L1 loss is used to make learning more robust.
\[ \mathcal{L}(\mathbf{y}, \mathbf{y}^*) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 5\,(y_i - y_i^*)^2 & \lvert y_i - y_i^*\rvert \leq 0.1 \\ \lvert y_i - y_i^*\rvert - 0.05 & \lvert y_i - y_i^*\rvert > 0.1 \end{cases} \]
where $\mathbf{y}$ is the predicted NOCS map, $\mathbf{y}^*$ the ground truth, and $n$ the number of pixels.
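The soft L1 loss above is straightforward to implement; note that the two branches agree at $\lvert y_i - y_i^*\rvert = 0.1$ (both give $0.05$), so the loss is continuous. A minimal numpy sketch:

```python
import numpy as np

def soft_l1(y, y_star):
    """Soft L1 loss from the equation above, averaged over the n values:
    quadratic near zero, linear for large residuals."""
    diff = np.abs(y - y_star)
    per_elem = np.where(diff <= 0.1, 5.0 * diff**2, diff - 0.05)
    return per_elem.mean()
```

The quadratic region gives smooth gradients for small residuals, while the linear region keeps large outliers from dominating the loss.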
For object categories that are symmetric about an axis, e.g., bottles, we can define an axis of symmetry, generate $\lvert \theta \rvert$ copies of the ground truth rotated about that axis, and extend the loss to $\mathcal{L}_\text{s} = \min_{i=1, \dots, \lvert \theta \rvert} \mathcal{L}(\mathbf{y}, \mathbf{y}^*_{\theta_i})$, so the network is not penalized for predicting any of the visually indistinguishable orientations. In practice, $\lvert \theta \rvert \leq 6$ is often sufficient.
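The min-over-rotations loss can be sketched as follows. This is an illustrative assumption-laden version: it takes $y$ as the symmetry axis, rotates the ground-truth NOCS coordinates about the cube center $(0.5, 0.5, 0.5)$, and uses a plain per-element L1 as the base loss; none of these choices are mandated by the text above.

```python
import numpy as np

def l1(y, y_star):
    # simple per-element L1 base loss (stand-in for the soft L1)
    return np.abs(y - y_star).mean()

def rot_y(theta):
    # rotation about the y axis, assumed here to be the symmetry axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def symmetric_loss(y, y_star, base_loss=l1, n_sym=6):
    """min over n_sym rotated copies of the ground-truth NOCS coordinates,
    rotated about the symmetry axis through the cube center (0.5, 0.5, 0.5)."""
    center = np.full(3, 0.5)
    losses = []
    for i in range(n_sym):
        R = rot_y(2.0 * np.pi * i / n_sym)
        y_star_rot = (y_star - center) @ R.T + center
        losses.append(base_loss(y, y_star_rot))
    return min(losses)
```

A prediction that matches the ground truth up to one of the sampled rotations then incurs (near-)zero loss, even though the plain base loss would be large.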
Pose fitting. The goal is to estimate the full metric 6D pose and dimensions of the detected objects. We first obtain a 3D point cloud $P_m$ of the object by back-projecting the depth map within the predicted object mask. The NOCS map likewise yields a 3D representation $P_n$ in the normalized space. We then estimate the scale, rotation, and translation that transform $P_n$ to $P_m$. The Umeyama algorithm, combined with RANSAC for outlier removal, is used for this 7-dimensional similarity transform estimation (3 for rotation, 3 for translation, 1 for scale).
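The core of the fitting step, the Umeyama closed-form similarity transform, can be sketched as below. This is a minimal version without the RANSAC outlier loop; it recovers $s$, $R$, $t$ such that $P_m \approx s\,R\,P_n + t$ from corresponding point pairs.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (Umeyama, 1991) mapping
    src (e.g. NOCS points P_n) to dst (e.g. metric points P_m):
    dst_i ~= s * R @ src_i + t. Sketch without RANSAC."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)        # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # keep R a proper rotation
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src  # optimal isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In the full pipeline one would wrap this in RANSAC: repeatedly fit on random minimal subsets of the $P_n \leftrightarrow P_m$ correspondences and keep the transform with the most inliers, which guards against errors in the predicted NOCS map.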