ImageNet3D

Towards General-Purpose Object-Level 3D Understanding

Authors

Wufei Ma, Guanning Zeng, Qihao Liu, Letian Zhang, Adam Kortylewski, Yaoyao Liu, Alan Yuille

Affiliations

Johns Hopkins University, Tsinghua University, Tongji University, University of Freiburg, Max Planck Institute for Informatics


We present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box annotations and 6D pose annotations, as well as human assessment of each object's quality. Moreover, for each category in ImageNet3D, we provide detailed text descriptions and template CAD models.
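To make the annotation format concrete, below is a minimal sketch of what a single object annotation might contain. The field names and value conventions are illustrative assumptions, not the released schema; please refer to the GitHub repo for the authoritative format.

```python
# Hypothetical example of a single ImageNet3D object annotation.
# Field names and value conventions are illustrative only; see the GitHub repo for the real schema.
annotation = {
    "image_id": "n02690373_1234",      # ImageNet-style image identifier (assumed)
    "category": "airplane",            # one of the 200 annotated categories
    "bbox": [48, 32, 412, 280],        # 2D bounding box, [x_min, y_min, x_max, y_max] (assumed layout)
    "pose": {                          # 6D pose: 3D viewpoint plus 3D location (assumed layout)
        "azimuth": 1.57, "elevation": 0.12, "theta": 0.03,   # viewpoint angles in radians
        "distance": 5.4, "px": 230.0, "py": 156.0,           # object distance and image-plane offset
    },
    "quality": "good",                 # human-assessed object quality label (assumed values)
}
```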

With ImageNet3D, we aim to develop and evaluate visual foundation models for object-level 3D understanding, ranging from fully-supervised pose estimators to self-supervised models with 3D awareness. These models can be further integrated into multi-modal large language models (MLLMs) to address complex VQA problems that require 3D reasoning.

In ImageNet3D, we focus on the following tasks:

  1. Category-level pose estimation. We train and evaluate category-level models on ImageNet3D for 3D and 6D pose estimation.
  2. Open-vocabulary pose estimation. We train an open-vocabulary pose estimation model on common categories in ImageNet3D, and evaluate the model on unseen categories.
  3. Linear probing of object-level 3D awareness. We run linear probing of the 3D awareness of visual foundation models (e.g., CLIP, DINO, and Stable Diffusion), and evaluate how these pretrained visual encoders embed object-level 3D information.

Data Collection

We recruited 40 annotators to annotate 2D bounding boxes, 6D poses, and object quality labels for each object in the images. We further employed an onboarding stage in which annotators were required to attend tutorials and pass sample tests before annotating the data.

Annotation tool. We developed a web-based app so that annotators can easily access the annotation tool without installing software on their local machines.

Ethics. We obtained IRB approvals before starting our data collection process, and we informed each annotator about the nature of our study, the purpose of the collected data, and potential risks.

Tasks

Standard pose estimation. In this setting we provide annotated training and validation data for each of the 200 categories. Depending on the amount of training data used, we consider pose estimation in the fully-supervised, few-shot, and zero-shot settings.

Open-vocabulary pose estimation. We study the open-vocabulary generalization ability of pose estimation models. Specifically, we split the 200 categories into "seen" and "unseen" categories. "Seen" categories are often more common in daily life but share similar topological structures with "unseen" categories. Moreover, open-vocabulary pose estimation models may utilize the provided text descriptions of novel categories, as well as large pretrained vision models such as Stable Diffusion and DINO.

Linear probing of object-level 3D awareness. Following the ImageNet linear probing setting, we add a linear classifier on top of frozen feature encoders pretrained with self-supervised objectives. We expect that visual encoders with better object-level 3D probing performance can be further integrated into large vision-language models (LVLMs) to address challenging VQA problems that require 3D reasoning.
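A minimal PyTorch-style sketch of this probing protocol is shown below, assuming object-level features have already been extracted from a frozen encoder; the tensors `feats` and `labels` are placeholders standing in for the precomputed features and the discretized viewpoint labels.

```python
import torch
import torch.nn as nn

# Placeholder inputs: features from a frozen, pretrained encoder (one vector per object)
# and discretized viewpoint labels (assumed 40 bins, matching the classification formulation).
feats = torch.randn(1024, 768)            # (num_objects, feat_dim), stands in for real features
labels = torch.randint(0, 40, (1024,))    # viewpoint-bin labels

probe = nn.Linear(768, 40)                # single linear layer; the encoder itself stays frozen
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(probe(feats), labels)   # only the probe's weights receive gradients
    loss.backward()
    optimizer.step()
```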

Baseline Results

Standard pose estimation. For 3D pose estimation, we evaluate a common baseline model, ResNet50-General, which formulates pose estimation as a classification problem. Following previous works, we use 40 bins for each of the 3D viewpoint parameters.
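As a rough illustration of the classification formulation, the sketch below discretizes a continuous viewpoint angle into 40 bins and recovers a continuous angle from a predicted bin; the uniform bin layout and angle convention are assumptions and may differ from the released training code.

```python
import numpy as np

NUM_BINS = 40

def angle_to_bin(angle_rad: float) -> int:
    """Map an angle in [0, 2*pi) to one of NUM_BINS classification bins."""
    angle_rad = angle_rad % (2 * np.pi)              # wrap into [0, 2*pi)
    return int(angle_rad / (2 * np.pi) * NUM_BINS)   # uniform bin width of 2*pi / NUM_BINS

def bin_to_angle(bin_idx: int) -> float:
    """Return the bin-center angle, used to recover a continuous prediction."""
    return (bin_idx + 0.5) * 2 * np.pi / NUM_BINS

# Example: an azimuth of 1.57 rad falls into bin 9 (out of 40).
print(angle_to_bin(1.57), bin_to_angle(angle_to_bin(1.57)))
```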

Open-vocabulary pose estimation. We train ResNet50-General on training data from the seen categories, and evaluate the model on validation data from seen and unseen categories.

Linear probing of object-level 3D awareness. Given features obtained from the frozen encoders of DINO, MAE, Stable Diffusion, etc., we train a linear classifier that predicts the 3D viewpoint of the object in the image. This allows us to investigate the object-level 3D awareness of these large pretrained vision encoders.
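The sketch below shows one way to extract such frozen features with a publicly available DINOv2 checkpoint from the Hugging Face transformers library; the specific checkpoint and the choice of the CLS token as the object-level feature are assumptions for illustration, not necessarily the configuration used in our experiments.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load a frozen, publicly available DINOv2 encoder (assumed checkpoint, for illustration only).
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
encoder = AutoModel.from_pretrained("facebook/dinov2-base").eval()

image = Image.new("RGB", (224, 224))           # placeholder for an object crop from ImageNet3D

with torch.no_grad():                          # the encoder is never fine-tuned
    inputs = processor(images=image, return_tensors="pt")
    feats = encoder(**inputs).last_hidden_state[:, 0]   # CLS token as the object-level feature

# feats can now be fed to the linear probe described above.
print(feats.shape)                             # (1, 768) for the base model
```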

Accessing ImageNet3D

We release our ImageNet3D dataset on Hugging Face. For detailed documentation and sample data preprocessing code, please refer to our GitHub repo.
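For example, the dataset files can be downloaded programmatically with the huggingface_hub client; the repository id below is a placeholder, and the actual id is listed on the Hugging Face dataset page and in the GitHub repo.

```python
from huggingface_hub import snapshot_download

# Download the dataset files from the Hugging Face Hub.
# The repository id below is a placeholder; use the id given on the dataset page / GitHub repo.
local_dir = snapshot_download(
    repo_id="<org>/<imagenet3d-repo>",   # placeholder, not the actual repository id
    repo_type="dataset",
)
print("Dataset downloaded to:", local_dir)
```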

BibTeX

TBD