DINeMo: Learning Neural Mesh Models with no 3D Annotations

Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille

2025-03-27

Summary

This paper presents a way to train models that estimate the 3D pose of objects without requiring any 3D annotations.

What's the problem?

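Neural mesh models handle partial occlusion and domain shifts well, but training them has depended on 3D annotations, which take a lot of time and effort to collect. That dependence confines them to a narrow set of object categories and makes them hard to scale.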

What's the solution?

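The researchers developed DINeMo, a neural mesh model trained with no 3D annotations. In place of human labels, it uses pseudo-correspondence obtained from large visual foundation models, generated by a bidirectional method that draws on both local appearance features and global context information.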

Why does it matter?

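On car datasets, DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin and narrows the gap with fully supervised methods by 67.3%. It also keeps improving as more unlabeled images are added, which could make it easier to train AI models to understand the 3D world for applications like robotics and self-driving cars.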

Abstract

Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite their greatly enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence using both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin, narrowing the gap with fully supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrates its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at https://analysis-by-synthesis.github.io/DINeMo/.
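
To make the idea of feature-based pseudo-correspondence more concrete, here is a minimal sketch that matches image patch features to mesh vertex features by cosine similarity and keeps only matches that agree in both directions. This is a generic mutual nearest-neighbor filter used as a stand-in. It is not the paper's bidirectional generation method, whose details the abstract does not give. The function name, the array shapes, and the toy feature values are all assumptions made for illustration.

```python
import numpy as np


def mutual_nearest_neighbors(patch_feats, vertex_feats):
    """Return (patch_idx, vertex_idx) pairs that are each other's best match.

    Hypothetical helper for illustration only; not DINeMo's actual method.
    """
    # L2-normalize rows so that dot products become cosine similarities.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    v = vertex_feats / np.linalg.norm(vertex_feats, axis=1, keepdims=True)
    sim = p @ v.T  # shape: (num_patches, num_vertices)

    best_vertex = sim.argmax(axis=1)  # best vertex for each patch
    best_patch = sim.argmax(axis=0)   # best patch for each vertex

    # Keep a pair only if the match agrees in both directions.
    return [(i, int(j)) for i, j in enumerate(best_vertex) if best_patch[j] == i]


# Toy 2-D features (made-up values): three image patches, two mesh vertices.
patches = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
vertices = np.array([[1.0, 0.1], [0.1, 1.0]])
matches = mutual_nearest_neighbors(patches, vertices)
print(matches)
```

In this toy example, patches 0 and 2 both prefer vertex 0, but vertex 0 prefers patch 2, so patch 0 is discarded as an ambiguous match. A filter that uses only local appearance in this way cannot resolve every ambiguity, which is one reason a method might also draw on global context information, as the abstract describes.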
