
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus

2024-08-01


Summary

This paper presents NeRF-MAE, a new method for teaching computers to understand 3D scenes better using a technique called masked autoencoders. It focuses on helping neural networks learn detailed 3D representations from nothing more than 2D images and their camera poses.

What's the problem?

Many current methods for understanding 3D scenes rely heavily on labeled 3D data and struggle to learn effectively from just 2D images. Existing self-supervised pretraining techniques are also designed for other 3D formats, such as point clouds, rather than neural radiance fields, so important scene details are lost when moving between 2D images and 3D representations. This leads to poor performance on tasks like recognizing objects in 3D space.

What's the solution?

To address these issues, the authors developed NeRF-MAE, which applies masked autoencoders to an explicit 3D grid of color and density extracted from a NeRF, which itself is built from posed 2D images. By masking random 3D patches of this grid and having a 3D Swin Transformer predict what's missing, the model learns the semantic and spatial structure of the entire scene without needing labeled data. The authors pretrained this representation on curated posed-RGB data totaling over 1.8 million images, and the resulting encoder transfers well to tasks like 3D object detection and scene understanding.
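The pretraining step can be pictured as a standard masked-autoencoder loop applied to a voxelized NeRF. The sketch below is a minimal illustration, not the authors' implementation: the `patchify_grid` and `random_mask` helpers, the 4-channel toy grid, and the 75% mask ratio are assumptions chosen for clarity, and a placeholder tensor stands in for the 3D Swin Transformer's reconstruction.

```python
import torch

def patchify_grid(grid, patch=4):
    """Split a (C, D, H, W) radiance-and-density grid into non-overlapping 3D patches."""
    C, D, H, W = grid.shape
    g = grid.reshape(C, D // patch, patch, H // patch, patch, W // patch, patch)
    # -> (num_patches, C * patch^3)
    return g.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * patch ** 3)

def random_mask(num_patches, mask_ratio=0.75):
    """Pick which patches are hidden from the encoder (True = masked)."""
    num_masked = int(mask_ratio * num_patches)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

# Toy example: a 4-channel (RGB + density) grid at resolution 32^3.
grid = torch.rand(4, 32, 32, 32)
patches = patchify_grid(grid, patch=4)      # (512, 256)
mask = random_mask(patches.shape[0], 0.75)

visible = patches[~mask]                    # only visible patches would feed the encoder
pred = torch.zeros_like(patches)            # placeholder for the transformer's reconstruction
loss = ((pred[mask] - patches[mask]) ** 2).mean()  # loss is computed on masked patches only
print(visible.shape, loss.item())
```

As in 2D masked autoencoders, computing the reconstruction loss only on masked patches forces the network to infer missing geometry and appearance from surrounding context rather than simply copying its input.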

Why it matters?

This research is important because it significantly enhances how computers can learn about and interpret 3D environments from simpler 2D images. By improving the ability of neural networks to understand complex scenes, NeRF-MAE has the potential to advance applications in areas like robotics, virtual reality, and computer graphics, making interactions with digital environments more realistic and effective.

Abstract

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
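To make the abstract's "explicit representation" step concrete, here is a hedged sketch of sampling an implicit NeRF onto a regular radiance-and-density grid that a 3D transformer can consume. The `nerf_to_grid` helper, the `query_nerf` callable, the cubic bounds, and the resolution are hypothetical stand-ins; the paper additionally uses the camera trajectory to canonicalize scenes across domains, which is not shown here.

```python
import torch

def nerf_to_grid(query_nerf, bounds, resolution=64, chunk=65536):
    """Sample an implicit NeRF into an explicit (4, R, R, R) radiance-and-density grid.

    query_nerf: callable mapping (N, 3) world coordinates -> (N, 4) RGB + density.
                This is a stand-in for a trained NeRF.
    bounds:     (min_xyz, max_xyz) tensors of shape (3,) giving the scene box to sample.
    """
    lo, hi = bounds
    axes = [torch.linspace(lo[i], hi[i], resolution) for i in range(3)]
    xyz = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)

    outs = []
    for i in range(0, xyz.shape[0], chunk):       # query in chunks to bound memory use
        outs.append(query_nerf(xyz[i:i + chunk]))
    grid = torch.cat(outs, dim=0)                 # (R^3, 4)
    return grid.reshape(resolution, resolution, resolution, 4).permute(3, 0, 1, 2)

# Dummy NeRF that returns random RGB + density, just to show the interface.
dummy_nerf = lambda x: torch.rand(x.shape[0], 4)
grid = nerf_to_grid(dummy_nerf, (torch.zeros(3), torch.ones(3)), resolution=32)
print(grid.shape)  # torch.Size([4, 32, 32, 32])
```

The resulting dense, regular grid is what makes standard 3D transformer machinery applicable, in contrast to irregular representations such as point clouds mentioned in the abstract.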