Utonia: Toward One Encoder for All Point Clouds
Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
2026-03-04
Summary
This paper introduces Utonia, a single AI model designed to understand and work with 3D point cloud data from many different sources, such as self-driving car sensors, indoor scans from phones, and even 3D models created by designers.
What's the problem?
Currently, AI models for understanding 3D data are usually trained on data from *one* specific source. This means a model trained on data from a self-driving car won't necessarily understand a scan of your living room, even though both are 3D point clouds. Different sensors and data types create different kinds of point clouds with varying qualities and characteristics, making it hard to build a universal system.
What's the solution?
The researchers created Utonia, a 'point transformer' (a type of neural network) trained on a huge collection of diverse 3D data all at once. By training it on all of these different types of point clouds together, Utonia learns a common way to represent 3D information that works across every domain. This lets it transfer knowledge between them, improving performance in each area and even unlocking new abilities that appear only when the data is combined.
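To make the idea of one encoder serving many domains concrete, here is a minimal sketch (not the paper's actual pipeline, whose details are not given here): heterogeneous point clouds are first normalized into a canonical form, then passed through a single shared encoder so that clouds from very different sensors land in the same embedding space. The function names, the tiny random-projection "encoder", and all numbers are hypothetical stand-ins for Utonia's real point transformer.

```python
import numpy as np

def normalize_cloud(points, n_samples=1024, rng=None):
    """Map a raw point cloud (N, 3) from any domain into a canonical
    form: centered, scaled to the unit sphere, resampled to a fixed
    point count. (Hypothetical preprocessing, for illustration only.)"""
    rng = rng or np.random.default_rng(0)
    pts = points - points.mean(axis=0)            # center at origin
    scale = np.linalg.norm(pts, axis=1).max()
    pts = pts / max(scale, 1e-8)                  # fit in unit sphere
    idx = rng.choice(len(pts), n_samples, replace=len(pts) < n_samples)
    return pts[idx]

# Stand-in "shared encoder": in Utonia this would be a self-supervised
# point transformer; here a fixed random projection plus max-pooling
# illustrates that one set of weights serves every domain.
rng = np.random.default_rng(42)
W = rng.standard_normal((3, 64))

def encode(points):
    feats = np.tanh(normalize_cloud(points) @ W)  # per-point features
    return feats.max(axis=0)                      # global embedding

# Point clouds mimicking two very different domains:
lidar_sweep = rng.standard_normal((50_000, 3)) * 30.0  # sparse, large-scale
cad_object = rng.standard_normal((2_000, 3)) * 0.1     # dense, small-scale

z_lidar = encode(lidar_sweep)
z_cad = encode(cad_object)
print(z_lidar.shape, z_cad.shape)  # both live in the same 64-d space
```

The normalization step is what absorbs the differences in sensing geometry, density, and scale that the problem section describes; the shared weights are what allow knowledge learned from one domain to benefit the others.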
Why it matters?
This work is important because it's a step towards creating 'foundation models' for 3D data, similar to how large language models like ChatGPT work with text. A universal 3D understanding model could greatly improve technologies like augmented and virtual reality, robotics, and self-driving cars by allowing them to share and leverage information more effectively.
Abstract
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.