Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
Chi Yan, Dan Xu
2025-10-13
Summary
This paper introduces a new method, PG-Occ, for predicting the 3D occupancy of a scene from camera images, aimed at vision-based self-driving cars, and supporting open-vocabulary text queries about the scene.
What's the problem?
Current 3D scene modeling techniques face a trade-off. A sparse, fast representation (a few coarse "blobs", i.e. 3D Gaussians) misses small objects, while a dense, detailed representation requires substantial computing power. Existing methods also struggle to understand scenes from open-ended text descriptions; they are usually limited to recognizing a fixed set of object categories.
What's the solution?
PG-Occ solves this by progressively adding detail to the 3D model. It starts with a coarse set of 3D Gaussians and, in a feed-forward manner, gradually adds finer Gaussians where the scene needs them. It also adapts each Gaussian's receptive field (how much of the surrounding scene it "attends" to) based on the Gaussian's size and shape, and fuses information across time, so richer scene information is captured efficiently. Essentially, it builds up a detailed 3D understanding of a scene without overwhelming the computer.
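To make the idea of progressive densification with anisotropy-aware receptive fields concrete, here is a minimal toy sketch in Python. This is not the authors' implementation: the `Gaussian` class, the `densify` pass, and the `error_fn` callback are all hypothetical simplifications. The sketch splits high-error Gaussians along their longest axis into two smaller children, and sets each Gaussian's sampling radius from its largest axis scale (a crude stand-in for anisotropy-aware sampling).

```python
# Hypothetical toy sketch (not the paper's code): each Gaussian has a
# 3D center, per-axis scales (its anisotropy), and a simple receptive field.
class Gaussian:
    def __init__(self, center, scales):
        self.center = center  # (x, y, z)
        self.scales = scales  # per-axis extent; unequal values = anisotropic

    def receptive_field(self):
        # Anisotropy-aware: larger, more stretched Gaussians get a
        # wider sampling radius (here simply the largest axis scale).
        return max(self.scales)

def densify(gaussians, error_fn, threshold):
    """One progressive densification pass: split each Gaussian whose
    error (per the caller-supplied error_fn) exceeds the threshold
    into two half-sized children along its longest axis."""
    refined = []
    for g in gaussians:
        if error_fn(g) > threshold:
            axis = g.scales.index(max(g.scales))  # longest axis
            offset = [0.0, 0.0, 0.0]
            offset[axis] = g.scales[axis] / 2.0
            child_scales = [s / 2.0 for s in g.scales]
            for sign in (-1.0, 1.0):
                c = tuple(ci + sign * oi for ci, oi in zip(g.center, offset))
                refined.append(Gaussian(c, child_scales))
        else:
            refined.append(g)  # already fine enough; keep as-is
    return refined
```

Running `densify` repeatedly mimics the progressive refinement loop: coarse Gaussians survive in well-explained regions, while detailed regions accumulate many small Gaussians with correspondingly small receptive fields.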
Why it matters?
This work is important because it improves the ability of self-driving cars to understand the world around them. By creating more accurate and detailed 3D models, and by letting the car interpret scenes from natural-language queries, it makes autonomous driving safer and more versatile. The new method achieves state-of-the-art accuracy, with a relative 14.3% mIoU improvement over the previous best-performing approach.
Abstract
The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best-performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ