A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
2025-08-07
Summary
This paper presents a new way to help self-driving cars understand their surroundings by combining 3D sensor information with natural language descriptions. It introduces a model and a dataset that let cars recognize objects more precisely: not just by drawing boxes around them, but by identifying the exact 3D space each object occupies.
What's the problem?
Current methods for object grounding in autonomous driving usually simplify objects into 3D bounding boxes, which discards detail because not everything inside a box actually belongs to the object. This coarse representation limits how well the car can understand its environment and make safe decisions.
What's the solution?
The solution is a new benchmark and a model called GroundingOcc that combines images, 3D point clouds, and text descriptions to predict exactly which voxels of space are occupied by the described object. It works in a coarse-to-fine manner: first localizing the referred object roughly, then refining that estimate down to voxel-level occupancy, which improves both localization accuracy and object perception.
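To make the coarse-to-fine idea concrete, here is a minimal, hypothetical sketch (not the paper's actual GroundingOcc implementation): a coarse stage proposes a 3D box for the described object, and a fine stage keeps only the voxels inside that box whose fused multi-modal score clears a threshold. The function name, data layout, and threshold are all illustrative assumptions.

```python
# Hypothetical coarse-to-fine occupancy grounding sketch.
# scores: fused per-voxel confidence (e.g. from image + LiDAR + text features).
# box: a coarse 3D proposal for the referred object.

def coarse_to_fine_occupancy(scores, box, threshold=0.5):
    """scores: dict mapping (x, y, z) voxel index -> fused score in [0, 1].
    box: ((xmin, ymin, zmin), (xmax, ymax, zmax)) coarse proposal.
    Returns the set of voxel indices predicted as occupied."""
    (xmin, ymin, zmin), (xmax, ymax, zmax) = box
    occupied = set()
    for (x, y, z), score in scores.items():
        # Fine stage only considers voxels inside the coarse box.
        inside = xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
        if inside and score >= threshold:
            occupied.add((x, y, z))
    return occupied

# Example: only the high-scoring voxel inside the box survives.
scores = {(0, 0, 0): 0.9, (1, 1, 1): 0.2, (5, 5, 5): 0.95}
box = ((0, 0, 0), (2, 2, 2))
print(coarse_to_fine_occupancy(scores, box))  # {(0, 0, 0)}
```

The key point this illustrates is why voxel-level output is richer than a box alone: even after the coarse box is fixed, the fine stage can reject voxels inside it that do not belong to the object.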
Why it matters
More precise recognition of objects in 3D space helps autonomous vehicles make safer and smarter decisions on the road, improving how self-driving cars perceive and interact with complex outdoor environments.
Abstract
A benchmark and a model for 3D occupancy grounding, built on natural language descriptions and voxel-level annotations, improve object perception in autonomous driving.