3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence

Hao Tang, Ting Huang, Zeyu Zhang

2026-01-13

Summary

This paper introduces a new system called 3D CoCa v2, which is designed to automatically create descriptions for 3D scenes, like rooms or outdoor environments, using natural language.

What's the problem?

Describing 3D scenes with words is difficult for computers because the data representing these scenes, called point clouds, is sparse and irregular. Existing systems struggle to accurately ground the words they generate in specific objects in the scene, and they generalize poorly to completely new and different environments, like going from describing a kitchen to describing a street scene.

What's the solution?

The researchers created 3D CoCa v2, which combines a frozen, pre-trained image-and-text model (CLIP) with a spatially-aware 3D scene encoder, learning to link visual features to language and generate captions. Importantly, it improves its performance *without* retraining: at inference time it generates several candidate captions and uses a reward, guided by a compact summary of the scene, to select the best one. Because it doesn't rely on pre-defined object detectors or handcrafted proposals, it is more flexible across environments.

Why it matters?

This work is important because it makes 3D scene understanding more accessible. Better 3D captioning can help robots navigate and interact with the world, improve virtual reality experiences, and allow for more effective searching and organization of 3D data. The system’s ability to work well in new environments is a significant step towards more generally intelligent AI systems.

Abstract

Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.
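The abstract's "jointly optimized with contrastive and captioning objectives" can be illustrated with a toy combined loss. This is a sketch under assumptions: the InfoNCE-style contrastive term, the token-level cross-entropy captioning term, and the weighting `lam` are generic choices, not the paper's exact formulation.

```python
import math

# Hedged sketch of a joint objective: L = L_contrastive + lam * L_caption.
# The exact losses and weighting are assumptions, not the paper's formulation.

def info_nce(sims_row: list[float], positive_idx: int, temp: float = 0.07) -> float:
    """Contrastive loss for one scene/text pair: -log softmax over similarities."""
    logits = [s / temp for s in sims_row]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return -(logits[positive_idx] - log_z)

def caption_nll(token_probs: list[float]) -> float:
    """Captioning loss: mean negative log-likelihood of ground-truth tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def joint_loss(sims_row: list[float], positive_idx: int,
               token_probs: list[float], lam: float = 1.0) -> float:
    return info_nce(sims_row, positive_idx) + lam * caption_nll(token_probs)

# One scene with two candidate texts (index 0 is the matching one),
# and per-token probabilities the decoder assigned to the reference caption.
loss = joint_loss([0.9, 0.1], 0, [0.5, 0.25])
print(loss)
```

Training both terms against shared encoders is what lets a single model align 3D features with language (contrastive) while also generating captions (decoding), without an external detector.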