ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting
Yen-Jen Chiou, Wei-Tse Cheng, Yuan-Fu Yang
2026-01-09
Summary
This paper introduces ProFuse, a new system for understanding what's in a 3D scene represented with 3D Gaussian Splatting and labeling those objects with words from an open vocabulary, meaning it can recognize and name things it hasn't specifically been trained on or 'seen' before.
What's the problem?
Existing methods for understanding 3D scenes and attaching labels to objects within them are often slow and computationally expensive, especially for complex scenes and broad label vocabularies. They also struggle to keep labels consistent across different viewpoints and to ensure that all parts of a single object receive the same label.
What's the solution?
ProFuse speeds things up by first aligning the 3D scene using dense correspondences from multiple views, then clustering matched regions across views into 'proposals' for potential objects. Each proposal gets a single shared description, computed by combining the descriptions of the views in which it appears. That shared description is then attached to the Gaussians during the final 3D reconstruction, keeping labels consistent across viewpoints. Importantly, this requires no retraining of the core 3D scene representation and no extra optimization steps.
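The cross-view grouping step can be pictured as a simple connected-components problem: masks from different views that match (via dense correspondences) are merged into one proposal. The sketch below is a hypothetical illustration using union-find; the function name, inputs, and matching step are assumptions, not ProFuse's actual implementation.

```python
# Hypothetical sketch: group per-view object masks into cross-view
# "context proposals" with union-find over matched mask pairs.
# In ProFuse, matches would come from dense correspondences; here
# they are simply given as input pairs.

def build_proposals(num_masks, matches):
    """num_masks: total mask count across all views.
    matches: (i, j) pairs of mask indices judged to be the same
    object seen from different views."""
    parent = list(range(num_masks))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every matched pair into one group.
    for i, j in matches:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    # Collect masks by their root: each group is one proposal.
    groups = {}
    for m in range(num_masks):
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())

# Masks 0, 1, 3 are one object seen from three views; 2 and 4 another.
proposals = build_proposals(5, [(0, 1), (1, 3), (2, 4)])
```

Because the groups are formed once, up front, the later semantic-fusion step can simply look up which proposal each mask belongs to instead of re-optimizing associations.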
Why it matters?
ProFuse is significant because it makes open-vocabulary 3D scene understanding much faster (about twice as fast as the best existing methods) while maintaining high accuracy. This means detailed 3D models of environments can be built and their objects automatically labeled more quickly and efficiently, with applications in robotics, virtual reality, and other fields.
Abstract
We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, roughly twice as fast as the state of the art.
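The "weighted aggregation of member embeddings" mentioned above amounts to taking a weighted average of a proposal's per-view embeddings and normalizing the result. The sketch below illustrates that idea; the function name, the choice of weights (e.g. mask area or visibility), and the unit-norm convention are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Minimal sketch of weighted aggregation: one proposal's global
# language feature as a weighted average of its member (per-view)
# embeddings, L2-normalized. Weights are assumed inputs, e.g. mask
# area or visibility per view.

def proposal_feature(member_embeddings, weights):
    """member_embeddings: (N, D) array, one embedding per member view.
    weights: (N,) nonnegative weights over the member views."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize weights to sum to 1
    feat = (w[:, None] * np.asarray(member_embeddings, dtype=np.float64)).sum(axis=0)
    return feat / np.linalg.norm(feat)  # unit-norm global feature

# Toy example: three member views with 2-D embeddings; the first view
# (e.g. largest mask) gets double weight.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
feat = proposal_feature(emb, weights=[2.0, 1.0, 1.0])
```

Once computed, this single feature can be copied onto every Gaussian belonging to the proposal, which is what keeps the per-primitive language features coherent across views without further optimization.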