Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park

2026-03-24

Summary

This paper introduces a new method, called Group3D, for identifying and classifying 3D objects in images taken from multiple viewpoints, even objects it hasn't specifically been trained to recognize.

What's the problem?

Current methods for open-vocabulary 3D object detection often separate the process of building 3D object shapes from the process of figuring out what those shapes *are*. They first create basic 3D forms based on how things look from different angles, and then try to label them later. This works, but if the images don't clearly show an object's shape, or if parts of it are hidden, the system can make mistakes, either combining separate objects into one or breaking a single object into pieces. It's like trying to assemble a puzzle with missing pieces and only a vague idea of the final picture.

What's the solution?

Group3D solves this by combining shape *and* meaning during the 3D object building process. It uses a powerful AI model that understands both language and images (a multimodal large language model, or MLLM) to create a 'vocabulary' of possible object categories. This vocabulary isn't fixed; it adapts to the specific scene being viewed. The system then organizes these categories into compatibility groups: sets of labels that plausibly refer to the same object when seen from different viewpoints (for example, 'couch' and 'sofa'). When building 3D objects, it only connects parts that both fit together geometrically *and* make sense semantically, meaning their categories fall in the same compatibility group. This prevents the system from making incorrect connections based on shape alone.
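The merging logic described above can be sketched in a few lines. This is an illustrative approximation, not the paper's actual implementation: the compatibility groups, the `Fragment` structure, the 3D IoU test, and the threshold are all assumptions chosen to show the two-gate idea (geometry AND semantics must both agree before two fragments merge).

```python
# Hypothetical sketch of semantically gated merging. All names, groups,
# and thresholds here are illustrative assumptions, not Group3D's code.
from dataclasses import dataclass

# Semantic compatibility groups: category labels that plausibly refer to
# the same object across views (in the paper these come from an MLLM).
COMPAT_GROUPS = [
    {"couch", "sofa", "loveseat"},
    {"desk", "table"},
    {"monitor", "tv", "screen"},
]

def same_group(cat_a: str, cat_b: str) -> bool:
    """True if two labels match exactly or share a compatibility group."""
    if cat_a == cat_b:
        return True
    return any(cat_a in g and cat_b in g for g in COMPAT_GROUPS)

@dataclass
class Fragment:
    category: str  # open-vocabulary label assigned to this 3D fragment
    bbox: tuple    # axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax)

def iou_3d(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = vol_a = vol_b = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)
        vol_a *= a[i + 3] - a[i]
        vol_b *= b[i + 3] - b[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def can_merge(f1: Fragment, f2: Fragment, iou_thresh: float = 0.3) -> bool:
    # Both gates must pass: geometric overlap AND semantic compatibility.
    # Geometry alone (the decoupled baseline) would merge any overlapping pair.
    return (iou_3d(f1.bbox, f2.bbox) >= iou_thresh
            and same_group(f1.category, f2.category))
```

For example, an overlapping 'sofa' fragment and 'couch' fragment would merge (same group), while an equally overlapping 'table' fragment would not, which is exactly the over-merging case the semantic gate is meant to block.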

Why it matters?

This research is important because it significantly improves the accuracy of 3D object detection, especially when dealing with objects the system hasn't seen before. It allows robots and other AI systems to better understand the 3D world around them, even in complex scenes, and it opens the door to more flexible and adaptable computer vision applications.

Abstract

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.