Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
2026-02-27
Summary
This paper focuses on improving how well computers can identify and outline specific objects in images based on text descriptions, even objects the computer hasn't specifically been trained to recognize.
What's the problem?
Current systems that try to do this, known as open-vocabulary segmentation systems, are less accurate than systems trained on many images with precise, pixel-level labels. There are two reasons. First, the underlying vision-language models learn from images paired with general, image-level descriptions rather than precise outlines. Second, language is ambiguous, so a text prompt alone does not always tell the system exactly what to look for.
What's the solution?
The researchers came up with a way to give the computer a little bit of extra help. They showed it a few example images *with* the outlines of the objects already drawn in, alongside the text description. Then, they created a system that smartly combines information from both the text and those example images to figure out what to highlight in a new image. Importantly, this system learns how to best combine the information for each specific image, rather than using a fixed method.
Why does it matter?
This work narrows the accuracy gap between text-prompted segmentation and systems that require large amounts of labeled training data, while still letting the system identify a wide range of objects from simple text prompts. Because the set of example images can keep growing over time, the same approach can also personalize segmentation to specific objects shown in the examples.
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
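To make the idea of a retrieval-augmented, per-image classifier more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that patch features and text embeddings have already been extracted by some VLM and are given as plain lists of floats. The function names (`retrieve`, `fit_per_image_classifier`, `segment`), the cosine-similarity retrieval, the softmax-regression training loop, and all hyperparameters are illustrative assumptions; the abstract does not specify these details.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def retrieve(query_feats, support_feats, support_labels, k=2):
    """For each query patch, collect its k most similar support patches.

    Features are assumed L2-normalized, so the dot product is cosine similarity.
    """
    picked = set()
    for q in query_feats:
        ranked = sorted(range(len(support_feats)),
                        key=lambda i: -dot(q, support_feats[i]))
        picked.update(ranked[:k])
    idx = sorted(picked)
    return [support_feats[i] for i in idx], [support_labels[i] for i in idx]

def fit_per_image_classifier(text_feats, feats, labels,
                             steps=200, lr=0.5, scale=10.0):
    """Fit a linear classifier for one query image (hypothetical fusion scheme).

    Weights start at the text embeddings; the retrieved visual features and the
    text embeddings themselves are then used together as training samples.
    """
    W = [list(t) for t in text_feats]          # one weight row per class
    X = feats + [list(t) for t in text_feats]  # visual + textual samples
    Y = labels + list(range(len(text_feats)))
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x, y in zip(X, Y):
            p = softmax([scale * dot(w, x) for w in W])
            for c in range(len(W)):
                g = scale * (p[c] - (1.0 if c == y else 0.0)) / len(X)
                for d in range(len(x)):
                    grad[c][d] += g * x[d]
        for c in range(len(W)):
            for d in range(len(W[c])):
                W[c][d] -= lr * grad[c][d]     # plain gradient descent step
    return W

def segment(query_feats, W):
    """Assign each query patch to the class with the highest logit."""
    return [max(range(len(W)), key=lambda c: dot(W[c], q)) for q in query_feats]

# Toy 2-D data standing in for real VLM features (made up for illustration).
text_feats = [normalize([1.0, 0.9]), normalize([0.9, 1.0])]   # two class prompts
support_feats = [normalize(v) for v in
                 ([1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9])]
support_labels = [0, 0, 1, 1]                                  # from pixel masks
query_feats = [normalize([1.0, 0.3]), normalize([0.3, 1.0])]   # patches to label

feats, labels = retrieve(query_feats, support_feats, support_labels, k=2)
W = fit_per_image_classifier(text_feats, feats, labels)
preds = segment(query_feats, W)
print(preds)
```

On this toy data the text embeddings alone already separate the two classes, so the example only illustrates the data flow: retrieve support features relevant to the query image, fit a small classifier on them together with the text features, and apply it to the query patches. Because the classifier is refit for every query image, adding new annotated images to the support set requires no retraining of the underlying model in this sketch.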