REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo

2025-03-11

Summary

This paper introduces REF-VLM, an AI model that helps computers understand and describe images by breaking each visual task into three parts: what to look for, how to decode it, and where it is in the picture.

What's the problem?

Current AI models struggle with detailed image tasks like outlining objects or identifying key points, especially when they must handle multiple tasks at once or adapt to different levels of detail.

What's the solution?

REF-VLM uses a three-part referring system, the Triplet-Based Referring Paradigm (TRP), to separate the 'what', 'how', and 'where' of each image task, and it is trained on a huge dataset of varied examples so a single model can handle many tasks without needing a separate model for each.

Why it matters?

This makes AI tools more versatile and accurate for real-world tasks like medical imaging, self-driving cars, or design apps where understanding images in detail is critical.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth, and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.
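To make the triplet idea concrete, here is a minimal sketch of what parsing a TRP-style structured output could look like. The abstract only says that TRP uses symbolic delimiters around (concept, decoding type, target) triplets; the specific delimiter tokens `<trp>`, `</trp>`, and `|` below are illustrative assumptions, not the paper's actual vocabulary.

```python
import re

# Assumed placeholder delimiters; REF-VLM's real symbolic tokens are not
# specified in the abstract.
TRIPLET_PATTERN = re.compile(r"<trp>(.*?)\|(.*?)\|(.*?)</trp>")

def parse_triplets(model_output: str) -> list[dict]:
    """Extract (concept, decoding_type, target) triplets from a
    structured model output string."""
    return [
        {"concept": c.strip(), "decoding_type": d.strip(), "target": t.strip()}
        for c, d, t in TRIPLET_PATTERN.findall(model_output)
    ]

# Example: one response referring to two visual units with different
# decoding types (a segmentation mask and a bounding box).
output = ("The image shows <trp>dog|mask|unit_0</trp> chasing "
          "<trp>ball|box|unit_1</trp>.")
triplets = parse_triplets(output)
# triplets[0] -> {'concept': 'dog', 'decoding_type': 'mask', 'target': 'unit_0'}
```

The point of the structure is that a downstream decoder can dispatch on `decoding_type` (mask, box, keypoint, depth) while the language model stays free-form in the surrounding text, which is what makes the outputs both parsable and interpretable.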