Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
2025-10-09
Summary
This paper introduces a new way for AI models that understand both images and text to perform visual tasks, like identifying objects in a picture or outlining them precisely.
What's the problem?
Current AI models that combine vision and language often struggle with tasks that require detailed understanding of images, like pinpointing exactly where objects are or tracing their precise outlines. They usually work by converting visual information into text, such as writing out bounding-box coordinates as numbers, which loses important details and rules out dense prediction tasks like segmentation. It's like trying to describe a complex painting with just a few words: you miss a lot of the nuance.
What's the solution?
The researchers developed a method called Patch-as-Decodable Token, or PaDT. This approach lets the AI directly generate visual outputs alongside text, without needing to translate everything into words first. It uses 'Visual Reference Tokens', which are derived from small pieces (patches) of the image and can be emitted by the model just like ordinary words. A lightweight decoder then takes the AI's output and turns it into the specific visual prediction, such as a box around an object or a detailed outline. Importantly, the AI processes these visual tokens independently on each pass, improving its ability to tell similar objects apart.
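To make the idea concrete, here is a minimal NumPy sketch of the core mechanism as described above: the token vocabulary is extended with one Visual Reference Token per image patch, the model's output sequence interleaves text token ids with VRT ids, and a lightweight decoder maps the referenced patches back to a bounding box. All names, sizes, and the grid-based box decoding are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 4x4 grid of image patches, each a 16-dim embedding.
GRID, DIM = 4, 16
NUM_PATCHES = GRID * GRID
patch_embeddings = rng.standard_normal((NUM_PATCHES, DIM))

# The embedding table is dynamically expanded with one Visual Reference
# Token (VRT) per patch, so the model can "say" a patch like a word.
TEXT_VOCAB = 1000
vrt_ids = np.arange(TEXT_VOCAB, TEXT_VOCAB + NUM_PATCHES)

# Suppose the model's output interleaves text ids with VRT ids,
# e.g. "the cat <vrt_5> <vrt_6>" referencing the patches covering the cat.
output_ids = np.array([17, 42, vrt_ids[5], vrt_ids[6]])

# A toy "lightweight decoder": pool the referenced patch embeddings (a
# stand-in for the learned decoder's input) and convert the referenced
# patch indices into a bounding box over the patch grid.
referenced = output_ids[output_ids >= TEXT_VOCAB] - TEXT_VOCAB
pooled = patch_embeddings[referenced].mean(axis=0)  # object-level feature
rows, cols = referenced // GRID, referenced % GRID
bbox = tuple(int(v) for v in
             (cols.min(), rows.min(), cols.max() + 1, rows.max() + 1))
print(bbox)  # (x0, y0, x1, y1) in patch units
```

In the actual paper the decoder is a trained network producing detection, segmentation, and grounding outputs; this sketch only shows how patch-indexed tokens can be decoded back into spatial predictions without emitting coordinates as text.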
Why it matters?
This new method is significant because it allows AI models to perform visual tasks more accurately and efficiently, even when compared to much larger and more complex models. It opens the door to more sophisticated applications in areas like robotics, image editing, and self-driving cars, where precise visual understanding is crucial.
Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's textual output tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
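The abstract's training strategy supervises VRT ids and text ids with the same next-token objective over the expanded vocabulary. The following is a generic sketch of a per-token cross-entropy over such a mixed sequence; the paper's "robust" variant may differ in its exact formulation, and the vocabulary split here is an assumption for illustration.

```python
import numpy as np

def per_token_cross_entropy(logits, targets):
    """Average per-token cross-entropy over a (text + VRT) vocabulary.

    logits:  (seq_len, vocab) unnormalized scores.
    targets: (seq_len,) ground-truth token ids; VRT ids are treated
             exactly like text ids, since VRTs live in the same table.
    """
    # Numerically stable log-softmax over the full vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of each target token and average.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab = 1008  # e.g. 1000 text tokens plus 8 VRTs appended to the table
logits = rng.standard_normal((4, vocab))
targets = np.array([17, 42, 1003, 1005])  # text ids mixed with VRT ids
loss = per_token_cross_entropy(logits, targets)
```

Because VRTs share the output head with text tokens, no separate localization loss is needed at this stage; spatial supervision enters through which VRT ids are selected as targets.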