
Introducing Visual Perception Token into Multimodal Large Language Model

Runpeng Yu, Xinyin Ma, Xinchao Wang

2025-02-26

Summary

This paper introduces special tokens called Visual Perception Tokens into AI models that can understand both text and images, helping these models see and understand images better.

What's the problem?

Current AI models that work with both text and images (called MLLMs) have little control over how they look at and understand images. They can't focus on specific parts of an image or pay attention to certain types of objects on their own, which makes it hard for them to understand images as well as humans do.

What's the solution?

The researchers created two new types of tokens: Region Selection Tokens and Vision Re-Encoding Tokens. These tokens let the AI model choose which parts of an image to focus on and how to look at them again for better understanding. The AI can create and use these tokens on its own, just like it generates text.
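To make this concrete, here is a minimal sketch of the kind of generation loop this implies: whenever the model emits a Region Selection Token, the selected region is cropped and re-encoded, and the new visual features are added to the context before generation continues. The token name and the callable interfaces below are assumptions for illustration, not the authors' actual code.

```python
from typing import Callable, List, Tuple

# Assumed name for the special token; the paper's actual vocabulary entry may differ.
REGION_SELECTION_TOKEN = "<region_select>"

def generate_with_region_selection(
    step: Callable[[list], str],               # returns the next token given the context
    parse_box: Callable[[list], Tuple[int, int, int, int]],  # reads the box the model decoded
    encode_region: Callable[[Tuple[int, int, int, int]], list],  # crops + re-encodes that region
    context: list,                             # initial context: question tokens + image features
    eos: str = "<eos>",
) -> List[str]:
    """Decode an answer, triggering an extra perception pass on demand."""
    answer = []
    while True:
        token = step(context)
        if token == REGION_SELECTION_TOKEN:
            # Extra perception pass: focus on the region the model just selected.
            box = parse_box(context + [token])
            context = context + [token] + encode_region(box)
            continue
        if token == eos:
            break
        answer.append(token)
        context = context + [token]
    return answer
```

The key point of the sketch is that the special token is produced by ordinary autoregressive decoding, so the model itself decides when to look again and where.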

Why it matters?

This matters because it makes AI models much better at understanding images. The researchers found that adding these tokens let a smaller 2B-parameter model outperform a much larger 7B model, improving its score by over 23%. This could lead to smarter AI assistants that understand and talk about images more like humans do, which could be useful in areas like education, accessibility for visually impaired people, and helping robots understand their surroundings.

Abstract

To utilize visual information, Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even outperforms a 7B parameter model by 13.4% (from 0.624). Please check out our repo https://github.com/yu-rp/VisualPerceptionToken
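As a rough illustration of the Vision Re-Encoding Token described in the abstract, the sketch below shows one way a generated token's hidden state could act as a control signal for an additional vision-encoding pass. The projection layer, the similarity-based weighting, and all shapes are assumptions for illustration only; the paper's actual control mechanism may be different.

```python
import torch
import torch.nn as nn

class ControlledReEncoder(nn.Module):
    """Illustrative module: steer a second pass of vision features with the
    hidden state of a Vision Re-Encoding Token (not the paper's exact design)."""

    def __init__(self, llm_dim: int, vision_dim: int):
        super().__init__()
        # Map the token's hidden state into the vision feature space.
        self.control_proj = nn.Linear(llm_dim, vision_dim)

    def forward(self, vision_tokens: torch.Tensor, token_hidden: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (num_patches, vision_dim) features from re-running the vision encoder
        # token_hidden:  (llm_dim,) hidden state of the generated Vision Re-Encoding Token
        control = self.control_proj(token_hidden)                 # (vision_dim,)
        # Weight patch features by similarity to the control signal, emphasizing
        # the visual content the model asked to re-examine.
        weights = torch.softmax(vision_tokens @ control, dim=0)   # (num_patches,)
        return vision_tokens * weights.unsqueeze(-1)

# Toy usage with made-up dimensions:
# reenc = ControlledReEncoder(llm_dim=2048, vision_dim=1024)
# guided_feats = reenc(torch.randn(576, 1024), torch.randn(2048))
```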