Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna
2024-12-11

Summary
This paper introduces Perception Tokens, a new mechanism that enhances the ability of multimodal language models (MLMs) to understand and reason about visual information, such as images and videos.
What's the problem?
Multimodal language models have difficulty with tasks that require understanding visual details, like depth or object locations, because they typically rely on text-based reasoning. Current remedies, such as finetuning on task-specific data or outsourcing computation to specialized vision tools, either fail to generalize or are too compute- and memory-intensive, leading to poor performance on tasks that involve reasoning about 3D structure or detecting objects in images.
What's the solution?
The authors introduce Perception Tokens, which are special representations that help MLMs process visual information more effectively. These tokens allow the model to generate intermediate visual data, like depth maps or bounding boxes, that can be used during reasoning. They also propose a training method called AURORA that helps the model learn to use these tokens for better performance across various tasks. This approach leads to significant improvements in accuracy when counting objects and understanding depth in images.
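To make the idea concrete, here is a minimal, illustrative sketch (not the authors' code) of how a depth map could be quantized into discrete perception tokens with a VQVAE-style codebook lookup, so the model can emit them as part of its output stream. The codebook size, patch size, and token names are assumptions for illustration only.

```python
# Sketch: turning a depth map into discrete "perception tokens" via a
# VQ-VAE-style nearest-neighbor codebook lookup. All constants are assumed.
import numpy as np

CODEBOOK_SIZE = 256          # assumed number of codebook entries
PATCH = 16                   # assumed patch size for tokenizing the depth map
rng = np.random.default_rng(0)
# Stand-in for a learned codebook of flattened depth patches.
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH * PATCH))

def depth_to_perception_tokens(depth_map: np.ndarray) -> list[str]:
    """Split an (H, W) depth map into patches, map each patch to its nearest
    codebook entry, and return special-token strings the MLM could generate."""
    h, w = depth_map.shape
    tokens = []
    for i in range(0, h - h % PATCH, PATCH):
        for j in range(0, w - w % PATCH, PATCH):
            patch = depth_map[i:i + PATCH, j:j + PATCH].reshape(-1)
            # Vector quantization: nearest codebook entry by Euclidean distance.
            idx = int(np.argmin(np.linalg.norm(codebook - patch, axis=1)))
            tokens.append(f"<DEPTH_{idx}>")
    return tokens

# Toy usage: a fake 64x64 depth map becomes a short token sequence that can be
# interleaved with ordinary text tokens during reasoning.
fake_depth = rng.uniform(0.0, 10.0, size=(64, 64))
print(depth_to_perception_tokens(fake_depth)[:8])
```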
Why it matters?
This research is important because it expands the capabilities of AI models to handle complex visual reasoning tasks. By integrating Perception Tokens into MLMs, the study paves the way for more advanced applications in areas like computer vision, robotics, and augmented reality, where understanding both language and visual information is crucial.
Abstract
Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet, MLMs cannot produce intermediate depth or boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format and bounding box tokens, which are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
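As a complement to the abstract, the sketch below illustrates (under assumed formats, not the paper's exact templates) how a multi-task training example might interleave perception tokens with text, so the model learns to emit a tokenized depth map before answering a relative-depth question. The delimiter tokens, prompt wording, and field names are hypothetical.

```python
# Sketch: composing a chain-of-thought training target in which the model first
# emits depth perception tokens, then the final answer. Formats are assumed.
def build_training_example(question: str, depth_tokens: list[str], answer: str) -> dict:
    """Return a prompt/target pair whose target interleaves a tokenized depth
    map (wrapped in assumed delimiter tokens) with the textual answer."""
    reasoning = "<depth_start>" + "".join(depth_tokens) + "<depth_end>"
    return {
        "prompt": f"<image>\nQuestion: {question}\nThink with perception tokens.",
        "target": f"{reasoning}\nAnswer: {answer}",
    }

example = build_training_example(
    question="Which object is closer to the camera, the mug or the lamp?",
    depth_tokens=["<DEPTH_12>", "<DEPTH_197>", "<DEPTH_45>"],  # toy sequence
    answer="the mug",
)
print(example["prompt"])
print(example["target"])
```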