Token Warping Helps MLLMs Look from Nearby Viewpoints
Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung
2026-04-06
Summary
This paper investigates how to make multimodal large language models, which process both images and text, better at understanding how a scene looks when viewed from a slightly different angle.
What's the problem?
Current multimodal models struggle when the viewpoint changes, even slightly. Existing approaches typically simulate the new viewpoint by warping the image's pixels, but pixel-wise warping is highly sensitive to depth-estimation errors and can produce distorted images. Essentially, these models don't 'understand' the scene in a way that allows for a smooth change in perspective the way humans do.
What's the solution?
The researchers propose warping the image 'tokens' instead of the individual pixels. Think of tokens as small, meaningful parts of the image that the model already uses to understand what it's seeing. Specifically, they use 'backward token warping': it defines a dense grid in the new (target) viewpoint and, for each grid point, retrieves the corresponding token from the original (source) viewpoint to fill it. This approach proved more stable than warping pixels and kept the scene's meaning consistent across viewpoint changes.
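To make the backward-warping idea concrete, here is a minimal NumPy sketch of the general geometric recipe (not the paper's actual implementation): each cell of the target-view token grid is unprojected using an assumed target-view depth, transformed into the source camera, and filled with the nearest source-view token. The function name, the nearest-neighbor lookup, and the availability of target-view depth are all illustrative assumptions.

```python
import numpy as np

def backward_token_warp(src_tokens, tgt_depth, K, R, t):
    """Backward-warping sketch: for each cell of the target-view token grid,
    retrieve the source-view token it corresponds to (nearest-neighbor).

    src_tokens: (H, W, C) source-view token grid (e.g. ViT patch tokens)
    tgt_depth:  (H, W) depth assumed known at the target view (an assumption)
    K:          (3, 3) intrinsics expressed in token-grid units
    R, t:       rigid transform from the target to the source camera frame
    """
    H, W, C = src_tokens.shape
    # Dense grid of target token-cell centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    ones = np.ones_like(u)
    # Unproject each target grid point to 3D using the target-view depth.
    rays = np.linalg.inv(K) @ np.stack([u, v, ones]).reshape(3, -1)
    pts_tgt = rays * tgt_depth.reshape(1, -1)
    # Transform into the source camera frame and project back to the grid.
    pts_src = R @ pts_tgt + t.reshape(3, 1)
    proj = K @ pts_src
    us, vs = proj[0] / proj[2], proj[1] / proj[2]
    # Nearest-neighbor token retrieval; out-of-view cells become zeros.
    iu = np.clip(np.round(us - 0.5).astype(int), 0, W - 1)
    iv = np.clip(np.round(vs - 0.5).astype(int), 0, H - 1)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H) & (proj[2] > 0)
    gathered = src_tokens[iv, iu].reshape(H, W, C)
    return np.where(valid.reshape(H, W, 1), gathered, 0.0)
```

The key property this illustrates is why backward warping is stable: every target cell is guaranteed to receive exactly one token (or a clean "unseen" marker), whereas forward warping can leave holes or pile several source pixels onto one target location.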
Why it matters?
This work is important because it offers a more reliable way for these models to reason about visual scenes from different perspectives. With token warping, the models perform significantly better than existing methods, bringing them closer to how humans naturally adapt to changes in viewpoint, an ability that is crucial for applications such as robotics and virtual reality.
Abstract
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.