Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
2025-11-24
Summary
This paper introduces Mantis, a new system for helping robots follow instructions that involve vision, language, and actions together. It's designed to improve how robots predict what will happen next as they perform tasks, and ultimately make them better at completing those tasks successfully.
What's the problem?
Current robots built on vision-language-action models struggle because predicting future visual states (what the scene will look like next) is hard and computationally expensive, while compressing those visuals into simpler signals throws away important information. These models also tend to underuse their language instructions, which hurts their comprehension and problem-solving. They essentially get bogged down in *seeing* and *doing* and don't pay enough attention to *what they're told*.
What's the solution?
Mantis solves this by separating the job of predicting future visuals from the robot's main "brain" (the VLA backbone). Its 'Disentangled Visual Foresight' technique hands that prediction off to a separate 'diffusion Transformer' head, which is driven by a small set of 'meta queries'. A 'residual connection' feeds the current image into that head, so it only has to predict what changes from one moment to the next. As a result, the meta queries end up capturing the latent actions that move the scene forward, which in turn makes the robot's explicit actions easier to learn. Because the main model no longer has to generate entire future scenes itself, it keeps its capacity for understanding and reasoning about language instructions.
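To make that concrete, here is a minimal, hypothetical PyTorch sketch of the disentangled-foresight idea. The names (DisentangledForesightSketch, meta_queries, dit_head, and so on) and the simplified one-step denoising objective are illustrative assumptions, not the authors' code: a small Transformer encoder stands in for the VLA backbone, an MLP stands in for the diffusion Transformer head, and the residual connection means the head only predicts the change from the current visual state.

```python
import torch
import torch.nn as nn


class DisentangledForesightSketch(nn.Module):
    """Toy sketch of Disentangled Visual Foresight: the backbone only fills
    in a few 'meta query' tokens; a separate denoising head predicts the next
    visual state, with the current state added back via a residual."""

    def __init__(self, d_model=256, n_queries=4, state_dim=256, action_dim=7):
        super().__init__()
        self.meta_queries = nn.Parameter(torch.randn(n_queries, d_model))
        # Stand-in for the VLA backbone (a real system would use a pretrained VLM).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Stand-in for the diffusion Transformer (DiT) head: recovers the next
        # visual state from a noised target, conditioned on the meta-query outputs.
        self.dit_head = nn.Sequential(
            nn.Linear(state_dim + n_queries * d_model + 1, 512),
            nn.GELU(),
            nn.Linear(512, state_dim),
        )
        # Explicit action head driven by the same meta-query outputs.
        self.action_head = nn.Linear(n_queries * d_model, action_dim)

    def forward(self, obs_tokens, cur_state, next_state):
        B = obs_tokens.shape[0]
        queries = self.meta_queries.unsqueeze(0).expand(B, -1, -1)
        # The backbone attends over observation/instruction tokens plus meta queries.
        hidden = self.backbone(torch.cat([obs_tokens, queries], dim=1))
        q_out = hidden[:, -queries.shape[1]:, :].flatten(1)

        # Simplified one-step "diffusion" objective: noise the next state and
        # ask the head to reconstruct it given the noise level and the queries.
        t = torch.rand(B, 1)
        noise = torch.randn_like(next_state)
        noised = (1 - t) * next_state + t * noise
        delta = self.dit_head(torch.cat([noised, q_out, t], dim=-1))
        # Residual connection: the head predicts the change from the current
        # state, so the queries only need to encode "what moves" (the latent action).
        pred_next = cur_state + delta
        foresight_loss = (pred_next - next_state).pow(2).mean()

        action_pred = self.action_head(q_out)
        return action_pred, foresight_loss
```

In a faithful implementation the backbone would be a pretrained vision-language model, the head a full DiT operating on image latents with a proper diffusion schedule, and the meta queries would also drive the action prediction; the sketch only shows how the pieces connect.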
Why it matters?
This work is important because it significantly improves a robot's ability to follow instructions and complete tasks, reaching a 96.7% success rate on the standard LIBERO benchmark. In real-world tests it also outperforms existing systems such as π_{0.5}, especially when handling new or unfamiliar instructions, and it demonstrates improved reasoning skills. The researchers are sharing their code and weights so others can build on this work, which should help accelerate progress in robotics.
Abstract
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting the VLA directly predict high-dimensional visual states can divert model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably introduces information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone through the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden on the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting fast convergence. Real-world evaluations show that Mantis outperforms π_{0.5}, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
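The abstract mentions three sources of supervision: explicit actions, the next-state (foresight) objective, and language. Below is a hedged sketch of how such losses might be combined in a single training step, reusing the toy model above; the loss weights and the lang_loss placeholder are assumptions for illustration, not the paper's training recipe.

```python
import torch


def training_step(model, batch, w_foresight=1.0, w_lang=1.0):
    """Hypothetical composite objective: explicit actions are the primary
    signal, the disentangled foresight head adds dense visual supervision,
    and a language loss preserves comprehension and reasoning."""
    action_pred, foresight_loss = model(
        batch["obs_tokens"], batch["cur_state"], batch["next_state"]
    )
    action_loss = (action_pred - batch["actions"]).pow(2).mean()
    # Placeholder for language supervision (e.g. a next-token loss on
    # instruction or caption text); zero when no text targets are present.
    lang_loss = batch.get("lang_loss", torch.tensor(0.0))
    return action_loss + w_foresight * foresight_loss + w_lang * lang_loss
```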