FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang
2026-02-20
Summary
This paper focuses on improving how robots understand and predict what will happen in their environment, which is crucial for them to make better decisions and adapt to new situations.
What's the problem?
Currently, teaching robots to build a 'world model' – to predict future events – runs into two main difficulties. First, the way these models are trained makes them focus on perfectly recreating what things *look* like, pixel by pixel, instead of understanding *what* those things are and how they behave. This limits their ability to generalize to new situations. Second, relying on the robot's own predictions about the future during decision-making lets small errors build up over time, leading to inaccurate long-term planning.
What's the solution?
The researchers developed a new method called FRAPPE, which works in two stages. First, during mid-training, the robot learns to predict the underlying representation – the 'idea' – of future observations, rather than the exact pixels it will see. Second, during post-training, they refine the model by aligning its predicted representations simultaneously with those of several other, already well-trained visual models, running the extra computation in parallel to keep things fast. This approach requires less action-labeled data and is more efficient.
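To make the first stage concrete, here is a minimal PyTorch sketch of future-representation prediction: a small head on the policy predicts the latent that a frozen pretrained vision encoder would produce for the future frame, and a cosine loss pulls the two together, with no pixel reconstruction anywhere. The names (FutureLatentHead, frozen_encoder) and the choice of cosine loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureLatentHead(nn.Module):
    """Hypothetical head: maps the policy's hidden state to a predicted
    latent for the future observation (a sketch, not the paper's code)."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, policy_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(policy_hidden)

def future_alignment_loss(pred_latent, future_obs, frozen_encoder):
    """Align the predicted latent with the frozen encoder's embedding of
    the actual future frame; a latent-space cosine distance replaces
    pixel-level reconstruction."""
    with torch.no_grad():                      # target encoder stays frozen
        target = frozen_encoder(future_obs)    # (B, latent_dim)
    return 1.0 - F.cosine_similarity(pred_latent, target, dim=-1).mean()

# Toy usage with a stand-in "encoder" (any frozen vision backbone works):
if __name__ == "__main__":
    B, H, D = 4, 512, 256
    head = FutureLatentHead(H, D)
    frozen_encoder = nn.Linear(3 * 32 * 32, D).requires_grad_(False)
    policy_hidden = torch.randn(B, H)
    future_obs = torch.randn(B, 3 * 32 * 32)
    loss = future_alignment_loss(head(policy_hidden), future_obs, frozen_encoder)
    loss.backward()  # gradients flow only into the prediction head
    print(loss.item())
```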
Why it matters?
This work is important because it provides a more effective and practical way to give robots a better understanding of the world around them. By improving their ability to predict future events, robots can perform more complex tasks, adapt to changing environments, and operate more reliably in real-world scenarios, even when facing situations they haven't encountered before.
Abstract
Enabling vision-language-action (VLA) models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representations simultaneously with multiple visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
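As a rough illustration of the post-training stage, the sketch below aligns one predicted latent with several frozen visual foundation models at once via per-encoder projection heads, averaging the per-encoder losses. The sequential loop stands in for work that a real system could dispatch in parallel; all names, dimensions, and the loss choice are assumptions for illustration, not FRAPPE's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderAligner(nn.Module):
    """Hypothetical post-training module: one projection head per frozen
    visual foundation model, so a single predicted latent can be aligned
    with several target representation spaces simultaneously."""
    def __init__(self, latent_dim: int, target_dims: list[int]):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(latent_dim, d) for d in target_dims]
        )

    def forward(self, pred_latent, future_obs, frozen_encoders):
        losses = []
        for head, enc in zip(self.heads, frozen_encoders):
            with torch.no_grad():           # foundation models stay frozen
                target = enc(future_obs)
            proj = head(pred_latent)        # map into this encoder's space
            losses.append(1.0 - F.cosine_similarity(proj, target, dim=-1).mean())
        # Average so adding more target encoders does not rescale the loss.
        return torch.stack(losses).mean()

# Toy usage with two stand-in "foundation models" of different widths:
if __name__ == "__main__":
    B, D = 4, 256
    encoders = [nn.Linear(3 * 32 * 32, 384).requires_grad_(False),
                nn.Linear(3 * 32 * 32, 768).requires_grad_(False)]
    aligner = MultiEncoderAligner(D, [384, 768])
    pred_latent = torch.randn(B, D, requires_grad=True)
    future_obs = torch.randn(B, 3 * 32 * 32)
    loss = aligner(pred_latent, future_obs, encoders)
    loss.backward()  # updates only the projection heads and pred_latent
    print(loss.item())
```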