Planning with Reasoning using Vision Language World Model
Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
2025-09-04
Summary
This paper introduces a new AI model called the Vision Language World Model, or VLWM, which is designed to better understand and plan actions in the real world by watching videos and learning from them.
What's the problem?
Currently, AI struggles with complex planning because it lacks a good 'world model' – a way to understand how things work and what will happen if it takes certain actions. Existing models often can't reason about actions at a high level, meaning they can't understand *why* something is done, just *what* is done, and they don't predict future states well. This makes it hard for AI to create effective, long-term plans.
What's the solution?
The VLWM tackles this by first watching videos and inferring the overall goal being achieved. It then predicts a trajectory of actions interleaved with the resulting changes in the environment. To build its training targets, it uses a large language model that iteratively refines these goal and action descriptions, guided by compressed summaries of the future video. The model learns two things: how to react immediately to situations (a fast, reflex-like mode) and how to deliberate over plans more carefully (a slower mode that weighs the options). It judges candidate plans by comparing the future they would lead to against the desired goal, using a critic model that learns to score how 'good' a plan is without needing explicit instructions.
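The iterative refinement loop can be sketched roughly as below. This is a simplified illustration, not the paper's implementation: `llm` stands in for any text-completion function, and `caption_tree` is assumed to be a flattened list of captions summarizing future video observations.

```python
# Hypothetical sketch of an LLM Self-Refine loop over future-video captions.
# `llm` is a placeholder for a text-completion function; the prompts are
# illustrative, not the paper's actual prompts.

def self_refine(llm, caption_tree, n_rounds=3):
    """Draft a goal + action/state trajectory from compressed future
    observations, then iteratively critique and revise the draft."""
    context = "\n".join(caption_tree)
    draft = llm(f"Describe the goal and the action/state trajectory:\n{context}")
    for _ in range(n_rounds):
        # Ask the model to critique its own draft against the captions...
        feedback = llm(f"Critique this trajectory against the captions:\n{context}\n\n{draft}")
        # ...then revise the draft using that critique.
        draft = llm(f"Revise the trajectory given this feedback:\n{draft}\n\n{feedback}")
    return draft
```

The key design point is that the refinement is conditioned on a compressed view of the *actual* future, so the extracted goals and action trajectories stay grounded in what the video really shows.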
Why it matters?
This research is important because it significantly improves AI's ability to plan and solve problems in visual environments. The VLWM performs better than previous models on several challenging tasks, including those that require understanding and assisting with real-world activities. This could lead to more helpful robots and AI assistants that can understand our intentions and help us achieve our goals more effectively.
Abstract
Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievement, then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive System-1 plan decoding and reflective System-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where System-2 improves the Elo score by +27% over System-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
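The System-2 planning procedure the abstract describes can be sketched as a cost-minimizing search over candidate plans. The sketch below uses toy stand-ins: `rollout` replaces the VLWM dynamics model, and `critic_cost` replaces the learned critic with a simple set-overlap distance; both names and the example states are hypothetical.

```python
# Minimal sketch of System-2 planning by cost minimization, assuming
# toy placeholders for the dynamics model and the learned critic.

def rollout(state, plan):
    """Toy dynamics model: each action adds a fact to the world state."""
    for action in plan:
        state = state | {action}
    return state

def critic_cost(predicted_state, goal_state):
    """Toy critic: semantic distance as Jaccard dissimilarity of facts."""
    overlap = len(predicted_state & goal_state)
    union = len(predicted_state | goal_state)
    return 1.0 - overlap / union

def system2_plan(initial_state, goal_state, candidate_plans):
    """Pick the plan whose predicted end state is closest to the goal."""
    return min(
        candidate_plans,
        key=lambda plan: critic_cost(rollout(set(initial_state), plan), goal_state),
    )

plans = [
    ("boil water", "add pasta"),
    ("boil water", "add pasta", "drain pasta"),
]
goal = {"have pasta", "boil water", "add pasta", "drain pasta"}
best = system2_plan({"have pasta"}, goal, plans)
print(best)  # → ('boil water', 'add pasta', 'drain pasta')
```

In the paper's setting, the candidate trajectories come from VLWM roll-outs and the cost is produced by a self-supervised critic scoring semantic distance to the goal state; the structure of the search, however, is just this argmin over plan costs.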