Self-Improving World Modelling with Latent Actions
Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
2026-02-09
Summary
This paper introduces a new method called SWIRL that helps large language models (LLMs) and vision-language models (VLMs) better understand how the world works and plan for the future.
What's the problem?
LLMs and VLMs are good at processing information, but they struggle with reasoning and planning because they don't naturally understand how actions change the state of the world. Training them to learn such 'world models' usually requires large amounts of trajectory data in which every action is carefully labelled, which is expensive and time-consuming to collect.
What's the solution?
SWIRL tackles this problem by learning from sequences of states *without* needing labelled actions. It treats actions as hidden (latent) variables and trains two models: one that predicts the next state given the current state and an action (Forward World Modelling), and one that infers which action caused the change from one state to the next (Inverse Dynamics Modelling). The two models improve each other in alternating rounds: the forward model is rewarded for generating next states from which the inverse model can recover the action, and the inverse model is rewarded for inferring actions under which the forward model finds the observed next state likely. Both updates use reinforcement learning, with the frozen partner model's log-probability as the reward signal. In short, it is a self-teaching loop in which each model improves by learning to be predictable to the other; a sketch of one round follows below.
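To make the loop concrete, here is a minimal Python sketch of one SWIRL round under the description above. The `Policy` class, the `grpo_update` stub, the prompt strings, and the reward shaping are hypothetical placeholders standing in for real LLM/VLM policies and a real GRPO trainer, not the authors' implementation; the point is only the structure: sample from one model, score with the other frozen model's log-probability, and apply a GRPO-style group-relative update.

```python
# Illustrative sketch of SWIRL's alternating self-improvement loop.
# Policy, grpo_update, prompt formats, and reward shaping are placeholders.

import random
from dataclasses import dataclass


@dataclass
class Policy:
    """Stand-in for an LLM/VLM policy that can sample completions and score them."""
    name: str

    def sample(self, prompt: str, k: int) -> list[str]:
        # Placeholder: a real policy would decode k candidate completions.
        return [f"{self.name}-sample-{i}" for i in range(k)]

    def log_prob(self, prompt: str, completion: str) -> float:
        # Placeholder: a real policy would return log P(completion | prompt).
        return -random.random()


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: center rewards by the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def grpo_update(policy: Policy, prompts, completions, advantages) -> None:
    # Placeholder for one GRPO policy-gradient step on the sampled groups.
    pass


def swirl_round(fwm: Policy, idm: Policy, transitions, k: int = 4) -> None:
    """One SWIRL round over state-only transitions (x, y)."""
    # Phase 1: update the forward world model; the frozen IDM provides the reward.
    for x, y in transitions:
        z_hats = idm.sample(f"infer the action for {x} -> {y}", k)  # latent action guesses
        prompts, groups, advs = [], [], []
        for z in z_hats:
            prompt = f"state: {x}; action: {z}"
            y_hats = fwm.sample(prompt, k)  # candidate next states
            # Reward each candidate next state by how well the frozen IDM can
            # recover the latent action from it (a mutual-information-style signal).
            rewards = [idm.log_prob(f"infer the action for {x} -> {yh}", z) for yh in y_hats]
            prompts.append(prompt)
            groups.append(y_hats)
            advs.append(group_advantages(rewards))
        grpo_update(fwm, prompts, groups, advs)

    # Phase 2: update the inverse dynamics model; the frozen FWM provides the reward.
    for x, y in transitions:
        prompt = f"infer the action for {x} -> {y}"
        z_hats = idm.sample(prompt, k)
        # Reward each inferred action by how likely the frozen FWM finds the
        # observed next state y under that action (an ELBO-style reconstruction term).
        rewards = [fwm.log_prob(f"state: {x}; action: {z}", y) for z in z_hats]
        grpo_update(idm, [prompt], [z_hats], [group_advantages(rewards)])


if __name__ == "__main__":
    fwm, idm = Policy("fwm"), Policy("idm")
    data = [("door closed", "door open"), ("cup full", "cup empty")]
    for _ in range(3):  # a few self-improvement rounds
        swirl_round(fwm, idm, data)
```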
Why it matters?
This research is important because it offers a way to build LLMs and VLMs that can reason, plan, and interact with the world more effectively, without needing massive amounts of action-labelled data. The sizeable performance improvements across several benchmarks suggest that SWIRL is a promising step towards more capable AI systems.
Abstract
Internal modelling of the world -- predicting transitions between previous states X and next states Y under actions Z -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) P_θ(Y|X,Z) and Inverse Dynamics Modelling (IDM) Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
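For readers who want the two phases in symbols, the LaTeX fragment below gives one plausible rendering using the abstract's P_θ(Y|X,Z) and Q_φ(Z|X,Y). The Barber--Agakov bound, the action prior p(z|x), and the exact ELBO form are our reading of the abstract, not equations quoted from the paper.

```latex
% Plausible rendering of SWIRL's two phases (our reading of the abstract;
% the bound, the action prior p(z | x), and the exact ELBO form are assumptions).

% Phase 1 (FWM update, IDM frozen): a Barber--Agakov-style lower bound on the
% conditional mutual information between generated next states and latent actions.
\[
I(Y; Z \mid X) \;\ge\; H(Z \mid X) \;+\;
  \mathbb{E}_{z \sim p(z \mid x),\; y \sim P_\theta(y \mid x, z)}
  \bigl[\log Q_\phi(z \mid x, y)\bigr],
\]
% so the reward for a sampled next state $y$ is $\log Q_\phi(z \mid x, y)$.

% Phase 2 (IDM update, FWM frozen): the ELBO of an observed transition $(x, y)$.
\[
\log P_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{z \sim Q_\phi(z \mid x, y)} \bigl[\log P_\theta(y \mid x, z)\bigr]
  \;-\; \mathrm{KL}\bigl(Q_\phi(z \mid x, y) \,\|\, p(z \mid x)\bigr),
\]
% so the reward for a sampled latent action $z$ is $\log P_\theta(y \mid x, z)$.
```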