Computer-Using World Model
Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, Pu Zhao, Lukas Wutschitz, Samuel Kessler, Huseyin A Inan, Robert Sim, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
2026-02-20
Summary
This paper introduces a new way for computer programs, called 'agents', to understand what will happen when they interact with software like Microsoft Office. It's about helping these agents plan ahead and avoid mistakes.
What's the problem?
When an agent is trying to complete a task in a program like Word or Excel, even one wrong click can derail the whole task. Unlike in a video game, where you can freely try things out, many actions in real software can't simply be undone without risking the work already done. That makes it hard to teach an agent through trial and error: large-scale experimentation on the real application isn't practical, even though the software itself is fully digital and predictable.
What's the solution?
The researchers created something called the 'Computer-Using World Model', or CUWM. Think of it as a predictive tool: CUWM takes the current screen and a possible action (like clicking a button) and *predicts* what the screen will look like afterward. It does this in two steps: first it describes in text what changes the action will cause, and then it renders those changes visually to produce the predicted next screenshot. CUWM is trained on recorded interactions of agents using real Microsoft Office applications, and is then refined a little further with a lightweight reinforcement-learning step.
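As a rough, hypothetical sketch of how this two-stage prediction could be structured (the class and method names below are illustrative and not taken from the paper), a CUWM-style interface might look like this:

```python
from dataclasses import dataclass

# Minimal sketch of CUWM's two-stage factorization. The class and method
# names here are hypothetical placeholders, not the paper's actual code.

@dataclass
class UIState:
    screenshot: bytes       # rendered screen image
    description: str = ""   # textual summary of the state (optional)

@dataclass
class Action:
    kind: str               # e.g. "click", "type", "scroll"
    target: str             # e.g. "Bold button in the Home ribbon"
    argument: str = ""      # e.g. text to type

class ComputerUsingWorldModel:
    def __init__(self, text_model, image_model):
        self.text_model = text_model    # stage 1: predicts textual state changes
        self.image_model = image_model  # stage 2: renders the next screenshot

    def predict(self, state: UIState, action: Action) -> UIState:
        # Stage 1: describe, in text, the agent-relevant changes the action
        # would cause (e.g. "the selected paragraph becomes bold").
        changes = self.text_model.describe_transition(
            screenshot=state.screenshot, action=action
        )
        # Stage 2: realize those changes visually to synthesize the next screenshot.
        next_screenshot = self.image_model.render(
            screenshot=state.screenshot, changes=changes
        )
        return UIState(screenshot=next_screenshot, description=changes)
```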
Why it matters?
This matters because it lets agents 'think' before they act. Instead of clicking and hoping for the best, they can use CUWM to simulate several candidate actions and choose the one most likely to succeed. That makes them more reliable on long, complex tasks and better suited to automating work in office software.
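A minimal sketch of what this simulate-and-compare loop could look like, assuming a frozen agent that proposes candidate actions and can score predicted states (all names below are illustrative, not the paper's API):

```python
# Illustrative sketch of world-model-guided test-time action search.
# `agent.propose_actions`, `agent.score_state`, and `world_model.predict`
# are assumed interfaces for this example.

def select_action(agent, world_model, state, num_candidates=5):
    """Simulate each candidate action with the world model and return the
    one whose predicted outcome looks most promising."""
    candidates = agent.propose_actions(state, n=num_candidates)

    best_action, best_score = None, float("-inf")
    for action in candidates:
        predicted_state = world_model.predict(state, action)  # one-step imagined rollout
        score = agent.score_state(predicted_state)            # e.g. estimated progress toward the goal
        if score > best_score:
            best_action, best_score = action, score

    # Only the winning action is executed in the real application;
    # the alternatives were explored purely in simulation.
    return best_action
```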
Abstract
Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.