
Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

2026-02-11


Summary

This paper introduces Code2World, a new system that helps AI agents understand and interact with computer interfaces like those on your phone. It focuses on predicting what a screen will look like after an action is taken, which is crucial for an AI that needs to navigate apps and complete tasks automatically.

What's the problem?

AI systems that try to predict how a graphical user interface (GUI) will change after an action struggle to balance two goals: accurately capturing what the screen *looks* like and precisely controlling the underlying *structure* of the interface. Existing methods either produce blurry or unrealistic images, or they can't control exactly how on-screen elements change when an action is performed. On top of that, there is very little training data available to teach these models how GUIs behave.

What's the solution?

The researchers created Code2World, which translates a screen and an action into code that can then be *rendered* into a predicted image of the next screen state. To overcome the lack of data, they built AndroidCode, a dataset of over 80,000 screen-action pairs, by converting real user interactions into HTML and then refining that code using visual feedback. They first fine-tuned existing vision-language models on this data as a 'cold start', then applied 'Render-Aware Reinforcement Learning', which rewards the model for generating code that renders into visually accurate and action-consistent screens. Essentially, they taught the AI to 'code' the next screen instead of just guessing what it will look like; a simplified sketch of this render-and-score idea follows below.
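To make the render-and-score idea concrete, here is a minimal, illustrative sketch in Python. It is not the paper's implementation: the headless-browser renderer (Playwright/Chromium), the 412x915 viewport, the ground-truth file name, and the simple pixel-difference reward are all assumptions for illustration, whereas the paper's actual reward also enforces visual semantic fidelity and action consistency.

```python
# Illustrative sketch (not the paper's code): render model-generated HTML for the
# predicted next screen, then score it against the ground-truth screenshot.

import io
import numpy as np
from PIL import Image
from playwright.sync_api import sync_playwright


def render_html(html: str, width: int = 412, height: int = 915) -> Image.Image:
    """Render an HTML string to a screenshot using a headless browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)            # load the generated code
        png_bytes = page.screenshot()     # capture the rendered screen as PNG bytes
        browser.close()
    return Image.open(io.BytesIO(png_bytes)).convert("RGB")


def visual_reward(predicted: Image.Image, target: Image.Image) -> float:
    """Toy reward: 1 minus mean absolute pixel difference (1.0 = identical images)."""
    target = target.resize(predicted.size)
    a = np.asarray(predicted, dtype=np.float32)
    b = np.asarray(target, dtype=np.float32)
    return float(1.0 - np.abs(a - b).mean() / 255.0)


# Usage: score one predicted screen against the real next-step screenshot.
# `generated_html` would come from the vision-language coder given (screen, action);
# the ground-truth file name is a placeholder.
generated_html = "<html><body><h1>Settings</h1><button>Wi-Fi</button></body></html>"
pred_img = render_html(generated_html)
gt_img = Image.open("next_screen_ground_truth.png").convert("RGB")
print("reward:", visual_reward(pred_img, gt_img))
```

The key design choice this illustrates is that the code is scored through its *rendered* output rather than as text, so the reward reflects what a user would actually see on screen.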

Why it matters?

This work is important because it significantly improves the ability of AI agents to interact with GUIs. Code2World matches or beats some of the most advanced AI models, including GPT-5 and Gemini-3-Pro-Image, at predicting the next screen, and it makes other AI systems better at completing tasks within apps, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. This could lead to more helpful virtual assistants, automated software testing, and other applications where AI needs to reliably interact with computer interfaces.

Abstract

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format and layout following, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal to enforce visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
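As a companion to the sketch above, the following hypothetical Python sketch shows how a visual-feedback revision loop like the one described for building AndroidCode could look. `vlm_translate` and `vlm_revise` are placeholder stand-ins for vision-language model calls (not real APIs), the similarity threshold and round count are illustrative assumptions, and it reuses `render_html` and `visual_reward` from the earlier sketch.

```python
# Hypothetical sketch of a visual-feedback revision loop for corpus construction.
# `vlm_translate` and `vlm_revise` are placeholders for VLM calls and must be
# supplied; `render_html` and `visual_reward` come from the previous sketch.

def build_screen_action_pair(screenshot, action, max_rounds: int = 3, threshold: float = 0.9):
    """Translate a real GUI screenshot into HTML, then revise the code until the
    rendered page looks close enough to the original screen."""
    html = vlm_translate(screenshot)                 # hypothetical: screenshot -> HTML
    score = 0.0
    for _ in range(max_rounds):
        rendered = render_html(html)                 # render the candidate code
        score = visual_reward(rendered, screenshot)  # how close does it look?
        if score >= threshold:
            break                                    # good enough, keep this code
        # hypothetical: ask the VLM to fix the code given both images as feedback
        html = vlm_revise(html, rendered, screenshot)
    return {"html": html, "action": action, "score": score}
```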