R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang
2025-10-15
Summary
This paper investigates whether large language models (LLMs) can be used to predict what will happen in digital environments, helping agents make better decisions without needing to try everything out randomly. The authors find that LLMs struggle with long-horizon predictions due to hallucination and outdated training knowledge, and then propose a way to improve them.
What's the problem?
Imagine you're teaching a computer to play a game or complete a task. Normally, it learns by trying things and seeing what happens, which can take a long time. LLMs could speed this up by *predicting* what will happen if the computer takes a certain action. However, LLMs sometimes 'hallucinate' – they make things up – and they only know what they were trained on, so they can't adapt to new situations. This means their predictions get worse and worse the further into the future they try to look, making them unreliable for complex tasks that require planning several steps ahead.
What's the solution?
The researchers developed a system called Retrieval-augmented World Model, or R-WoM. This system doesn't rely solely on the LLM's internal knowledge. Instead, when the LLM needs to predict something, R-WoM first searches for relevant, up-to-date information from external sources, like online tutorials. This extra information 'grounds' the LLM, making its predictions more accurate and reliable, especially when planning over longer periods. Essentially, it gives the LLM access to a constantly updated rulebook.
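The retrieve-then-predict loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `TUTORIALS` is a toy stand-in for the external tutorial corpus, retrieval is simple keyword overlap rather than a real retriever, and `predict_next_state` is a stub where R-WoM would prompt an LLM with the state, action, and retrieved evidence.

```python
import re

# Toy stand-in for an external tutorial corpus (hypothetical content).
TUTORIALS = [
    "To save a file in the editor, press Ctrl+S; a 'Saved' notice appears.",
    "Clicking the Submit button uploads the form and shows a confirmation page.",
    "Opening the terminal: press Ctrl+Alt+T; a shell prompt is displayed.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; '+' kept so shortcuts like 'ctrl+s' survive."""
    return set(re.findall(r"[a-z0-9+]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank tutorial snippets by word overlap with the query (toy retriever)."""
    q = tokenize(query)
    return sorted(corpus, key=lambda doc: -len(q & tokenize(doc)))[:k]

def predict_next_state(state: str, action: str, evidence: list[str]) -> str:
    """Stub in place of the LLM world model: in R-WoM the prompt would
    combine state, action, and retrieved evidence to ground the prediction."""
    context = " ".join(evidence)
    return f"After '{action}' in state '{state}', expect: {context}"

# One simulation step: retrieve grounding evidence, then predict.
state = "editor with unsaved changes"
action = "press Ctrl+S"
evidence = retrieve(f"{state} {action}", TUTORIALS)
prediction = predict_next_state(state, action, evidence)
```

The key design point is the ordering: evidence is fetched *before* the prediction is made, so each simulated step is conditioned on current external knowledge rather than only on what the LLM memorized during training.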
Why does it matter?
This research is important because it shows the limitations of using LLMs directly as 'world models' for complex tasks. But it also offers a practical solution – using external knowledge to supplement the LLM – that significantly improves their performance. This could lead to smarter, more efficient AI agents that can learn and solve problems more effectively in real-world digital environments, like games, simulations, or even robotic control.
Abstract
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.