dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, Yichen Zhu
2026-04-27
Summary
This paper introduces a new way to test how well robots can perform tasks without needing to physically test them in thousands of different situations. It creates a virtual world, called dWorldEval, that can quickly and efficiently evaluate robotic policies.
What's the problem?
Testing a robot's ability to handle many different environments and tasks is incredibly difficult and time-consuming. Traditional methods require a huge amount of real-world testing or very complex simulations, neither of which is practical when you want to evaluate a robot across a wide range of scenarios. Existing virtual environments weren't scalable enough to handle the complexity needed for robust robotics evaluation.
What's the solution?
The researchers developed dWorldEval, which works by converting everything a robot 'sees' and 'does', such as images, language commands, and movements, into a shared vocabulary of discrete tokens. These tokens are then processed by a transformer, a type of neural network, that predicts what will happen next. The researchers also added a 'memory' system to keep the virtual world consistent over time, along with a special 'progress' signal that tracks how close the robot is to completing its task. The model can then automatically determine whether the robot succeeded.
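To make the "shared vocabulary" idea concrete, here is a minimal sketch of how different modalities could be mapped into one unified discrete token space. All names, vocabulary sizes, and binning choices below are illustrative assumptions, not details from the paper; the point is only that each modality gets a disjoint id range so a single model can process them together.

```python
# Hypothetical unified token space: vision, language, and action tokens
# share one vocabulary, with each modality in its own id range.
# (All sizes below are assumed for illustration.)

VISION_VOCAB = 1024   # e.g. size of a VQ codebook for image patches
TEXT_VOCAB = 256      # e.g. byte-level language tokens
ACTION_VOCAB = 64     # e.g. number of bins per continuous action dimension

# Disjoint offsets so token ids never collide across modalities.
VISION_OFFSET = 0
TEXT_OFFSET = VISION_OFFSET + VISION_VOCAB
ACTION_OFFSET = TEXT_OFFSET + TEXT_VOCAB
UNIFIED_VOCAB = ACTION_OFFSET + ACTION_VOCAB

def tokenize_action(value: float, low: float = -1.0, high: float = 1.0) -> int:
    """Uniformly bin one continuous action dimension into a discrete token."""
    value = min(max(value, low), high)
    bin_idx = int((value - low) / (high - low) * (ACTION_VOCAB - 1))
    return ACTION_OFFSET + bin_idx

def tokenize_text(command: str) -> list[int]:
    """Byte-level language tokens, shifted into the unified id range."""
    return [TEXT_OFFSET + b for b in command.encode("utf-8")]

def unified_sequence(vision_codes: list[int], command: str,
                     action_dims: list[float]) -> list[int]:
    """Concatenate all modalities into one token sequence for the model."""
    tokens = [VISION_OFFSET + c for c in vision_codes]
    tokens += tokenize_text(command)
    tokens += [tokenize_action(a) for a in action_dims]
    assert all(0 <= t < UNIFIED_VOCAB for t in tokens)
    return tokens
```

Once everything lives in one token vocabulary like this, a single transformer can treat prediction across images, language, and actions as one sequence-modeling problem rather than three separate ones.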
Why it matters?
This research is important because it allows developers to test and improve robots much faster and more cheaply. By creating a scalable and accurate virtual testing ground, dWorldEval opens the door to building more capable and reliable robots that can handle a wider variety of real-world situations. It represents a new approach to building virtual worlds specifically designed for evaluating robots.
Abstract
Evaluating robotic policies across thousands of environments and thousands of tasks is infeasible with existing approaches, motivating a new methodology for scalable policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotic policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency, and we introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, allowing success to be determined automatically once the predicted progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm for building world simulators for robotics evaluation at scale.
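The progress-token mechanism described in the abstract can be sketched as a simple rollout loop: the world model jointly predicts the next observation and a scalar progress value in [0, 1], and the evaluation is declared successful once progress reaches 1. This is an illustrative sketch under assumed interfaces (`world_model_step` and `policy` are hypothetical callables), not the authors' implementation.

```python
# Illustrative sketch of progress-token-based success detection.
# world_model_step(obs, action) -> (next_obs, progress) is assumed to
# jointly predict the next observation and the task-progress value.

def rollout_until_success(world_model_step, policy, obs,
                          max_steps=100, success_threshold=1.0):
    """Roll a policy out inside the world model.

    Returns (success, steps_taken): success is True as soon as the
    predicted progress reaches the threshold, without any hand-written
    per-task success checker.
    """
    for step in range(1, max_steps + 1):
        action = policy(obs)
        obs, progress = world_model_step(obs, action)
        if progress >= success_threshold:
            return True, step
    return False, max_steps
```

The design appeal is that success detection becomes a prediction problem handled by the same model that simulates the world, so no separate, manually engineered reward or success function is needed per task.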