WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, Yongtao Ge
2026-04-24
Summary
This paper introduces WorldMark, a new benchmark designed to fairly compare different interactive video generation models, like those that let you control a character in a video game with WASD keys.
What's the problem?
Currently, evaluating these models is difficult because each company tests them using its own unique environments and its own way of controlling the action, making it impossible to directly compare how well they perform against each other. Existing public tests don't provide the standardized conditions needed for a true comparison, so scores from different models can't be meaningfully compared.
What's the solution?
The researchers created WorldMark, which gives every model the same set of scenes and action sequences. They built a 'translator' that converts a shared vocabulary of simple WASD-style controls into the specific commands each of six major models understands. They also created a suite of 500 test cases spanning different viewpoints, visual styles, and three difficulty tiers, plus a toolkit that measures video quality, how well the video follows the control instructions, and whether the world within the video stays consistent. All of the data, code, and model outputs will be publicly released.
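To make the 'translator' idea concrete, here is a minimal sketch of what such an action-mapping layer could look like. This is an illustration only, not the paper's actual code: the model names, control formats, and function names below are all hypothetical, assuming only that different models accept different native inputs (e.g. continuous velocities vs. discrete key presses) while the benchmark feeds them one shared WASD-style action sequence.

```python
# Hypothetical sketch of a unified action-mapping layer.
# One shared WASD-style action is translated into each model's
# native control format, so all models see identical trajectories.

from dataclasses import dataclass

@dataclass
class Action:
    """One step in the shared action vocabulary."""
    move: str = ""      # "W", "A", "S", or "D"; "" means no movement
    yaw: float = 0.0    # camera turn in degrees, positive = right
    pitch: float = 0.0  # camera tilt in degrees, positive = up

# Illustrative adapters for two made-up model families.
def to_velocity_format(a: Action) -> dict:
    """A model family that expects continuous (vx, vy) velocities."""
    vx = {"W": 1.0, "S": -1.0}.get(a.move, 0.0)
    vy = {"D": 1.0, "A": -1.0}.get(a.move, 0.0)
    return {"vx": vx, "vy": vy, "yaw": a.yaw, "pitch": a.pitch}

def to_keypress_format(a: Action) -> dict:
    """A model family that expects discrete key presses plus mouse deltas."""
    return {"keys": [a.move] if a.move else [],
            "mouse_dx": a.yaw, "mouse_dy": -a.pitch}

ADAPTERS = {
    "velocity_model": to_velocity_format,
    "keypress_model": to_keypress_format,
}

def map_trajectory(model: str, trajectory: list[Action]) -> list[dict]:
    """Translate one shared trajectory into a model's native controls."""
    return [ADAPTERS[model](a) for a in trajectory]

traj = [Action("W"), Action("W", yaw=15.0), Action("D")]
print(map_trajectory("velocity_model", traj)[1])
# → {'vx': 1.0, 'vy': 0.0, 'yaw': 15.0, 'pitch': 0.0}
```

The key design point is that evaluation scripts only ever see the shared `Action` vocabulary; adding a new model means writing one adapter, not redefining the benchmark.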
Why it matters?
WorldMark is important because it allows researchers and developers to objectively assess and improve interactive video generation models. By providing a standardized way to test these models, it should accelerate progress in the field and help create more realistic and controllable virtual worlds. The authors also launched an online arena (warena.ai) where people can compare models side by side in real time.
Abstract
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.