Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang
2026-03-24
Summary
This paper introduces a new way to test how well computer programs understand and predict what happens in videos, focusing on how things change when someone interacts with the scene.
What's the problem?
Current tests for these programs either check whether generated videos *look* realistic or match a text description, or they evaluate 3D reconstructions at a single frozen moment in time. Neither approach really checks whether the program understands how actions *cause* changes over time, like what happens if you push a box or pick up a ball. There was no good, all-around test of whether these 'world models' can accurately predict the consequences of interactions.
What's the solution?
The researchers created a benchmark called Omni-WorldBench. This benchmark has two parts: a collection of different scenarios and interactions (Omni-WorldSuite) and a way to automatically test how well the program predicts what will happen when an action is taken (Omni-Metrics). They tested 18 different programs to see how they performed on these interactive tasks, looking at both the final result and how things changed along the way.
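To make the idea of scoring both "the final result and how things changed along the way" concrete, here is a minimal sketch of a trajectory-aware interaction score. Everything in it is an illustrative assumption: the function names (`state_match`, `interaction_score`), the representation of scene states as attribute dictionaries, and the 50/50 weighting are hypothetical and do not reflect the paper's actual Omni-Metrics implementation.

```python
# Hypothetical sketch: score an interaction prediction on both its final
# outcome and its intermediate state trajectory. Not the paper's API.

def state_match(pred: dict, gt: dict) -> float:
    """Fraction of ground-truth state attributes the prediction reproduces."""
    if not gt:
        return 1.0
    return sum(pred.get(k) == v for k, v in gt.items()) / len(gt)

def interaction_score(pred_traj: list, gt_traj: list,
                      w_final: float = 0.5) -> float:
    """Blend final-outcome accuracy with mean intermediate-state accuracy."""
    final = state_match(pred_traj[-1], gt_traj[-1])
    inter_pred, inter_gt = pred_traj[:-1], gt_traj[:-1]
    if inter_pred and inter_gt:
        n = min(len(inter_pred), len(inter_gt))
        intermediate = sum(state_match(inter_pred[i], inter_gt[i])
                           for i in range(n)) / n
    else:
        intermediate = final
    return w_final * final + (1 - w_final) * intermediate

# Example: a pushed box should slide and come to rest displaced.
gt = [{"box": "at_rest"}, {"box": "sliding"}, {"box": "displaced"}]
pred = [{"box": "at_rest"}, {"box": "sliding"}, {"box": "at_rest"}]  # wrong end state
print(interaction_score(pred, gt))  # → 0.5 (correct trajectory, wrong outcome)
```

The point of the sketch is the design choice the benchmark motivates: a model that gets the intermediate motion right but the final state wrong (as above) is distinguishable from one that is wrong throughout, which a final-frame-only metric could not capture.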
Why it matters?
This work is important because it highlights that current programs aren't very good at predicting how things will change when someone interacts with a scene. By providing a public benchmark, the researchers hope to encourage the development of better 'world models' that can truly understand and predict how the world works in a dynamic, interactive way, which is crucial for things like robotics and realistic simulations.
Abstract
Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.