ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li

2025-11-28

Summary

This research explores whether current artificial intelligence models, specifically those that combine vision and language, exhibit anything like the understanding humans develop through physical interaction with the world. It asks whether these models, which are usually trained by just *looking* at data, can actually grasp concepts that come from acting in and experiencing a physical environment.

What's the problem?

The core issue is that most advanced vision-language models learn from massive datasets of images and text, but they don't have bodies or the ability to *do* things in the world. This raises doubts about whether they truly 'understand' concepts like how actions affect objects, or what's possible in a given situation. The researchers wanted a way to test if these models show signs of 'embodied cognition' – intelligence arising from interacting with the world, not just observing it.

What's the solution?

To address this, the researchers created a new benchmark called ENACT. It presents models with visual question answering tasks framed as a simulated interaction. The model is given a series of images showing an activity unfolding in a scene, but the order is scrambled. The task is either to reorder the images to match the sequence of actions taken (forward world modeling), or to reorder the actions to match the sequence of images (inverse world modeling). Succeeding at these tasks requires the model to understand cause and effect, predict outcomes, and essentially build a mental model of how the world works, much as a person would.
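The two reordering tasks can be sketched as a simple data-construction routine. This is a hedged illustration, not the paper's actual pipeline: the function name `make_reordering_tasks` and the `(observation, action)` episode format are assumptions for clarity, whereas ENACT's real pipeline derives episodes from the BEHAVIOR simulator, with actions expressed as scene-graph changes.

```python
import random

def make_reordering_tasks(episode, seed=0):
    """Build forward/inverse world-modeling QA pairs from one episode.

    `episode` is a hypothetical list of (observation, action) steps.
    Forward task: reorder shuffled observations given the ordered actions.
    Inverse task: reorder shuffled actions given the ordered observations.
    """
    rng = random.Random(seed)
    observations = [obs for obs, _ in episode]
    actions = [act for _, act in episode]

    def shuffle_with_key(items):
        order = list(range(len(items)))
        rng.shuffle(order)
        shuffled = [items[i] for i in order]
        # Answer key: for each shuffled position, its true step index.
        return shuffled, order

    shuffled_obs, obs_key = shuffle_with_key(observations)
    shuffled_act, act_key = shuffle_with_key(actions)

    forward = {"context": actions, "question": shuffled_obs, "answer": obs_key}
    inverse = {"context": observations, "question": shuffled_act, "answer": act_key}
    return forward, inverse
```

Framing both directions over the same episode is what makes the comparison in the paper possible: a model can be probed on predicting effects from actions and on inferring actions from effects, using identical underlying content.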

Why does it matter?

This work is important because it highlights a potential limitation of current AI. If models lack embodied understanding, they might struggle with real-world applications that require physical interaction, like robotics or even just common-sense reasoning. By creating ENACT, the researchers provide a way to measure and improve the ability of AI to understand the world in a more human-like way, moving beyond simply recognizing objects to understanding how things work and how to interact with them.

Abstract

Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
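One natural way to score a reordering answer is pairwise order accuracy: the fraction of element pairs whose relative order the model got right, so that near-miss orderings earn partial credit. This is a hedged sketch of one plausible metric; the abstract does not specify ENACT's scoring protocol, and the function name here is assumed.

```python
def pairwise_order_accuracy(predicted, truth):
    """Fraction of element pairs placed in the correct relative order.

    `predicted` and `truth` are the same items in (possibly) different
    orders. Returns 1.0 for a perfect ordering, 0.0 for a full reversal.
    """
    n = len(truth)
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_true = {item: i for i, item in enumerate(truth)}
    correct = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = truth[i], truth[j]
            total += 1
            # A pair counts if the model preserved its true relative order.
            if (pos_pred[a] < pos_pred[b]) == (pos_true[a] < pos_true[b]):
                correct += 1
    return correct / total if total else 1.0
```

A metric like this would also make visible the trend the paper reports: accuracy degrading as the interaction horizon (and hence the number of items to order) grows.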