GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun

2026-03-26

Summary

This paper introduces a new way to test how well artificial intelligence, specifically large language models that can process both text and images, understands and reacts to complex situations in 3D games.

What's the problem?

Current tests for AI don't really challenge models to understand what's happening in fast-paced, multiplayer games from the perspective of a character *within* the game. Models struggle to keep track of who is doing what, when, and how it relates to the game world, especially when many things happen at once across multiple players. Existing benchmarks simply aren't good enough to tell whether an AI can truly 'think' like an agent in a dynamic environment.

What's the solution?

The researchers created a dataset called GameplayQA. They took videos of people playing multiplayer 3D games and carefully labeled everything happening: what the player themselves is doing, what other players are doing, and what's happening in the game world. These labels are incredibly detailed, arriving more than once per second (1.22 labels per second on average), and are organized around 'self', 'others', and 'the world'. From these labels they created about 2,400 question-and-answer pairs that test different levels of reasoning ability, along with a taxonomy of the kinds of mistakes an AI might make. Finally, they tested some of the most advanced AI models against this dataset.
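To make the triadic annotation idea concrete, here is a minimal sketch of what one time-synced record might look like. The field names and the density calculation are illustrative assumptions, not the paper's actual schema; the paper only specifies that captions are organized around Self, Other Agents, and the World at roughly 1.22 labels/second.

```python
from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    """Hypothetical GameplayQA-style record; field names are assumptions."""
    timestamp: float                 # seconds into the gameplay video
    self_caption: str                # what the ego player is doing
    others_captions: list[str] = field(default_factory=list)  # other agents
    world_caption: str = ""          # environment states and events

def labels_per_second(annotations: list[FrameAnnotation], duration: float) -> float:
    """Annotation density: total time-synced labels divided by clip length."""
    total = sum(
        1 + len(a.others_captions) + (1 if a.world_caption else 0)
        for a in annotations
    )
    return total / duration

# Example: two annotated moments in a 2-second clip.
clip = [
    FrameAnnotation(0.5, "Self reloads weapon",
                    ["Teammate captures point B"], "Door opens"),
    FrameAnnotation(1.4, "Self sprints to cover", [], "Alarm sounds"),
]
print(labels_per_second(clip, 2.0))  # → 2.5
```

The point of the decomposition is that each caption is attributed to exactly one of the three channels, which is what makes agent-role attribution questions well-defined downstream.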

Why it matters?

This work matters because it provides a much more realistic and challenging test for AI meant to power agents in robotics or virtual worlds. By pinpointing where today's models fall short, it can help researchers improve them so they better understand and interact with the world around them, ultimately leading to more capable and intelligent agents.

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.