PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna

2026-04-02

Summary

This paper introduces a new benchmark called PerceptionComp, a set of complex questions about videos designed to test how well AI systems can 'understand' what they're seeing over a long period of time.

What's the problem?

Current computer systems struggle with videos that require them to piece together information from different moments and understand relationships between objects, actions, and locations. Existing benchmarks don't really push these systems to do the kind of detailed, step-by-step visual reasoning that humans do naturally when watching a video. Basically, it's hard for computers to watch a video and answer questions that aren't obvious from just one single frame.

What's the solution?

The researchers created PerceptionComp, a collection of 1,114 challenging questions based on 279 videos covering a variety of real-world scenarios, from city walk tours to extreme outdoor sports. These questions aren't simple: each one requires the system to connect information from multiple points in the video and use logic to figure out the answer. The researchers also tested how well humans and existing AI models performed on these questions, finding that both need to look back at the video repeatedly to get them right.

Why it matters?

This work highlights that 'perception-centric' reasoning – truly understanding what's happening in a video – is still a major hurdle for AI. The new benchmark, PerceptionComp, provides a more difficult and realistic test for AI systems, and will hopefully encourage researchers to develop AI that can truly 'see' and understand the world the way humans do.

Abstract

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.