ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han
2025-02-17
Summary
This paper introduces ZeroBench, a new benchmark designed to challenge AI systems that can understand both text and images (called Large Multimodal Models, or LMMs). It is so difficult that even the most advanced models cannot solve any of its questions right now.
What's the problem?
Current benchmarks for AI image understanding are becoming too easy as models improve, which makes it hard to tell how capable AI really is at understanding images. At the same time, AI still struggles with basic visual tasks that even young children can do, and existing benchmarks don't expose this gap clearly.
What's the solution?
The researchers created ZeroBench, a set of 100 extremely difficult questions about images that no current AI can answer correctly. They also wrote 334 easier sub-questions to help track partial progress. They evaluated 20 different AI systems on ZeroBench, and all of them scored 0% on the main questions.
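To make the evaluation protocol concrete, here is a minimal sketch of how a ZeroBench-style scoring loop could look. The file name zerobench.jsonl, the record fields, the exact-match grader, and the model_answer callable are all assumptions for illustration, not the authors' released evaluation code.

```python
import json

def grade(prediction: str, answer: str) -> bool:
    """Exact-match grading after light normalisation.
    (Hypothetical; the real benchmark may grade answers differently.)"""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(model_answer, questions_path: str = "zerobench.jsonl"):
    """Score a model on main questions and sub-questions.

    model_answer(images, question) is a placeholder for whatever LMM is
    being tested; the file path and record fields are assumptions.
    """
    main_correct = main_total = 0
    sub_correct = sub_total = 0
    with open(questions_path) as f:
        for line in f:
            q = json.loads(line)  # assumed fields: images, question, answer, subquestions
            pred = model_answer(q["images"], q["question"])
            main_total += 1
            main_correct += grade(pred, q["answer"])
            for sq in q.get("subquestions", []):
                sub_pred = model_answer(q["images"], sq["question"])
                sub_total += 1
                sub_correct += grade(sub_pred, sq["answer"])
    return {
        "main_accuracy": main_correct / main_total,
        "sub_accuracy": sub_correct / sub_total if sub_total else 0.0,
    }
```

Under this kind of scoring, the paper's headline result corresponds to every evaluated model returning a main_accuracy of 0.0, while the sub-questions provide a finer-grained signal of partial progress.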
Why does it matter?
This matters because it gives researchers a way to measure how AI image understanding improves over time: as models get better, their progress will show up on ZeroBench. It can also help researchers focus on making AI genuinely understand images rather than just getting good at specific types of questions, which could lead to AI that sees and understands the world more like humans do.
Abstract
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.