PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan

2025-10-28

Summary

This paper introduces PRISM-Bench, a new way to test how well AI models can *reason* with images, not just get the right answer.

What's the problem?

Current tests for AI vision models mostly just check if the final answer is correct. This doesn't tell us if the AI is actually thinking through the problem logically, or if it's just guessing or finding patterns that happen to work. We need a way to see *how* an AI is reasoning, and pinpoint where it makes mistakes.

What's the solution?

The researchers created a set of visual puzzles that require multiple steps of reasoning – things like understanding shapes, symbols, and analogies. Then, they gave the AI a step-by-step solution to a puzzle, but with one deliberate error in it. The AI’s job wasn’t to solve the puzzle, but to *find* the first mistake in the provided reasoning. This forces the AI to actively check its own logic.
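The evaluation described above can be sketched as a simple scoring loop. This is a minimal illustration, not the paper's actual harness: the field names (`cot_steps`, `error_step`) and the data format are assumptions made for the example.

```python
# Minimal sketch of a PRISM-Bench-style scoring loop.
# NOTE: the example data schema below is an illustrative assumption,
# not the benchmark's real format.

def score_error_detection(examples, predict_first_error):
    """Fraction of puzzles where the model flags the correct first faulty step."""
    correct = 0
    for ex in examples:
        # ex["cot_steps"]: the step-by-step chain of thought shown to the model
        # ex["error_step"]: 1-based index of the single injected error
        pred = predict_first_error(ex["image"], ex["question"], ex["cot_steps"])
        if pred == ex["error_step"]:
            correct += 1
    return correct / len(examples)

# Toy usage with a trivial "model" that always guesses step 2:
examples = [
    {"image": None, "question": "Which shape completes the analogy?",
     "cot_steps": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
     "error_step": 2},
    {"image": None, "question": "What comes next in the sequence?",
     "cot_steps": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
     "error_step": 3},
]
accuracy = score_error_detection(examples, lambda img, q, steps: 2)
print(accuracy)  # 0.5: the constant guesser matches only the first puzzle
```

Because there is exactly one error per chain, scoring reduces to exact-match on the error's step index, which is what makes the diagnostic signal fine-grained.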

Why does it matter?

This is important because even AI models that seem to explain their thinking clearly can still make basic logical errors. PRISM-Bench helps us understand the difference between an AI that *sounds* smart and one that *actually* reasons correctly, which is crucial for building AI we can trust.

Abstract

We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.