SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

2025-11-27

Summary

This paper introduces Sphinx, a synthetic environment designed to test how well AI models can 'see' and reason about images, much as humans do.

What's the problem?

Current AI models, even very advanced ones, struggle with tasks that combine visual perception and reasoning. Measuring these abilities accurately is hard: collecting enough real-world test examples is expensive, and the 'right' answer for those examples isn't always unambiguous. Essentially, we need a reliable way to test whether an AI truly understands what it's looking at and can draw logical conclusions.

What's the solution?

The researchers created Sphinx, which automatically generates a huge number of visual puzzles across 25 task types. These puzzles are built from basic shapes, patterns, charts, and icons, and crucially, the system *knows* the correct solution to each one, so answers can be checked automatically. They then tested several state-of-the-art AI models on these puzzles; even the powerful GPT-5 reached only 51.1% accuracy, well below human performance. They also showed that training AI with a method called 'reinforcement learning with verifiable rewards' (RLVR) – where the model is rewarded only for answers that can be checked as correct – significantly improved its performance, not just on Sphinx puzzles but also on other visual reasoning tasks.
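The key idea – pairing each procedurally generated puzzle with a machine-checkable answer that can double as a training reward – can be sketched in a few lines. This is an illustrative stand-in (a mirror-symmetry task on a binary grid, with hypothetical function names), not the paper's actual implementation:

```python
import random

def generate_symmetry_puzzle(size=4, symmetric=None, rng=None):
    """Generate a binary grid that is (or is not) left-right symmetric.

    Returns the puzzle together with its verifiable ground-truth label,
    mimicking how a synthetic environment always 'knows' the answer.
    """
    rng = rng or random.Random()
    if symmetric is None:
        symmetric = rng.random() < 0.5
    # Build a symmetric grid by mirroring a random left half.
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    grid = [row + row[::-1] for row in half]
    if not symmetric:
        # Break symmetry by flipping one cell in the right half only.
        r = rng.randrange(size)
        c = rng.randrange(size // 2, size)
        grid[r][c] ^= 1
    return grid, symmetric

def verifiable_reward(grid, predicted):
    """RLVR-style reward: 1.0 iff the model's answer matches the checkable truth."""
    truth = all(row == row[::-1] for row in grid)
    return 1.0 if predicted == truth else 0.0
```

Because the reward is computed from the puzzle itself rather than from human labels, it can be evaluated at scale with no ambiguity, which is what makes both large-scale benchmarking and RLVR training practical.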

Why does it matter?

This work is important because it provides a standardized and reliable way to evaluate and improve AI's visual reasoning abilities. By identifying where current AI models fall short, and demonstrating a successful training method, it paves the way for building AI systems that can better understand and interact with the visual world, which is crucial for applications like robotics, self-driving cars, and image analysis.

Abstract

We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.