RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez

2025-06-20

Summary

This paper introduces RE-IMAGINE, a benchmark that tests how well large language models can solve reasoning problems by generating new variations of those problems that can't be answered just by remembering training data.

What's the problem?

The problem is that many AI models rely heavily on memorizing patterns from their training data rather than truly reasoning through new or different problems, which makes it hard to tell whether high benchmark scores reflect genuine understanding of the task.

What's the solution?

The researchers developed a pipeline that rewrites reasoning problems in a symbolic form and then alters that form to generate new versions of each problem, forcing models to actually apply logic rather than recall past answers. This makes it possible to evaluate the true reasoning skills of the AI.
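
The core mechanic can be illustrated with a minimal sketch (our own illustration, not the authors' actual pipeline): a word problem is paired with a symbolic representation, here a parameterized template plus a solver function, and fresh parameter values are sampled so that each variant comes with a newly computed ground-truth answer that cannot be looked up from memory. The template, function names, and value ranges below are hypothetical.

```python
import random

# Hypothetical sketch of symbolic benchmark synthesis: a reasoning problem
# is stored as a parameterized template plus a solver that computes the
# ground-truth answer from the same symbolic form. Mutating the parameters
# yields unseen variants whose answers cannot be retrieved by memorization.

TEMPLATE = (
    "Ali has {a} apples. He buys {b} more and gives {c} to a friend. "
    "How many apples does Ali have now?"
)

def solve(a: int, b: int, c: int) -> int:
    # Ground-truth answer derived from the symbolic structure of the problem.
    return a + b - c

def make_variation(seed: int) -> tuple[str, int]:
    # Sample fresh parameter values to synthesize a new problem variant.
    rng = random.Random(seed)
    a, b = rng.randint(10, 50), rng.randint(10, 50)
    c = rng.randint(1, 20)
    return TEMPLATE.format(a=a, b=b, c=c), solve(a, b, c)

if __name__ == "__main__":
    for seed in range(3):
        question, answer = make_variation(seed)
        print(f"{question}  -> expected: {answer}")
```

Because the answer is recomputed from the symbolic form for every variant, the evaluation never depends on answers a model may have memorized during training.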

Why does it matter?

This matters because it allows us to measure and improve how well AI models think and reason, making them more reliable and useful in solving complex and novel problems.

Abstract

RE-IMAGINE evaluates the reasoning abilities of Large Language Models by generating variations of problems that cannot be solved by memorization; a drop in performance on these variations indicates reliance on statistical recall rather than genuine reasoning.