FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
2025-02-28
Summary
This paper introduces FINEREASON, a new way to test and improve how well AI language models can think through complex problems step by step, using logic puzzles as the testbed.
What's the problem?
Current ways of testing AI's problem-solving skills mostly check whether it gets the final answer right, not how it reasons through each intermediate step. This means we can't tell whether the AI is genuinely reasoning, or whether it can notice and correct its own mistakes along the way, or if it's just guessing.
What's the solution?
The researchers created FINEREASON, a benchmark of logic puzzles that can be broken down into small, verifiable steps. They test the AI on two tasks: state checking (is the current state of the puzzle valid and still solvable?) and state transition (what is a correct next move?). They also built a puzzle training set; models trained on it improved on general math tasks, gaining up to 5.1% on GSM8K.
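To make the two tasks concrete, here is a minimal sketch of what state checking and state transition look like on a tiny 4x4 Sudoku. This is an illustrative example, not the paper's actual puzzle suite or evaluation code; the function names (`is_valid_state`, `next_moves`) and the choice of Sudoku are assumptions for demonstration.

```python
# Illustrative sketch of FINEREASON's two task types on a 4x4 Sudoku.
# 0 marks an empty cell. Hypothetical helper names, not the paper's code.

def is_valid_state(grid):
    """State checking: does the partially filled grid violate any rule?"""
    n = len(grid)                # 4 for a 4x4 puzzle
    box = int(n ** 0.5)          # 2x2 sub-boxes
    units = []
    units += [[(r, c) for c in range(n)] for r in range(n)]           # rows
    units += [[(r, c) for r in range(n)] for c in range(n)]           # columns
    units += [[(br + i, bc + j) for i in range(box) for j in range(box)]
              for br in range(0, n, box) for bc in range(0, n, box)]  # boxes
    for unit in units:
        filled = [grid[r][c] for r, c in unit if grid[r][c] != 0]
        if len(filled) != len(set(filled)):   # duplicate value -> invalid
            return False
    return True

def next_moves(grid):
    """State transition: all locally legal (row, col, value) moves."""
    n = len(grid)
    moves = []
    for r in range(n):
        for c in range(n):
            if grid[r][c] == 0:
                for v in range(1, n + 1):
                    grid[r][c] = v            # tentatively place v
                    if is_valid_state(grid):
                        moves.append((r, c, v))
                    grid[r][c] = 0            # undo
    return moves
```

An LLM's yes/no answer to "is this state valid?" can be graded against `is_valid_state`, and a proposed move can be checked for membership in `next_moves`, which is what makes each atomic step of the puzzle rigorously verifiable.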
Why it matters?
This matters because as AI gets more capable, we need better ways to verify that it can think through problems carefully, the way humans do. FINEREASON helps researchers understand how AI models reason step by step, which could lead to smarter and more reliable AI systems for solving complex real-world problems. The improvement in math skills suggests that this puzzle-based training can transfer to other kinds of reasoning too.
Abstract
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks rely heavily on final-answer accuracy, leaving a model's intermediate reasoning steps largely unexamined. This fails to assess the model's ability to reflect on and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data achieve gains in math reasoning of up to 5.1% on GSM8K.