Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu
2026-02-09
Summary
This paper tackles the challenge of automatically evaluating solutions to complex math problems generated by artificial intelligence. While AI is getting better at *attempting* these problems, it's still really hard to tell if the answers are actually correct without a human expert checking them.
What's the problem?
Checking AI-generated math solutions is a major roadblock because it requires a lot of time from mathematicians, and current methods for automatically assessing these solutions aren't very reliable. The core issue is that a good solution to a math problem shouldn't just get the right answer; it should also provide a useful approach that can help solve similar problems. Existing automatic evaluators don't really capture this 'usefulness' or 'generalizability' of a solution.
What's the solution?
The researchers developed a new way to evaluate solutions called 'Consequence-Based Utility'. Instead of directly judging whether an answer is right or wrong, this method tests how helpful the solution is when trying to solve *other*, related math problems. It essentially uses the proposed solution as a worked example to see if it improves the AI's performance on those nearby problems. If it does, the solution is judged to be a good one, without ever checking the original answer directly. Because this approach needs no ground-truth answer or human feedback for the original problem, it is 'oracle-free'. A minimal sketch of the scoring loop is shown below.
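To make the idea concrete, here is a small sketch of that scoring loop. It is an illustration of the general recipe under stated assumptions, not the authors' implementation: `generate` is a hypothetical function that calls a language model and returns its final answer, and each neighboring question is assumed to come with an automatically checkable answer.

```python
def consequence_based_utility(problem, candidate_solution, neighbors, generate):
    """Score a candidate by how much it helps as an in-context exemplar
    on related, verifiable questions (higher = more useful)."""
    solved = 0
    for neighbor_question, checkable_answer in neighbors:
        # Use the candidate solution as a worked example in the prompt.
        prompt = (
            f"Example problem:\n{problem}\n"
            f"Example solution:\n{candidate_solution}\n\n"
            f"Now solve the following problem:\n{neighbor_question}\n"
        )
        answer = generate(prompt)
        # Neighbor questions are verifiable, so correctness can be checked
        # automatically, with no expert or oracle in the loop.
        if answer.strip() == str(checkable_answer).strip():
            solved += 1
    # The fraction of neighbors solved serves as the candidate's utility score.
    return solved / len(neighbors)
```

Candidates for the same research-level problem can then be ranked by this score, with higher-utility candidates treated as more likely to be correct.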
Why it matters?
This work is important because it offers a more effective and efficient way to evaluate AI-generated math solutions. By focusing on the usefulness of a solution rather than just its correctness, the new method significantly improves the ranking of candidate solutions, so the best answers are more likely to be identified. This could speed up progress in AI-assisted mathematical research by reducing the burden on human experts and allowing AI to learn more effectively.
Abstract
Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances that the underlying solver often fails to solve.
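For readers unfamiliar with the ranking metric, the sketch below illustrates one plausible reading of Acc@1, namely whether the highest-scored candidate for a problem is a correct solution; this reading and the toy numbers are assumptions for illustration, not definitions taken from the paper.

```python
def acc_at_1(candidates):
    """candidates: list of (utility_score, is_correct) pairs for one problem.
    Returns 1.0 if the highest-scored candidate is correct, else 0.0."""
    top_score, top_is_correct = max(candidates, key=lambda c: c[0])
    return 1.0 if top_is_correct else 0.0

# Toy example: four candidates for one problem, one of which is correct.
candidates = [(0.8, True), (0.7, False), (0.6, False), (0.3, False)]
print(acc_at_1(candidates))  # 1.0, since the correct candidate scores highest
```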