
Diverse Inference and Verification for Advanced Reasoning

Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell

2025-02-17


Summary

This paper introduces a new way to make AI systems better at solving really hard math and reasoning problems by combining different methods and checking their work, kind of like having a team of smart students work together and double-check each other's answers.

What's the problem?

Even though AI has gotten really good at math and coding, it still struggles with super hard problems like those from math competitions, tricky puzzles, and complex reasoning questions. These problems are too difficult for current AI to solve reliably.

What's the solution?

The researchers created a system that uses multiple AI models working together instead of just one. They also made the AI check its own answers, like having a teacher grade its work. For math problems, the system uses a proof checker called Lean. For puzzles, it runs the AI's solution as code to see if it actually works. For other types of questions, it has the AI come up with multiple answers and pick the best one. The researchers also used techniques that help the AI learn from its mistakes and improve over time.
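The "generate several answers, keep only the ones that pass a check, then pick the best survivor" idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_candidate` and `verify` are hypothetical stand-ins for an LLM call and a domain checker (such as a Lean proof check or executing a puzzle solution as code).

```python
import random

def generate_candidate(rng):
    # Hypothetical stand-in for sampling an answer from a model;
    # here, a noisy guess at the true answer 4.
    return rng.choice([3, 4, 4, 4, 5])

def verify(answer):
    # Hypothetical stand-in for an automatic checker (e.g. running
    # a puzzle solution as code, or checking a proof with Lean).
    return answer == 4

def best_of_n(n, seed=0):
    # Sample n candidates, discard any that fail verification
    # (rejection sampling), then majority-vote over the survivors.
    rng = random.Random(seed)
    survivors = [a for a in (generate_candidate(rng) for _ in range(n))
                 if verify(a)]
    if not survivors:
        return None  # no candidate passed the checker
    return max(set(survivors), key=survivors.count)

print(best_of_n(8))
```

Because failed candidates are filtered out before the vote, a strong verifier lets even an unreliable generator reach a reliable final answer, which is the intuition behind the paper's accuracy gains.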

Why it matters?

This matters because it shows we can make AI much smarter at solving really hard problems by having it work more like a team of experts rather than just one brain. It could help AI tackle challenges in math, science, and other fields that were too hard before. This could lead to breakthroughs in research, education, and problem-solving in many areas, making AI a more powerful tool for helping humans with complex tasks.

Abstract

Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still find advanced tasks challenging, such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems, is simple and effective. We automatically verify correctness of solutions to IMO problems by Lean, and ARC puzzles by code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
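To see what "automatically verify correctness ... by Lean" means in practice, here is a toy illustration (not from the paper): once a mathematical claim is formalized as a Lean statement, the Lean kernel checks the proof mechanically, so a solution is verified rather than trusted.

```lean
-- Toy Lean 4 example (hypothetical, for illustration only):
-- if this file compiles, the kernel has certified the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Formalized IMO solutions are far larger, but the principle is the same: compilation success is the correctness signal.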