Once Upon an Input: Reasoning via Per-Instance Program Synthesis
Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
2025-10-28
Summary
This paper focuses on improving how well large language models, like those powering chatbots, can solve complicated problems that require multiple steps of reasoning.
What's the problem?
While large language models are good at making initial guesses, they often struggle when a problem requires a series of logical steps to solve. Existing methods like 'Chain of Thought' and 'Program of Thought' try to help by having the model show its work, but these methods sometimes produce incorrect or nonsensical solutions, especially on mathematical or algorithmic problems.
What's the solution?
The researchers developed a new technique called 'Per-Instance Program Synthesis' or PIPS. This method has the model write out a 'program' – a set of instructions – to solve each specific problem, then refine that program using feedback about its structure. Importantly, it doesn't need task-specific examples, test cases, or guidance on *how* to write the program. PIPS also includes a confidence measure that decides, for each problem, whether to write a program or give a direct answer, choosing whichever seems more likely to be correct.
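To make the idea concrete, here is a minimal sketch of a per-instance loop in this spirit: generate a program for one problem, refine it on structural checks (no test cases), and fall back to direct inference when confidence is low. All names (`structural_feedback`, `synthesize_program`, `answer`) and the specific checks are illustrative assumptions, not the paper's actual implementation.

```python
def structural_feedback(program: str) -> list[str]:
    """Inspect the candidate program's structure and list any issues.
    (Illustrative checks only: does it parse, does it define an entry point.)"""
    issues = []
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError as e:
        issues.append(f"syntax error: {e.msg}")
    if "def solve" not in program:
        issues.append("missing a solve() entry point")
    return issues

def synthesize_program(llm, problem: str, max_refinements: int = 3) -> str:
    """Write a program for this one problem, refining on structural feedback."""
    program = llm(f"Write a Python program with solve() that answers: {problem}")
    for _ in range(max_refinements):
        issues = structural_feedback(program)
        if not issues:
            break
        program = llm(f"Fix these issues {issues} in:\n{program}")
    return program

def answer(llm, problem: str, confidence, threshold: float = 0.5):
    """Per-instance choice between program synthesis and direct inference."""
    if confidence(problem) >= threshold:
        scope: dict = {}
        exec(synthesize_program(llm, problem), scope)  # run the program
        return scope["solve"]()
    return llm(f"Answer directly: {problem}")
```

The key design point the sketch mirrors is that the refinement signal comes only from the program's own structure, so no ground-truth tests or task-specific hints are needed per instance.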
Why does it matter?
This work is important because it significantly improves the accuracy of large language models on difficult reasoning tasks. It boosts performance across a wide range of challenges, including complex questions, visual puzzles, and mathematical problems, while also reducing the number of incorrect 'programs' the model generates. This means we can potentially rely on these models for more complex tasks and get more trustworthy results.
Abstract
Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance-level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.