Multi-Turn Code Generation Through Single-Step Rewards

Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury

2025-03-03

Summary

This paper introduces muCode, a new way to teach AI to write and fix computer code step by step, using feedback from its own mistakes to improve.

What's the problem?

Existing methods for generating code either don't use feedback at all or rely on complicated multi-turn reinforcement learning systems that are hard to train and inefficient. These methods struggle with tasks where the AI needs to fix or improve its code over multiple steps.

What's the solution?

The researchers created muCode, which simplifies the process by focusing on single-step rewards. This means the AI learns to fix its code one step at a time, treating each step as a chance to recover and improve. The system includes two parts: a generator that writes the code and a verifier that checks how good the code is. By training these two parts together, muCode gets better at using feedback to refine its solutions.
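The generate-and-verify loop described above can be sketched in a few lines. Note that this is a toy illustration of the idea, not the paper's actual implementation: the generator, verifier, and "execution feedback" below are simple numeric stand-ins (assumptions) for what would really be a language model, a learned scoring model, and test results from running code.

```python
# Toy sketch of a muCode-style refinement loop: each turn, a generator
# proposes candidate fixes conditioned on execution feedback, and a
# verifier scores them so the best candidate becomes the next state.
# All three components here are hypothetical numeric stand-ins.

TARGET = 10  # stands in for "code that passes all tests"

def run_tests(state):
    """Toy execution feedback: whether the state is correct, plus a signed error."""
    return state == TARGET, TARGET - state

def generator(state, feedback, n_candidates=3):
    """Toy generator: proposes candidate 'edits' in the direction of the feedback."""
    step = 1 if feedback > 0 else -1
    return [state + step * k for k in range(1, n_candidates + 1)]

def verifier(candidate):
    """Toy verifier: scores a candidate (higher is better)."""
    return -abs(TARGET - candidate)

def mu_code_loop(initial_state, max_turns=10):
    state = initial_state
    for turn in range(max_turns):
        passed, feedback = run_tests(state)
        if passed:
            return state, turn
        candidates = generator(state, feedback)
        # Single-step reward: greedily pick the best next state per the verifier.
        state = max(candidates, key=verifier)
    return state, max_turns

final_state, turns_used = mu_code_loop(0)
print(final_state, turns_used)
```

The key design choice this mirrors is that the loop never plans multiple turns ahead: because a correct solution is assumed to be reachable from any intermediate state in one step (the "one-step recoverable" insight), greedily picking the verifier's top-scored candidate each turn is enough.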

Why it matters?

This matters because it makes AI coding tools more efficient and effective. muCode performs better than older methods while using fewer resources, which could make it easier for developers to use AI for writing and debugging code. This could speed up software development and make it more accessible to people who don’t have advanced programming skills.

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, muCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. muCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of muCode at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.