Self-Execution Simulation Improves Coding Models

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

2026-04-07

Summary

This paper explores how to make large language models (LLMs) better at writing correct code, specifically by teaching them to 'think through' how their code will actually run.

What's the problem?

LLMs are good at *writing* code that looks right, but that code often fails when it actually runs. The underlying issue is that models struggle to predict what a program will do when executed, even when they wrote the program themselves. This hurts most in settings like competitive programming, where a single small error makes the whole solution wrong.

What's the solution?

The researchers trained LLMs to simulate running code step by step. The training combines two ingredients: first, supervised fine-tuning on plain-language explanations of execution that are grounded in real program runs; second, reinforcement learning with verifiable rewards on two tasks, namely predicting a program's output from its code and inputs, and solving competitive programming problems with execution feedback that is either real or predicted by the model itself. With this skill the model can check its own work: it simulates how its candidate solutions behave on tests, compares the predicted outputs to the expected ones, chooses among candidates, and repeatedly fixes the ones that fail.
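The self-checking step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Predictor`, `count_predicted_passes`, `select_candidate`, and `toy_predict` are hypothetical names, and `toy_predict` is a hard-coded stand-in for a model that simulates execution.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: stands in for the model's step-by-step execution
# simulation. Maps (code, test_input) to the output the model predicts.
Predictor = Callable[[str, str], str]
TestCase = Tuple[str, str]  # (input, expected output)


def count_predicted_passes(code: str, tests: List[TestCase], predict: Predictor) -> int:
    """Count the tests that the candidate is *predicted* to pass (nothing is executed)."""
    passes = 0
    for test_input, expected in tests:
        if predict(code, test_input).strip() == expected.strip():
            passes += 1
    return passes


def select_candidate(candidates: List[str], tests: List[TestCase], predict: Predictor) -> str:
    """Return the candidate with the most predicted passes (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda code: count_predicted_passes(code, tests, predict))


def toy_predict(code: str, test_input: str) -> str:
    """Toy stand-in for a simulating model: it only 'knows' two programs."""
    n = int(test_input)
    if code == "print(n * 2)":
        return str(n * 2)
    if code == "print(n + 2)":
        return str(n + 2)
    return ""


tests = [("2", "4"), ("3", "6")]  # task: double the input
candidates = ["print(n + 2)", "print(n * 2)"]
best = select_candidate(candidates, tests, toy_predict)
```

In this toy run the first candidate is predicted to pass only one of the two tests, so the second is selected. A self-fixing loop would feed the predicted failures back to the model and regenerate, which is omitted here.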

Why does it matter?

This research moves us closer to LLMs that can reliably generate working code. A model that accurately predicts how its code will behave can catch and repair its own mistakes, which makes it more useful for software development and automated problem-solving. The paper reports consistent improvements over standard reasoning approaches across multiple competitive programming benchmarks.

Abstract

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
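As a rough illustration of what a verifiable reward for the output-prediction objective could look like, the sketch below runs a program for real and scores a predicted output against the result. The binary exact-match scoring, the zero reward for crashing programs, and the names `run_program` and `output_prediction_reward` are assumptions made for this example; the abstract does not specify the reward. In-process `exec` is used only to keep the sketch short, and a real setup would run untrusted code in a sandbox.

```python
import contextlib
import io
import sys


def run_program(code: str, stdin_text: str) -> str:
    """Execute `code` in-process, feeding `stdin_text`, and return what it prints.

    Assumption for this sketch only: the code is trusted. Do not exec untrusted code.
    """
    captured = io.StringIO()
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)  # input() reads from the replaced stdin
    try:
        with contextlib.redirect_stdout(captured):
            exec(code, {"__name__": "__main__"})
    finally:
        sys.stdin = old_stdin
    return captured.getvalue()


def output_prediction_reward(code: str, stdin_text: str, predicted_output: str) -> float:
    """Assumed scoring: 1.0 if the prediction matches true execution, else 0.0."""
    try:
        actual = run_program(code, stdin_text)
    except Exception:
        return 0.0  # no ground-truth output; this sketch assigns zero reward
    return 1.0 if predicted_output.strip() == actual.strip() else 0.0


program = "n = int(input())\nprint(n * n)"
reward_correct = output_prediction_reward(program, "7\n", "49")
reward_wrong = output_prediction_reward(program, "7\n", "14")
```

Because the ground truth comes from actually running the program, the reward can be checked automatically with no human labels.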
