Embarrassingly Simple Self-Distillation Improves Code Generation
Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang
2026-04-02
Summary
This paper investigates whether a large language model can get better at writing code simply by learning from its *own* attempts, without needing help from other models or complex training methods like reinforcement learning.
What's the problem?
Large language models are pretty good at generating code, but they still struggle with harder problems and often make mistakes. Improving these models usually requires extra resources: a verifier model to check the generated code, a stronger 'teacher' model to guide the learning process, or reinforcement learning, which is complex to set up. The question is whether code generation can be improved without any of that extra machinery.
What's the solution?
The researchers propose a surprisingly simple method called 'simple self-distillation' (SSD). The model generates multiple candidate solutions to each coding problem, using varied temperature and truncation settings to encourage diversity. Those raw samples, with no filtering or verification, are then used as training data to fine-tune the original model with standard supervised fine-tuning. In effect, the model learns from its own outputs without any external feedback.
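The data-collection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `model` is a stand-in callable for a real LLM sampling call, and the function names and temperature values are made up for the example.

```python
def sample_completions(model, prompt, temperatures):
    # Sample one candidate solution per decoding configuration;
    # varying the temperature encourages diverse attempts.
    return [model(prompt, temperature=t) for t in temperatures]

def build_self_distillation_dataset(model, prompts,
                                    temperatures=(0.6, 0.8, 1.0, 1.2)):
    # Collect raw (prompt, completion) pairs from the model's own
    # samples; these become ordinary SFT training examples, with
    # no verifier or teacher filtering them.
    dataset = []
    for prompt in prompts:
        for completion in sample_completions(model, prompt, temperatures):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Stub standing in for a real LLM call, just for illustration.
def stub_model(prompt, temperature):
    return f"# candidate solution for {prompt!r} (T={temperature})"

data = build_self_distillation_dataset(stub_model, ["two_sum", "fizzbuzz"])
```

The resulting `data` list would then be fed to a standard supervised fine-tuning run on the same model that produced it.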
Why does it matter?
This research is important because it shows that significant improvements in code generation can be achieved with a very straightforward and efficient technique. It points to a new direction for improving these models, a kind of 'self-improvement', that doesn't rely on expensive or complicated training procedures. That could make it easier and cheaper to build better coding assistants and tools.
Abstract
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
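The "truncation configurations" the abstract mentions can be illustrated with nucleus (top-p) truncation, a standard decoding operation: keep the smallest set of high-probability tokens whose cumulative mass reaches p, zero out the rest, and renormalize, which cuts off the low-probability "distractor tail". This toy example is only an illustration of that mechanism; the paper's exact settings are not reproduced here.

```python
def top_p_truncate(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p (nucleus sampling), zero the tail, and renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    truncated = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(truncated)
    return [q / total for q in truncated]

# A head-heavy token distribution: the three tail tokens are removed
# and the surviving mass is renormalized over the head.
probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
reshaped = top_p_truncate(probs, p=0.9)
```

With `p=0.9`, the last three "distractor" tokens are zeroed while the relative ordering of the surviving tokens is preserved, matching the intuition of suppressing tails where precision matters.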