The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani

2025-10-08

Summary

This research investigates how much training data is needed to effectively teach a smaller, simpler AI model to solve competitive programming problems by learning from a larger, more capable AI. It focuses on understanding how performance changes as the amount of training data increases.

What's the problem?

While it's known that you can transfer skills from a powerful AI to a smaller one, it wasn't clear *how much* training data from the powerful AI is actually needed for the smaller AI to learn effectively. Specifically, researchers noticed that simply giving the smaller AI more and more examples didn't always lead to better results, and wanted to understand why.

What's the solution?

The researchers fine-tuned two small, non-reasoning AI models on varying amounts of problem-solving traces generated by a larger reasoning model working through competitive coding problems. They found a surprising pattern: at first, adding more data actually *hurt* the smaller models' performance, but past a certain point, performance climbed steadily, and faster than a log-linear trend, as more data was added. They also fine-tuned the models again at two different stages of this process on the same data, and found that in the low- and medium-low-data regimes, the smaller models learned more from easier coding problems than from harder ones. Surprisingly, whether the larger model's solutions were correct or not made no difference to how well the smaller models learned.
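
A minimal sketch of this kind of experiment, assuming the larger model's reasoning traces are stored as (problem, trace) pairs in a JSONL file; the model name, file path, subset sizes, and hyperparameters below are illustrative placeholders, not the authors' actual setup:

```python
# Sketch: supervised fine-tuning of a small "student" model on teacher traces,
# repeated at several data sizes to probe the scaling trend. All names and
# values here are assumptions for illustration, not the paper's configuration.
import json
import random

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_TRACES = "teacher_traces.jsonl"        # hypothetical path to the distillation data
STUDENT_MODEL = "Qwen/Qwen2.5-1.5B"            # stand-in for a small non-reasoning student
DATA_SIZES = [1_000, 4_000, 16_000, 64_000]    # nested subsets to trace the scaling trend


def load_traces(path):
    """Each line: {"problem": ..., "trace": ...} written out by the teacher model."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def fine_tune_on_subset(records, epochs=1, lr=1e-5, max_len=2048):
    """Plain supervised fine-tuning of the student on the teacher's traces."""
    tokenizer = AutoTokenizer.from_pretrained(STUDENT_MODEL)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(STUDENT_MODEL)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def collate(batch):
        texts = [r["problem"] + "\n" + r["trace"] for r in batch]
        enc = tokenizer(texts, truncation=True, max_length=max_len,
                        padding=True, return_tensors="pt")
        enc["labels"] = enc["input_ids"].clone()  # next-token prediction on the trace
        return enc

    loader = DataLoader(records, batch_size=2, shuffle=True, collate_fn=collate)
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


if __name__ == "__main__":
    all_traces = load_traces(TEACHER_TRACES)
    random.seed(0)
    random.shuffle(all_traces)
    for n in DATA_SIZES:
        student = fine_tune_on_subset(all_traces[:n])
        # Evaluate `student` on a held-out competitive-coding benchmark here;
        # plotting accuracy against n is what would expose the valley-shaped curve.
```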

Why it matters?

This work helps us understand the best way to train smaller AI models to perform complex tasks like coding. Knowing how much data is needed, and what kind of data is most helpful at different stages of training, can lead to more efficient and effective AI development, especially for applications where large, powerful AI models are too expensive or impractical to use.

Abstract

Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet, there is a scarcity of work on how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a valley of code reasoning: downstream performance on competitive coding first drops as data quantity increases, then it steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation beyond intuition.
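
As a rough illustration of what "sharper-than-log-linear" means here, one can fit accuracy against the logarithm of the data size and look at where the measured points sit relative to the fit; the accuracy numbers below are placeholder values for demonstration only, not results from the paper.

```python
# Illustrating "sharper than log-linear": fit accuracy ~ a + b*log(n) and
# inspect the residuals. The accuracies are placeholders, not the paper's results.
import numpy as np

data_sizes = np.array([1_000, 4_000, 16_000, 64_000])   # distillation traces used
accuracies = np.array([0.10, 0.08, 0.15, 0.30])          # placeholder pass rates

log_n = np.log(data_sizes)
b, a = np.polyfit(log_n, accuracies, deg=1)   # least-squares log-linear fit
residuals = accuracies - (a + b * log_n)

print(f"log-linear fit: acc ~= {a:.3f} + {b:.3f} * log(n)")
print("residuals:", np.round(residuals, 3))
# A dip below the fit at small n is the "valley"; points rising above the fit
# at large n indicate sharper-than-log-linear growth.
```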