Small Models Struggle to Learn from Strong Reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran

2025-02-20

Summary

This paper talks about an interesting problem when trying to teach small AI models to reason like bigger, more advanced models. It's like trying to teach a middle school student to solve complex college-level math problems - sometimes it's just too much for them to handle.

What's the problem?

Researchers have been trying to make smaller AI models learn from bigger, smarter ones. But they found that these small models often struggle to learn from long, complex reasoning processes. It's as if the small models get overwhelmed by all the information and can't make sense of it.

What's the solution?

The researchers came up with a method called Mix Distillation. Instead of only teaching the small models using complex reasoning from big models, they mix in simpler, shorter reasoning examples too. It's like giving a student a mix of challenging problems and easier ones that build up to the harder stuff.
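To make the idea concrete, here is a minimal sketch of how a mixed training set could be assembled from long and short chain-of-thought (CoT) pools. The function name, example format, and the 50/50 mixing ratio are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mix_distillation_data(long_cot_pool, short_cot_pool, long_ratio=0.5, seed=0):
    """Combine long-CoT and short-CoT examples into one fine-tuning set.

    long_ratio controls what fraction of the mixed set comes from the
    long-CoT pool; the 0.5 default here is an illustrative choice.
    """
    rng = random.Random(seed)
    n = min(len(long_cot_pool), len(short_cot_pool)) * 2
    n_long = int(n * long_ratio)
    n_short = n - n_long
    # Sample from each pool, then shuffle so hard and easy examples interleave.
    mixed = rng.sample(long_cot_pool, n_long) + rng.sample(short_cot_pool, n_short)
    rng.shuffle(mixed)
    return mixed

# Toy usage: each example is a dict with a question, a reasoning trace, and an answer.
long_pool = [{"q": f"hard-{i}", "cot": "step 1 ... step 10", "a": i} for i in range(100)]
short_pool = [{"q": f"easy-{i}", "cot": "step 1", "a": i} for i in range(100)]
train_set = mix_distillation_data(long_pool, short_pool, long_ratio=0.5)
```

The small model would then be fine-tuned on `train_set` instead of on long-CoT data alone, which is the core of the Mix Distillation idea.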

Why it matters?

This matters because we want to make AI that can reason well but doesn't need huge amounts of computing power. By finding better ways to teach small AI models, we can create smarter, more efficient AI that can be used in more places, like on smartphones or in small devices. This could make advanced AI more accessible and useful in everyday life, without needing supercomputers to run them.

Abstract

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.