SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf

2025-02-06

Summary

This paper documents the development of SmolLM2, a small language model that outperforms other recent models of similar size. It is trained in multiple stages on a large and diverse mix of data, which lets it reach strong results despite its relatively small size.

What's the problem?

Large language models perform well on many tasks, but they are computationally expensive and hard to deploy on devices with limited resources, such as smartphones or small computers. Smaller models that still perform well are needed.

What's the solution?

The researchers created SmolLM2, a 1.7 billion parameter model, which is small by current language model standards. They trained it on about 11 trillion tokens using a multi-stage process that mixes general web text with specialized data for math, coding, and instruction following. Where existing datasets were too small or too low in quality, they built new ones: FineMath, Stack-Edu, and SmolTalk. To guide these choices, they ran small-scale ablations and manually updated the data mixing rates at each stage based on the model's performance at the previous stage.
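The stage-wise mixing described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the source names, the weights in `STAGES`, and the helpers `normalize`, `sample_sources`, and `adjust` are hypothetical and are not the paper's actual mixtures or code. It only shows the general idea of drawing training documents from several sources in proportion to per-stage weights, then changing one weight between stages.

```python
import random

# Hypothetical per-stage mixing weights. These numbers are placeholders
# for illustration, NOT the ratios used to train SmolLM2.
STAGES = [
    {"web": 0.90, "code": 0.10, "math": 0.00},
    {"web": 0.75, "code": 0.15, "math": 0.10},
]


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, seed=0):
    """Pick the source of each of n training documents according to weights."""
    rng = random.Random(seed)
    probs = normalize(weights)
    names = list(probs)
    return rng.choices(names, weights=[probs[k] for k in names], k=n)


def adjust(weights, source, factor):
    """Scale one source's weight and renormalize.

    Stands in for a manual decision such as "add more math data in the
    next stage because math benchmarks lagged in the previous one".
    """
    updated = dict(weights)
    updated[source] = updated[source] * factor
    return normalize(updated)


if __name__ == "__main__":
    for i, stage in enumerate(STAGES, start=1):
        draws = sample_sources(stage, n=10_000, seed=i)
        counts = {name: draws.count(name) for name in stage}
        print(f"stage {i}: {counts}")

    # Hypothetical refinement: double the math weight for a later stage.
    print(adjust(STAGES[1], "math", 2.0))
```

In this sketch the adjustment is a single multiplicative factor chosen by hand, which mirrors the manual, performance-driven refinement in spirit only; the actual decision process is described in the paper itself.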

Why does it matter?

This research shows that a small model can perform well when it is trained with the right data and approach. That could bring more capable AI tools to everyday devices, where large models are impractical. The authors also release SmolLM2 and the datasets they prepared, to support future research on small language models.

Abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.
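To give a sense of what "overtrain" means here, the short calculation below works out the ratio of training tokens to parameters from the two figures in the abstract. The comparison value of about 20 tokens per parameter is an assumption brought in from outside this paper: it is a commonly cited compute-optimal rule of thumb, used here only as a rough reference point.

```python
# Figures taken from the abstract.
params = 1.7e9   # 1.7 billion parameters
tokens = 11e12   # ~11 trillion training tokens

# Assumption: a commonly cited compute-optimal rule of thumb of roughly
# 20 training tokens per parameter. It is not a figure from this paper.
rule_of_thumb_tokens_per_param = 20

tokens_per_param = tokens / params
ratio_to_rule_of_thumb = tokens_per_param / rule_of_thumb_tokens_per_param

print(f"tokens per parameter: {tokens_per_param:,.0f}")
print(f"multiple of the rule of thumb: {ratio_to_rule_of_thumb:,.0f}x")
```

Under these assumptions the model sees roughly 6,500 tokens per parameter, a few hundred times the rule-of-thumb ratio, which is the sense in which the training run goes far past the compute-optimal point.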
