Ministral 3

Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Clémence Lanfranchi

2026-01-14

Summary

This paper introduces the Ministral 3 series, a family of new language models designed to run efficiently even on devices with limited computing power and memory. They come in three sizes – small (3 billion parameters), medium (8 billion), and large (14 billion) – and each size is released in three variants: a pretrained base model, an instruction-finetuned model for following directions, and a reasoning model for complex problem-solving.
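
As a rough illustration of how one of these variants might be used, here is a minimal sketch that loads an instruction-tuned checkpoint with the Hugging Face transformers library. The repository id below is hypothetical and simply stands in for whichever name the released weights actually use.

```python
# Hypothetical usage sketch: the model id below is an assumption,
# not a name confirmed by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Ministral-3-8B-Instruct"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Summarize knowledge distillation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```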

What's the problem?

Large language models are incredibly powerful, but they often require substantial computing resources, making them difficult to run on everyday devices like phones or laptops. Existing methods for shrinking models can significantly reduce their performance, leaving them noticeably worse at understanding and generating text.

What's the solution?

The researchers created the Ministral 3 models using a technique called Cascade Distillation. It starts from a larger, more capable model and alternates between pruning away less important parts and continuing to train the smaller model with distillation, so that accuracy lost at each pruning step is recovered. This process yields smaller models that retain much of the original model's capability, and the researchers also gave the models the ability to understand images. Everything is released under the Apache 2.0 license, which permits broad use.
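
The summary does not spell out the pruning criterion or training schedule, so the following is only a minimal sketch of what an iterative prune-then-distill loop can look like, assuming magnitude-based weight pruning, a KL-divergence distillation loss, and models that map a batch of token ids directly to logits. All function and parameter names here are illustrative, not the paper's.

```python
# Illustrative sketch of an iterative prune-and-distill loop.
# Assumptions (not from the paper): magnitude pruning, KL distillation loss,
# and models that map a batch of token ids directly to logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

@torch.no_grad()
def prune_by_magnitude(model, fraction=0.1):
    """Zero out the smallest-magnitude weights in each matrix.
    Note: without a persistent mask, zeroed weights can regrow during
    training; a real implementation would keep and reapply masks."""
    for param in model.parameters():
        if param.dim() < 2:  # skip biases and norm parameters
            continue
        k = int(param.numel() * fraction)
        if k == 0:
            continue
        threshold = param.abs().flatten().kthvalue(k).values
        param[param.abs() <= threshold] = 0.0

def cascade_distill(teacher, student, dataloader, optimizer,
                    rounds=3, steps_per_round=1000):
    """Alternate pruning the student with continued training against the teacher."""
    teacher.eval()
    for _ in range(rounds):
        prune_by_magnitude(student)  # shrink the student a little...
        for step, input_ids in enumerate(dataloader):
            with torch.no_grad():
                teacher_logits = teacher(input_ids)
            student_logits = student(input_ids)
            loss = distillation_loss(student_logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step + 1 >= steps_per_round:  # ...then recover accuracy
                break
    return student
```

The loop carries the same student through successive rounds, each building on the last, which matches the iterative flavor the abstract describes.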

Why it matters?

These models are important because they make powerful language model technology more accessible. By creating efficient models, they open the door for using these tools on a wider range of devices and in more applications where computing resources are limited. The image understanding capability also expands the types of problems these models can solve.

Abstract

We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, a technique that combines iterative pruning with continued training under distillation. Each model comes with image understanding capabilities, and all are released under the Apache 2.0 license.