Voxtral
Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Antoine Roux
2025-07-18
Summary
This paper introduces Voxtral Mini and Voxtral Small, multimodal AI models designed to understand both spoken audio and text, capable of handling very long conversations and audio clips thanks to a large context window.
What's the problem?
Many audio chat models struggle to retain and process long conversations or lengthy audio clips, which limits their ability to follow a discussion or respond accurately over time.
What's the solution?
The authors developed the Voxtral Mini and Voxtral Small models, which can process up to 32,000 tokens of combined audio and text, allowing them to keep track of extended conversations and better understand long audio inputs.
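To give a feel for what a 32,000-token budget means in practice, here is a small back-of-the-envelope sketch. The audio token rate used below (12.5 tokens per second of audio) is an assumption for illustration only, not a figure taken from this summary:

```python
# Rough capacity estimate for a 32K-token context window.
# AUDIO_TOKENS_PER_SECOND is an assumed encoder output rate,
# used here only to make the arithmetic concrete.
AUDIO_TOKENS_PER_SECOND = 12.5

def max_audio_minutes(context_tokens: int, reserved_text_tokens: int = 0) -> float:
    """Minutes of audio that fit after reserving some tokens for text."""
    audio_tokens = context_tokens - reserved_text_tokens
    return audio_tokens / AUDIO_TOKENS_PER_SECOND / 60

print(round(max_audio_minutes(32_000), 1))         # -> 42.7
print(round(max_audio_minutes(32_000, 2_000), 1))  # -> 40.0
```

Under this assumed rate, the full window would hold roughly 40 minutes of audio, with some tokens left over for the text side of the conversation.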
Why it matters?
This matters because it makes voice-based AI assistants more effective in longer, more natural interactions, improving the user experience in applications such as customer service, virtual assistants, and communication tools.
Abstract
Voxtral Mini and Voxtral Small are multimodal audio chat models that excel in understanding spoken audio and text, with a 32K context window for extended audio and conversation handling.