
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz

2025-04-17


Summary

This paper introduces SIFT-50M, a huge dataset of 50 million examples that pair speech with text. It is used to train and improve large language models so they can better understand and follow spoken instructions in many different languages.

What's the problem?

The problem is that most language models are mainly trained on written text, so they're not as good at handling spoken language, especially when it comes to following instructions or understanding speech in multiple languages. This limits how helpful these models can be for things like voice assistants or translation tools.

What's the solution?

The researchers created the SIFT-50M dataset, which contains a massive amount of speech paired with text instructions across many languages. They used it to fine-tune and pre-train large language models, making them much better at understanding and responding to spoken instructions. They also evaluated these improved models with a dedicated benchmark called EvalSIFT and found that they outperformed existing models.
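The core recipe behind this kind of training is simple: take examples that pair spoken input with a text instruction and an expected response, format each one as an instruction-following prompt, and fine-tune a language model on them with a standard next-token objective. Below is a minimal sketch of that idea in Python using Hugging Face libraries. The dataset file, the column names ("instruction", "transcript", "answer"), and the base model ("gpt2") are placeholders chosen for illustration, not the authors' actual setup.

```python
# Minimal sketch of speech-instruction fine-tuning, NOT the authors' pipeline.
# Assumptions (placeholders): the data file, the column names
# ("instruction", "transcript", "answer"), and the base model ("gpt2").
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical local JSONL file; the real dataset pairs audio with text
# instructions, represented here by a transcript column for simplicity.
dataset = load_dataset("json", data_files="sift_subset.jsonl", split="train")

base_model = "gpt2"  # placeholder base LLM for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

def format_example(example):
    # Turn one (instruction, speech transcript, answer) triple into a single
    # training string, then tokenize it.
    text = (
        f"Instruction: {example['instruction']}\n"
        f"Speech (transcribed): {example['transcript']}\n"
        f"Answer: {example['answer']}"
    )
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sift_finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # The causal-LM collator copies input_ids into labels, so the loss is
    # plain next-token prediction over the instruction-response text.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that the actual models described in the paper consume spoken audio rather than transcripts; the transcript column above is a simplification so the sketch stays self-contained, but the training objective over instruction-response pairs is the same idea.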

Why it matters?

This matters because it helps make AI tools more useful and accurate for people who want to interact using their voice, no matter what language they speak. It opens up new possibilities for better voice assistants, translation services, and accessibility for people around the world.

Abstract

SIFT-50M, a 50M-example speech-text dataset, is used to fine-tune and pre-train LLMs for instruction-following and foundational speech tasks; the resulting models outperform existing ones, with evaluation conducted on the EvalSIFT benchmark.