Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

2025-05-15

Summary

This paper introduces Omni-R1, a new way to improve large language models on audio tasks, such as answering questions about sounds, music, and speech, by fine-tuning them in a smarter way.

What's the problem?

Training AI models to understand and answer questions about audio usually requires large amounts of actual audio data, which is hard to collect and process. This makes it difficult to achieve good results across many different types of sounds.

What's the solution?

The researchers fine-tuned a model called Qwen2.5-Omni using a reinforcement learning method called GRPO (Group Relative Policy Optimization) on a dataset built for audio question-answering. This approach helped the model achieve top performance at answering questions about various audio topics, including music and speech, without needing huge amounts of raw audio data.
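To give a feel for how GRPO works, here is a minimal sketch of its core idea: for each question, the model samples a group of candidate answers, each answer gets a reward, and the "advantage" of each answer is its reward normalized within the group. The function name, rewards, and numbers below are illustrative, not from the paper.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / std.

    Answers that beat the group average get a positive advantage and are
    made more likely by the policy update; below-average answers get a
    negative advantage and are made less likely.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one audio question, rewarded 1.0 if
# correct and 0.0 otherwise (a hypothetical reward scheme).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

A key practical appeal of GRPO is that it needs no separate learned value model: the group average itself serves as the baseline, which keeps fine-tuning cheap.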

Why it matters?

This matters because it shows we can build powerful audio AI systems more easily and efficiently, which could improve things like voice assistants, music recognition, and accessibility tools for people with hearing challenges.

Abstract

Omni-R1 fine-tunes Qwen2.5-Omni with GRPO on an audio QA dataset, achieving state-of-the-art performance in sound, music, speech, and average categories.