Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeongoon Kim, Jaegul Choo, Cheonbok Park

2026-03-30

Summary

This paper addresses the challenge of building AI systems that can hold realistic, two-way spoken conversations the way humans do, moving beyond models that only process text to models that understand and respond to speech.

What's the problem?

Building these 'speech language models' is difficult because little high-quality data is readily available in which multiple people converse, let alone talk *at the same time*. Most existing large-scale datasets contain only one speaker, and the rest are limited in volume. This makes it hard for AI to learn how real conversations flow, including people interrupting each other or giving small cues like 'uh-huh' to show they are listening. Standard speech-processing pipelines also struggle to determine *who* is speaking when (diarization errors), and speech recognizers sometimes transcribe words that were never said (hallucinations).

What's the solution?

The researchers developed a new, publicly available pipeline for cleaning and organizing speech data specifically for full-duplex conversations, in which both sides can speak and listen at the same time. The pipeline is designed to handle the complexities of multiple speakers and to scale, so that larger, better datasets can be produced for training these AI models.

Why does it matter?

This work is important because it addresses a key bottleneck in developing more natural and useful voice-based AI assistants: the lack of suitable training data. Better conversational AI could make interactions with technology more seamless and intuitive, improving virtual assistants and voice-controlled devices.

Abstract

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
