BUT System for the MLC-SLM Challenge

Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget

2025-06-19

Summary

This paper presents the BUT system for the MLC-SLM Challenge. It combines DiCoW, a diarization-conditioned speech recognition model, with DiariZen, a speaker diarization pipeline, to transcribe multi-speaker conversations across multiple languages.

What's the problem?

Transcribing what several people say in one recording is hard for automatic speech recognition systems, especially when the speech spans many languages and the speakers talk over each other. Previous methods struggled with this in real-world conditions.

What's the solution?

The researchers combined DiariZen, which determines when each speaker is talking even when voices overlap, with DiCoW, which uses that speaker-timing information to transcribe each speaker's speech. They fine-tuned both models on the challenge data, which improved accuracy and robustness even though the training data was inconsistent or noisy.
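
The hand-off from "who spoke when" to a speaker-aware recognizer can be illustrated with a small sketch. It is not the authors' code: the function name, the segment format, the frame size, and the four frame labels are all assumptions chosen for illustration. It only shows how diarization segments could be turned into per-frame activity labels for one target speaker.

```python
from typing import List, Tuple

# Hypothetical diarization output: (speaker_label, start_sec, end_sec).
Segment = Tuple[str, float, float]


def speaker_activity(segments: List[Segment], target: str,
                     duration: float, frame_sec: float = 0.5) -> List[str]:
    """Label each frame as seen from the target speaker's point of view."""
    n_frames = int(round(duration / frame_sec))
    labels = []
    for i in range(n_frames):
        mid = (i + 0.5) * frame_sec  # frame centre in seconds
        # Speakers whose segments cover the centre of this frame.
        active = {spk for spk, start, end in segments if start <= mid < end}
        target_on = target in active
        others_on = bool(active - {target})
        if target_on and others_on:
            labels.append("overlap")
        elif target_on:
            labels.append("target")
        elif others_on:
            labels.append("non-target")
        else:
            labels.append("silence")
    return labels


# Toy example (assumed values): two speakers who overlap between 1.5 s and 2.0 s.
segments = [("A", 0.0, 2.0), ("B", 1.5, 3.0)]
labels_a = speaker_activity(segments, target="A", duration=4.0)
labels_b = speaker_activity(segments, target="B", duration=4.0)
print(labels_a)
print(labels_b)
```

Running the labeling once per speaker gives one activity track per speaker, so a recognizer conditioned on such a track could be asked to transcribe each speaker in turn.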

Why does it matter?

Better multi-speaker, multilingual speech recognition helps in real-world applications such as meetings, broadcasts, and communication services, making it easier to follow conversations that involve several people and different languages.

Abstract

The combined DiCoW and DiariZen ASR system demonstrates strong performance in multilingual scenarios, with DiCoW preserving its multilingual capabilities and DiariZen improving through fine-tuning.
