Drax: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

2025-10-08

Summary

This paper introduces Drax, a new approach to automatic speech recognition (ASR) that uses a technique called discrete flow matching. It aims to make ASR faster and more efficient without sacrificing accuracy.

What's the problem?

Diffusion and flow-based language models have proven strong at generating text, but they remain largely unexplored for speech recognition. Most ASR systems decode autoregressively, producing the transcript one token at a time, which makes inference slow. Non-autoregressive alternatives can decode in parallel, but they are typically trained on randomly corrupted text that looks nothing like the partially wrong hypotheses they actually encounter during iterative decoding, so a mismatch between training and inference hurts accuracy on real-world speech.
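The speed argument above can be made concrete with a toy cost model. This is an illustrative sketch, not the Drax implementation: an autoregressive decoder needs one model call per output token, while a non-autoregressive decoder updates every position in parallel over a small, fixed number of refinement passes.

```python
# Toy comparison of decoding costs (illustrative only, not from the paper).

def autoregressive_steps(transcript_len: int) -> int:
    # One sequential model call per output token.
    return transcript_len

def nar_steps(transcript_len: int, refinement_passes: int = 8) -> int:
    # Each pass updates all positions at once, so the number of
    # model calls is independent of transcript length.
    return refinement_passes

T = 120  # e.g., a 120-token transcript
print(autoregressive_steps(T))  # 120 sequential model calls
print(nar_steps(T))             # 8 parallel refinement passes
```

With long transcripts the gap grows linearly, which is why parallel decoding is attractive for ASR latency.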

What's the solution?

The researchers developed Drax, a discrete flow matching framework that learns a guided path from noisy text to the correct transcript, conditioned on the audio. Instead of starting from purely random noise, Drax trains on intermediate sequences that resemble the errors a recognizer would plausibly make during decoding. It's like practicing common mistakes to get better at correcting them. Because this path is designed to mimic the states the model actually visits at inference time, it closes the gap between training and real-world use. The authors also provide a theoretical analysis showing why this design helps the model generalize better to unseen speech.
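The corruption idea can be sketched in a few lines. This is a hypothetical illustration of the audio-conditioned probability path, not the paper's actual code: the names `CONFUSIONS` and `corrupt` are invented, and a real system would derive confusions from acoustics rather than a hand-written table. The point is that intermediate training states are built from plausible recognition errors instead of uniform random tokens.

```python
import random

# Toy acoustic confusion table (hypothetical): words an ASR system
# might plausibly substitute for one another.
CONFUSIONS = {
    "there": ["their", "they're"],
    "two": ["to", "too"],
    "night": ["knight"],
}

def corrupt(target_tokens, t, rng):
    """Sample an intermediate sequence at noise level t in [0, 1].

    t = 0 returns the clean target; t = 1 swaps every token that has
    a known confusion. The model is trained to recover the clean
    target from such intermediates, so training states resemble the
    partially wrong hypotheses seen during iterative decoding.
    """
    out = []
    for tok in target_tokens:
        if tok in CONFUSIONS and rng.random() < t:
            out.append(rng.choice(CONFUSIONS[tok]))
        else:
            out.append(tok)
    return out

rng = random.Random(0)
clean = ["there", "are", "two", "stars", "out", "at", "night"]
print(corrupt(clean, 0.0, rng))  # unchanged at t = 0
print(corrupt(clean, 1.0, rng))  # every confusable token swapped
```

During training, a noise level t is sampled per example, so the model sees the whole trajectory from heavily corrupted to nearly clean transcripts.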

Why it matters?

This work is important because it demonstrates a new, promising way to build ASR systems that are both accurate and fast. Using discrete flow matching, Drax matches the recognition accuracy of state-of-the-art speech models while offering a better accuracy-efficiency trade-off. This could translate into faster, more responsive voice assistants, transcription services, and other applications that depend on accurate, low-latency speech recognition.

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.