
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic

2025-10-07


Summary

This paper introduces a new method, MoME, to improve audio-visual speech recognition (AVSR) by making large language models more efficient and adaptable, especially when computing power is limited.

What's the problem?

Current large language models are very good at understanding speech from both audio and video, but they require a lot of computing power. Existing methods that compress the information these models process force you to fix a single compression level in advance, which isn't ideal because sometimes you need more detail and sometimes you don't. Previous attempts to handle multiple compression levels train each level independently, so the model can't share what it learns across levels; this makes it less accurate when heavily compressed and harder to understand *why* it makes certain decisions.

What's the solution?

The researchers developed MoME, which stands for Mixture of Matryoshka Experts. It adds a 'smart routing' component to an existing (frozen) language model: for each piece of input, the router dynamically picks a small number of specialized 'expert' sub-networks to activate, allocating more capacity where it is needed. Importantly, the same router is shared across all compression levels, so what the model learns from detailed, lightly compressed inputs also helps it when the input is heavily compressed. This lets the model adjust its compression on the fly and perform well even with limited resources.
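The routing idea above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the dimensions, the plain linear "experts", and the class name `MoMELayer` are all assumptions. The point it shows is that one shared router scores experts per token, a small top-k subset is activated alongside an always-on shared expert, and the very same router is reused no matter how compressed the token sequence is.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMELayer:
    """Hypothetical sketch of a mixture-of-experts layer with a shared router:
    each token activates its top-k routed experts plus one shared expert,
    and the router weights are scale-agnostic (reused at every granularity)."""
    def __init__(self, dim, num_experts=4, top_k=2):
        self.top_k = top_k
        # toy stand-ins: one linear map per routed expert, plus a shared expert
        self.experts = [rng.normal(0, 0.02, (dim, dim)) for _ in range(num_experts)]
        self.shared_expert = rng.normal(0, 0.02, (dim, dim))
        self.router = rng.normal(0, 0.02, (dim, num_experts))  # shared across scales

    def __call__(self, tokens):
        # tokens: (seq_len, dim) at ANY compression level; same router for all
        logits = tokens @ self.router
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]         # top-k expert indices per token
        gates = softmax(np.take_along_axis(logits, topk, axis=-1))  # renormalised gate weights
        out = tokens @ self.shared_expert                           # shared expert always active
        for i, tok in enumerate(tokens):
            for g, e in zip(gates[i], topk[i]):
                out[i] += g * (tok @ self.experts[e])               # sparse routed contribution
        return out

layer = MoMELayer(dim=16)
full = rng.normal(size=(12, 16))   # full-granularity token sequence
compressed = full[::4]             # e.g. a 4x-compressed sequence of the same inputs
assert layer(full).shape == (12, 16)
assert layer(compressed).shape == (3, 16)
```

Because only `top_k` routed experts run per token, the compute cost grows with the number of *active* experts rather than the total parameter count, which is what makes the sparse-MoE design resource-friendly.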

Why it matters?

MoME is important because it makes advanced speech recognition technology more practical for real-world applications where computing power is limited, like on phones or in noisy environments. It achieves better accuracy than previous methods while using fewer resources, and it also provides a way to understand *how* the model is making its decisions, which is crucial for building trustworthy AI systems.

Abstract

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
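The "multiple token granularities" the abstract refers to can be pictured as one encoder output pooled at several rates, so a single model can be trained on, and served at, any of those scales. A minimal sketch, assuming simple average pooling (the function name, rates, and pooling operator are illustrative assumptions, not the paper's exact compression scheme):

```python
import numpy as np

def matryoshka_scales(tokens, rates=(1, 2, 4, 8)):
    """Toy illustration of Matryoshka-style multi-granularity tokens:
    average-pool the same encoder output at several rates, yielding one
    sequence per compression level from a single forward pass."""
    seq_len, dim = tokens.shape
    scales = {}
    for r in rates:
        trim = seq_len - seq_len % r  # drop the ragged tail so pooling is clean
        scales[r] = tokens[:trim].reshape(-1, r, dim).mean(axis=1)
    return scales

tokens = np.random.default_rng(1).normal(size=(100, 8))  # e.g. audio-visual tokens
scales = matryoshka_scales(tokens)
# at inference time, pick whichever rate fits the compute budget
assert scales[1].shape == (100, 8)
assert scales[4].shape == (25, 8)
```

Under this view, the shared router described in the abstract sees tokens from every rate during training, which is what lets heavily compressed sequences reuse expert representations learned at lower compression.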