Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti
2025-05-22
Summary
This paper introduces Llama-SMoP, an AI model that combines audio and visual information for speech recognition (audio-visual speech recognition, or AVSR), making it more accurate and efficient without needing extra computing power at inference time.
What's the problem?
Models that understand both what people say and how their lips move tend to become slow and expensive to run, because combining audio and video with a large language model demands a lot of resources, and the cost grows as the models get bigger and more complex.
What's the solution?
The researchers created Llama-SMoP, which uses a setup called a Sparse Mixture of Projectors (SMoP): instead of one large projector connecting the audio and video encoders to the language model, it keeps several small projector "experts" and uses lightweight routers, one per modality, to activate only a few of them for each input. This adds capacity to the model while keeping the amount of computation per input, and therefore the running cost, the same.
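To make the routing idea concrete, here is a minimal sketch of one SMoP step for a single modality. All sizes, names, and the top-k choice are illustrative assumptions for this sketch, not the paper's exact configuration: a router scores every projector expert for a token, only the top-k experts run, and their outputs are mixed by the renormalized router weights.

```python
import math
import random

# Illustrative sizes (assumptions, not the paper's settings).
DIM_IN, DIM_OUT = 4, 6      # encoder feature size -> LLM embedding size
NUM_EXPERTS, TOP_K = 4, 2   # projector experts kept per token

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One small linear projector per expert, plus a router that scores experts.
# A modality-specific setup would keep a separate (experts, router) pair
# for audio and for video.
experts = [rand_matrix(DIM_OUT, DIM_IN) for _ in range(NUM_EXPERTS)]
router = rand_matrix(NUM_EXPERTS, DIM_IN)

def smop_project(token):
    """Route a token to its top-k projector experts and mix their outputs."""
    scores = matvec(router, token)
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    weights = softmax([scores[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * DIM_OUT
    for w, i in zip(weights, top):
        for j, y in enumerate(matvec(experts[i], token)):
            out[j] += w * y  # weighted sum of the selected experts' outputs
    return out

audio_token = [random.uniform(-1, 1) for _ in range(DIM_IN)]
projected = smop_project(audio_token)
print(len(projected))  # the token now matches the LLM embedding size
```

Because only TOP_K of the NUM_EXPERTS projectors run per token, total capacity can grow with NUM_EXPERTS while per-token compute stays fixed, which is the efficiency argument behind the sparse-mixture design.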
Why does it matter?
More accurate and efficient speech recognition that uses both sound and lip movement can be deployed in real-world settings, like video calls, noisy environments, or smart devices, without requiring more powerful hardware.
Abstract
Llama-SMoP is an efficient multimodal LLM that incorporates a Sparse Mixture of Projectors (SMoP) module, using modality-specific routers and experts to enhance AVSR performance without increasing inference costs.