Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
2025-03-11
Summary
This paper introduces Zero-AVSR, an AI system that can recognize speech in languages it was never trained on by watching lip movements and listening to audio, using romanized text (Roman letters) as a language-agnostic intermediate step.
What's the problem?
Current speech recognition systems need lots of training data for each language and struggle with languages they haven’t seen before, making them less useful for rare or low-resource languages.
What's the solution?
Zero-AVSR first converts speech into Roman letters (a rough phonetic spelling) using both the audio and the video of lip movements, then uses a large language model (like the ones behind ChatGPT) to convert those letters into the target language's actual writing system, even for languages the speech model has never seen.
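As a rough illustration of the cascaded design, the second stage boils down to asking an LLM to "de-romanize" the recognized Roman text into the target script. The function name and prompt wording below are illustrative assumptions, not the paper's actual prompts:

```python
# Toy sketch of Cascaded Zero-AVSR's second stage: converting Roman text
# produced by the AV-Romanizer into language-specific graphemes via an LLM.
# The prompt template and example input are hypothetical.

def build_deromanization_prompt(roman_text: str, target_language: str) -> str:
    """Build an instruction asking an LLM to convert romanized speech
    output into the target language's native writing system."""
    return (
        f"The following is romanized {target_language} speech: "
        f"'{roman_text}'. Convert it into the {target_language} script, "
        "correcting obvious recognition errors."
    )

# Example: Roman text the AV-Romanizer might emit for Korean speech.
prompt = build_deromanization_prompt("annyeonghaseyo", "Korean")
print(prompt)
```

In practice this prompt would be sent to a multilingual LLM, whose broad knowledge of world scripts is what lets the system cover languages absent from the speech training data.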
Why it matters?
This could help break language barriers in real-time translation, accessibility tools for the hearing-impaired, and communication apps, especially for languages with limited digital resources.
Abstract
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
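The "directly integrating" step in the unified Zero-AVSR amounts to projecting the AV-Romanizer's speech representations into the LLM's embedding space through a small adapter, which is then finetuned together with the LLM. A minimal pure-Python sketch of such a projection (dimensions, names, and initialization are assumptions for illustration, not the paper's architecture):

```python
import random

# Toy adapter: one linear layer mapping a speech-feature frame (dim 4)
# into an LLM embedding space (dim 6). In the unified Zero-AVSR this
# adapter and the LLM are finetuned jointly under a multi-task scheme;
# the sizes here are illustrative only.

random.seed(0)
SPEECH_DIM, LLM_DIM = 4, 6
W = [[random.gauss(0, 0.1) for _ in range(LLM_DIM)]
     for _ in range(SPEECH_DIM)]

def adapt(frame):
    """Project one speech-feature frame into the LLM embedding space."""
    return [sum(frame[i] * W[i][j] for i in range(SPEECH_DIM))
            for j in range(LLM_DIM)]

# One fake speech frame becomes one pseudo-token embedding for the LLM.
frame = [0.5, -1.0, 0.25, 2.0]
embedding = adapt(frame)
print(len(embedding))  # 6
```

Each adapted frame would be fed to the LLM alongside ordinary text tokens, letting the model condition its grapheme prediction directly on the audio-visual signal rather than only on the intermediate Roman text.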