BOOM: Beyond Only One Modality – KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues
2025-12-03
Summary
This paper introduces BOOM, a system designed to make online lectures accessible to students who speak different languages by automatically translating everything: the spoken words, the text on the slides, and even the audio itself, which is regenerated in the student's language.
What's the problem?
As more and more educational content becomes available online and reaches a global audience, it’s a big challenge to adapt this content for students who don’t speak the original language. Lectures aren’t just spoken words; they use visuals like slides to help explain things. Simply translating the speech isn’t enough because students also need the slides translated and synchronized with the new audio to fully understand the material.
What's the solution?
The researchers created BOOM, a complete translation system. Given a lecture's audio and slides, it transcribes and translates the spoken words, translates the text on the slides while preserving the original images and layout, and generates new audio in the target language. All of this happens jointly, so the translated speech, text, and slides stay in sync. The researchers also released the slide translation code publicly and integrated it into a larger lecture translation tool.
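To make the data flow concrete, the cascade described above can be sketched in a few lines of Python. This is a toy illustration only: every function name, type, and the dictionary-based "translation" are hypothetical placeholders, not BOOM's actual API or models.

```python
from dataclasses import dataclass

# Hypothetical sketch of a cascaded lecture-translation pipeline
# (ASR -> text translation -> slide translation -> TTS, kept in sync
# via segment timestamps). Each stage is a stub, not a real model.

# Toy word-level English->German dictionary standing in for real MT.
TOY_DICT = {"welcome": "willkommen", "to": "zur", "the": "", "lecture": "vorlesung"}


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed or translated text


def transcribe(audio: list[float]) -> list[Segment]:
    """Stub ASR: pretend the audio yields one timed segment."""
    return [Segment(0.0, 5.0, "welcome to the lecture")]


def translate_text(segments: list[Segment], target_lang: str) -> list[Segment]:
    """Stub MT: dictionary lookup, preserving each segment's timestamps."""
    out = []
    for seg in segments:
        words = [TOY_DICT.get(w, w) for w in seg.text.lower().split()]
        out.append(Segment(seg.start, seg.end, " ".join(w for w in words if w)))
    return out


def translate_slide(slide_text: str, target_lang: str) -> str:
    """Stub slide translation: only the text is handled here; layout and
    image preservation are out of scope for this sketch."""
    words = [TOY_DICT.get(w, w) for w in slide_text.lower().split()]
    return " ".join(w for w in words if w)


def synthesize(segments: list[Segment]) -> list[tuple[float, str]]:
    """Stub TTS: (start_time, audio) pairs aligned to the translated segments."""
    return [(seg.start, f"<audio:{seg.text}>") for seg in segments]


def translate_lecture(audio: list[float], slides: list[str], target_lang: str = "de"):
    """Run the full cascade and return all three synchronized modalities."""
    segments = transcribe(audio)
    translated = translate_text(segments, target_lang)
    new_slides = [translate_slide(s, target_lang) for s in slides]
    new_audio = synthesize(translated)
    return translated, new_slides, new_audio
```

Because every stage carries the segment timestamps through, the synthesized audio can be aligned back to the slide being shown at that moment, which is the synchronization property the paper emphasizes.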
Why it matters?
This work is important because it helps break down language barriers in education, making learning materials accessible to a wider range of students. By translating all parts of a lecture – audio, text, and visuals – BOOM aims to provide a learning experience that’s as close as possible to the original, and it can even improve other tasks like automatically summarizing lectures or answering questions about them.
Abstract
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.