
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

Sai Koneru, Matthias Huck, Jan Niehues

2025-12-02


Summary

This paper introduces OmniFusion, a system that combines the strengths of multimodal foundation models, which are good at understanding inputs like images and sound, with models specifically designed for translating between languages. The goal is a faster, more accurate way to translate speech, even in real time.

What's the problem?

Current speech translation systems often work in two steps: first, converting speech to text, and then translating that text. This takes time, which is a big issue for simultaneous translation where you need a translation almost instantly as someone is speaking. Also, these systems can't use helpful information like images that might clarify what's being said, leading to potential misunderstandings.
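The two-step structure described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `transcribe` and `translate` are hypothetical stand-ins for an ASR model and a text translation model, with `time.sleep` simulating their latency.

```python
import time

def transcribe(audio_chunk: bytes) -> str:
    """Hypothetical ASR stage: audio in, text out."""
    time.sleep(0.05)  # simulated recognition latency
    return "hello world"

def translate(text: str) -> str:
    """Hypothetical text-only MT stage: source text in, target text out."""
    time.sleep(0.05)  # simulated translation latency
    return "hallo welt"

def cascaded_st(audio_chunk: bytes) -> str:
    # The two stages run strictly in sequence, so their latencies add up,
    # and the translator never sees the raw audio or any visual context.
    return translate(transcribe(audio_chunk))

print(cascaded_st(b"..."))  # "hallo welt", after both stages' delays
```

An end-to-end system like OmniFusion avoids this serial bottleneck by passing information between the models directly instead of through an intermediate transcript.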

What's the solution?

The researchers developed OmniFusion, which directly connects a model that understands images and sound (Omni 2.5-7B) to a model that's really good at translation (SeedX PPO-7B). Rather than passing text between them, they connect hidden states from multiple layers of the multimodal model to the translation model and train the combined system end to end, so the two learn to work together seamlessly. This allows the system to translate speech, speech *and* images, or even just text and images all in one go.
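The fusion idea can be sketched as follows. This is a simplified illustration under assumptions, not the paper's implementation: hidden states tapped from several layers of the multimodal model are each passed through a learned projection into the translation model's embedding space and averaged. The dimensions and the averaging choice are illustrative.

```python
import numpy as np

def fuse_hidden_states(mmfm_layers, proj_weights):
    """Project hidden states from several MMFM layers into the translation
    LLM's embedding space and average them.

    mmfm_layers:  list of (seq_len, d_mmfm) arrays, one per tapped layer.
    proj_weights: list of (d_mmfm, d_llm) matrices, learned during fusion training.
    Returns a (seq_len, d_llm) array fed to the translation LLM in place of
    (or alongside) its usual token embeddings.
    """
    fused = sum(h @ W for h, W in zip(mmfm_layers, proj_weights))
    return fused / len(mmfm_layers)

rng = np.random.default_rng(0)
d_mmfm, d_llm, seq = 3584, 4096, 10   # dims are made up for the example
layers = [rng.normal(size=(seq, d_mmfm)) for _ in range(3)]
projs = [rng.normal(size=(d_mmfm, d_llm)) * 0.01 for _ in range(3)]

fused = fuse_hidden_states(layers, projs)
print(fused.shape)  # (10, 4096)
```

Because the projections are trained jointly with both models, the translation side learns to read the multimodal side's internal representations directly, without an intermediate transcript.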

Why it matters?

OmniFusion is important because it makes speech translation faster – reducing delay by about a second in real-time translation – and more accurate by allowing the system to consider visual cues. This is a step towards more natural and effective communication across languages, especially in situations where quick and clear understanding is crucial.

Abstract

There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves the overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion.