Unified Vision-Language Modeling via Concept Space Alignment

Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk

2026-03-03

Summary

This paper introduces V-SONAR, an embedding space that represents vision (images and videos) and language in a single, shared space. It builds on SONAR, an earlier embedding space that covered text and speech, and extends it to visual input.

What's the problem?

Existing vision-language models often struggle to understand concepts across many different languages and types of media. They typically require a lot of training data for each language and visual task, and don't easily transfer knowledge between them. The challenge is to create a system that can understand both images/videos and text in a wide variety of languages without needing massive amounts of specific training data for each combination.

What's the solution?

The researchers built V-SONAR by taking an existing vision encoder (a model that turns images and videos into numerical representations) and aligning it with SONAR after the fact, so that visual inputs map into the same embedding space as text and speech. They then used V-SONAR to build V-LCM, an extension of the earlier Large Concept Model (LCM), and tuned it on vision-language instructions. Training uses the same next-embedding prediction objective as LCM's text-only pre-training: given a sequence of visual and textual embeddings, the model learns to predict the embedding that comes next.
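To make the alignment idea concrete, here is a minimal sketch rather than the paper's actual pipeline, which this summary does not describe in detail. It assumes the mapping is a single linear projection fitted by least squares, and it uses synthetic vectors in place of real vision features and frozen SONAR text embeddings. The function names `fit_linear_alignment` and `cosine_retrieve` are hypothetical.

```python
import numpy as np

def fit_linear_alignment(vision_feats, text_embs):
    """Least-squares projection W so that vision_feats @ W ≈ text_embs.

    Assumption: a single linear map stands in for the alignment model.
    """
    W, *_ = np.linalg.lstsq(vision_feats, text_embs, rcond=None)
    return W

def cosine_retrieve(queries, candidates):
    """Return, for each query row, the index of the most cosine-similar candidate."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)

# Synthetic stand-ins (assumption): paired caption embeddings and vision features.
rng = np.random.default_rng(0)
n_pairs, d_vision, d_text = 200, 32, 16
text_embs = rng.normal(size=(n_pairs, d_text))    # frozen text-space targets
true_map = rng.normal(size=(d_text, d_vision))    # hidden relation used to fake vision features
vision_feats = text_embs @ true_map + 0.01 * rng.normal(size=(n_pairs, d_vision))

# Fit the projection; the text side is never modified.
W = fit_linear_alignment(vision_feats, text_embs)
aligned = vision_feats @ W

# Projected vision features should now retrieve their own captions.
matches = cosine_retrieve(aligned, text_embs)
accuracy = float(np.mean(matches == np.arange(n_pairs)))
print(f"retrieval accuracy on the synthetic pairs: {accuracy:.2f}")
```

The design point this illustrates is that only the vision side is fitted. The text space stays fixed, so anything that already operates in that space, such as a text decoder, can consume the projected visual embeddings without retraining.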

Why it matters?

This work matters because it shows a way to build vision-language models that work well across many languages, including those with limited resources. V-LCM matches state-of-the-art models on common tasks such as image and video captioning and question answering, and it significantly outperforms them in 61 of the 62 languages tested, which range from rich- to low-resource. This opens up possibilities for more inclusive AI systems that are less biased towards widely spoken languages.

Abstract

We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
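The unified sequence and next-embedding setup described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: random vectors stand in for V-SONAR and SONAR embeddings, and a plain mean-squared-error loss replaces the latent diffusion objective, which is not reproduced here. The helper names are hypothetical.

```python
import numpy as np

def build_unified_sequence(vision_embs, text_embs):
    """Concatenate vision and text embeddings along the sequence axis.

    Both inputs must share one embedding dimension, which is what a
    shared embedding space provides.
    """
    if vision_embs.shape[1] != text_embs.shape[1]:
        raise ValueError("both modalities must live in the same embedding space")
    return np.concatenate([vision_embs, text_embs], axis=0)

def next_embedding_pairs(sequence):
    """Teacher-forced pairs: position t is the input, position t + 1 the target."""
    return sequence[:-1], sequence[1:]

def mse_loss(predicted, target):
    """Simplified stand-in loss (assumption); the paper uses a diffusion objective."""
    return float(np.mean((predicted - target) ** 2))

rng = np.random.default_rng(0)
dim = 8
vision_embs = rng.normal(size=(3, dim))  # e.g. three visual-segment embeddings
text_embs = rng.normal(size=(2, dim))    # e.g. two sentence embeddings

sequence = build_unified_sequence(vision_embs, text_embs)
inputs, targets = next_embedding_pairs(sequence)

# Trivial baseline predictor: repeat the current embedding as the guess for the next one.
baseline_loss = mse_loss(inputs, targets)
# A perfect predictor would reproduce the targets exactly.
perfect_loss = mse_loss(targets, targets)
print(sequence.shape, inputs.shape, baseline_loss > perfect_loss)
```

Because every element of the sequence has the same dimension, the predictor does not need to know which modality produced a given position.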
