Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

2025-12-04

Summary

This paper introduces Jina-VLM, a compact AI model that understands both images and text and can answer questions about images, with strong performance across many languages.

What's the problem?

Existing vision-language models, which answer questions about what an image shows, often trade off accuracy against efficiency, and they perform noticeably worse on questions asked in languages other than English. Many of the most capable models are also very large, making them difficult to run without significant computing resources.

What's the solution?

The researchers created Jina-VLM, a 2.4 billion parameter model, by combining a strong image-understanding component, the SigLIP2 vision encoder, with a capable language model, Qwen3. The two are joined by an attention-pooling connector: instead of passing every image patch token to the language model, a small set of learned queries attends over the patch tokens and condenses them into far fewer tokens. This lets the model handle images of arbitrary resolution while keeping the number of visual tokens, and therefore the processing cost, manageable, so it can efficiently analyze images and answer questions about them in many languages.
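The pooling idea can be illustrated with a simplified single-head sketch. Note that the actual Jina-VLM connector (number of heads, query count, projection layers) is not detailed here, so all dimensions and names below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_tokens, queries, Wk, Wv):
    """Condense N patch tokens into K pooled tokens via cross-attention.

    patch_tokens: (N, d) outputs of the vision encoder (e.g. N = 196 patches)
    queries:      (K, d) learned query vectors, with K << N
    Wk, Wv:       (d, d) key/value projection matrices (learned in practice)
    Returns:      (K, d) pooled tokens handed to the language model.
    """
    keys = patch_tokens @ Wk                      # (N, d)
    values = patch_tokens @ Wv                    # (N, d)
    d = queries.shape[1]
    scores = queries @ keys.T / np.sqrt(d)        # (K, N) scaled dot-product
    attn = softmax(scores, axis=-1)               # each query distributes over patches
    return attn @ values                          # (K, d) weighted sums of patch values

# Example: pool 196 patch tokens down to 64 tokens of width 32.
rng = np.random.default_rng(0)
pooled = attention_pool(
    rng.standard_normal((196, 32)),
    rng.standard_normal((64, 32)),
    np.eye(32), np.eye(32),
)
```

Because the query count K is fixed, the language model sees the same number of visual tokens regardless of how many patches a high-resolution image produces, which is where the token efficiency comes from.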

Why it matters?

Jina-VLM represents a step forward because it achieves state-of-the-art multilingual visual question answering among open models of similar (roughly 2B parameter) size. Being relatively small, it is more accessible and practical for a wider range of applications, such as helping people understand images or powering more capable virtual assistants.

Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.