CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

2026-04-06

Summary

This paper explores how to improve vision-language models, which are AI systems that can understand both images and text, by combining different ways of processing images.

What's the problem?

Current vision-language models usually rely on a single method for understanding images, typically one trained by matching images to their text descriptions. While this works well for aligning images with text, it can miss fine-grained details within an image. Other image-processing techniques are better at recognizing objects and understanding complex scenes, but they aren't typically used in these models.

What's the solution?

The researchers developed a new framework called CoME-VL that merges the strengths of two different image-processing approaches: a contrastive (CLIP-style) encoder that aligns images and text, and a self-supervised (DINO-style) encoder that excels at detailed visual understanding. They combine the information from both while reducing redundancy and ensuring the two feature sets work well together. The combined visual tokens are then fed into a language model to make predictions, as in the sketch below.
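To make the idea concrete, here is a minimal PyTorch sketch of the core fusion step: tokens from a contrastive encoder and a self-supervised encoder are projected into a shared space, and a small set of learned queries cross-attends over both streams to produce compact fused tokens for the language model. All class names and dimensions here are illustrative assumptions, not the paper's API; the actual CoME-VL also uses entropy-guided layer aggregation and RoPE-enhanced attention, which this sketch omits.

```python
# Hedged sketch of two-encoder fusion; names and sizes are assumptions.
import torch
import torch.nn as nn

class TwoEncoderFusion(nn.Module):
    def __init__(self, clip_dim=1024, dino_dim=768, fused_dim=1024,
                 num_fused_tokens=64, num_heads=8):
        super().__init__()
        # Project both encoders' tokens into a shared embedding space.
        self.clip_proj = nn.Linear(clip_dim, fused_dim)
        self.dino_proj = nn.Linear(dino_dim, fused_dim)
        # Learned queries attend over the concatenated token streams and
        # distill them into a compact set of fused visual tokens.
        self.queries = nn.Parameter(torch.randn(num_fused_tokens, fused_dim))
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads,
                                                batch_first=True)

    def forward(self, clip_tokens, dino_tokens):
        # clip_tokens: (B, N_clip, clip_dim); dino_tokens: (B, N_dino, dino_dim)
        kv = torch.cat([self.clip_proj(clip_tokens),
                        self.dino_proj(dino_tokens)], dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)  # (B, num_fused_tokens, fused_dim)
        return fused  # injected into the LLM like ordinary visual tokens
```

In this shape, the fused tokens slot into a standard decoder-only VLM pipeline with minimal changes, which matches the paper's stated goal of a modular, drop-in fusion module.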

Why it matters?

This research is important because it significantly boosts the performance of vision-language models on tasks like understanding what's happening in an image and accurately identifying objects within it. By combining different image processing techniques, the models become more accurate and robust, leading to better AI systems that can truly 'see' and understand the world around them.

Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
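The abstract's first fusion ingredient can be sketched in isolation. Below is a hedged guess at what entropy-guided multi-layer aggregation with orthogonality-constrained projections might look like: per-layer encoder features are projected, weighted by an entropy-based score, and summed, while a penalty keeps each projection near-orthogonal to reduce redundancy. The specific weighting rule and penalty form are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of entropy-guided multi-layer aggregation; the scoring and
# regularizer below are illustrative guesses, not CoME-VL's exact equations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntropyGuidedAggregation(nn.Module):
    def __init__(self, num_layers, dim):
        super().__init__()
        # One projection per encoder layer, regularized toward orthogonality.
        self.projs = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_layers)])

    def forward(self, layer_feats):
        # layer_feats: list of (B, N, dim) hidden states, one per layer.
        projected = [p(f) for p, f in zip(self.projs, layer_feats)]
        # Score each layer by the negative entropy of its token-norm
        # distribution, so layers with more concentrated activations get
        # larger weights (this particular rule is an assumption).
        scores = []
        for f in projected:
            p = F.softmax(f.norm(dim=-1), dim=-1)            # (B, N)
            ent = -(p * (p + 1e-8).log()).sum(dim=-1)        # (B,)
            scores.append(-ent)
        w = F.softmax(torch.stack(scores, dim=-1), dim=-1)   # (B, L)
        stacked = torch.stack(projected, dim=1)              # (B, L, N, dim)
        return (w[:, :, None, None] * stacked).sum(dim=1)    # (B, N, dim)

    def orthogonality_penalty(self):
        # Encourages each projection W to satisfy W @ W.T ~ I, which
        # discourages redundant directions across aggregated layers.
        loss = 0.0
        for p in self.projs:
            W = p.weight
            eye = torch.eye(W.size(0), device=W.device)
            loss = loss + ((W @ W.t() - eye) ** 2).sum()
        return loss
```

In training, `orthogonality_penalty()` would be added to the task loss with a small coefficient; the aggregated features would then feed the cross-attention fusion stage sketched earlier.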