ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
2025-03-06
Summary
This paper introduces ABC, a new AI model that combines images and text more deeply than prior approaches, making it easier to solve tasks like retrieving specific images or answering questions about pictures using natural language instructions.
What's the problem?
Current AI models that work with both images and text often embed them separately and then fuse the results. This approach leads to weak interactions between the two types of data and gives users little control over how the AI understands and represents the images.
What's the solution?
The researchers created ABC, which deeply integrates image features with natural language instructions using a vision-language model backbone, and trained it with a process designed to strengthen the connection between text and images. To evaluate how well the model follows detailed instructions during retrieval, they also designed a new benchmark called CtrlBench.
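The retrieval idea above can be sketched with toy vectors: a multimodal embedding model maps an image together with a natural-language instruction to a single vector, and retrieval is then nearest-neighbour search by cosine similarity. This is a minimal illustrative sketch using random NumPy vectors in place of ABC's actual encoder; the corpus names, vector size, and noise model are assumptions for demonstration, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy corpus of precomputed image embeddings (in practice these would
# come from the model's vision-language backbone).
rng = np.random.default_rng(0)
corpus = {name: rng.normal(size=8) for name in ["cat", "dog", "car"]}

# A multimodal query embedding: conceptually, ONE vector that jointly
# encodes the query image and the user's instruction, rather than two
# separately computed unimodal vectors fused afterwards. Here we fake it
# as a slightly perturbed copy of the "dog" embedding.
query = corpus["dog"] + 0.1 * rng.normal(size=8)

# Retrieval = rank the corpus by cosine similarity to the query.
ranked = sorted(corpus, key=lambda n: cosine(query, corpus[n]), reverse=True)
print(ranked[0])  # the perturbed "dog" query retrieves "dog"
```

The key design point the paper argues for is that the query vector is produced jointly from image and instruction, so changing the instruction changes the embedding, and thus which items are retrieved.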
Why does it matter?
This matters because it allows AI to handle more complex tasks that involve both text and images, like answering tricky questions about pictures or finding specific details in an image based on user input. This could improve tools for education, design, and even medical imaging by making them more accurate and user-friendly.
Abstract
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate a multimodal embedding model, which outputs embeddings that combine visual and natural language input. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of multimodal embeddings by offering high-quality representations and flexible natural language control. Our model and datasets are available at our project page.