Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation
Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
2025-11-06
Summary
This paper introduces a new method, M-Solomon, for improving how search engines understand what you're looking for, especially when dealing with both text and images. It focuses on intelligently deciding when to add extra information to your search query to get better results.
What's the problem?
Currently, many search systems try to improve queries by automatically adding related terms. While this can help, it slows down the search process because *every* query gets modified. Also, adding extra terms doesn't *always* improve results – sometimes it can actually make them worse. Existing methods haven't been tested with searches that involve images as well as text.
What's the solution?
M-Solomon solves this by learning to predict which queries *need* extra information. It looks at a bunch of example searches and divides them into two groups: those that benefit from added terms and those that don't. When a query needs help, M-Solomon uses a powerful AI model to generate relevant additions. Otherwise, it skips the extra step. It uses special signals, like '/augment' or '/embed', to tell the system what to do.
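The adaptive decision described above can be sketched in code. This is a minimal illustrative mock-up, not the paper's implementation: the generation step, the heuristic it uses, and the embedding function are all stand-in assumptions showing how the '/augment' and '/embed' signals could route a query.

```python
# Hypothetical sketch of M-Solomon-style adaptive query augmentation at
# inference time. All function names and the toy heuristic are assumptions
# for illustration; the real system uses a trained multimodal LLM embedder.

def generate_prefix(query: str) -> str:
    """Stand-in for the embedder's generation step: emits either
    '/augment <extra context>' or the plain string '/embed'."""
    # Toy rule: treat short, ambiguous queries as needing augmentation.
    if len(query.split()) < 3:
        return f"/augment background context for '{query}'"
    return "/embed"

def embed(text: str) -> list[float]:
    """Stand-in embedding: a tiny deterministic character-based vector."""
    return [ord(c) % 7 / 7.0 for c in text[:8]]

def adaptive_embed(query: str) -> list[float]:
    decision = generate_prefix(query)
    if decision.startswith("/augment"):
        augmentation = decision[len("/augment"):].strip()
        # Augmented path: embed the query together with the generated text.
        return embed(query + " " + augmentation)
    # '/embed' path: skip the generation overhead and embed directly.
    return embed(query)
```

The key point the sketch captures is that generation cost is only paid on the '/augment' branch; queries routed to '/embed' go straight to the embedding step, which is where the latency savings come from.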
Why does it matter?
This research is important because it makes search systems faster and more accurate. By only adding information when it's truly helpful, M-Solomon avoids unnecessary delays and improves the quality of search results, and it does this effectively even when searching with both images and text.
Abstract
Query augmentation makes queries more meaningful by appending further information to them before retrieving relevant documents. Recent studies have proposed Large Language Model (LLM)-based embedders, which learn representations for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLMs. During inference, these jointly trained embedders conduct query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency, and query augmentation can be detrimental to performance for some queries. Moreover, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level: one includes queries that require augmentation, and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for the others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, while providing much lower embedding latency.
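The dataset-level split and target construction described in the abstract can be sketched as follows. This is a hedged illustration under assumed names: the function, its arguments, and the way groups are flagged are not the paper's code, only a plausible shape for building the generation targets.

```python
# Hypothetical sketch of constructing training targets for adaptive query
# augmentation. Queries from datasets flagged as needing augmentation get a
# synthetic augmentation (produced elsewhere by an MLLM) prefixed with
# '/augment'; all other queries get the literal target string '/embed'.

def build_targets(queries: list[str],
                  synthetic_augmentations: list[str],
                  needs_augmentation: bool) -> list[str]:
    if needs_augmentation:
        # One synthetic augmentation per query, marked with the prefix.
        assert len(queries) == len(synthetic_augmentations)
        return [f"/augment {aug}" for aug in synthetic_augmentations]
    # Datasets in the no-augmentation group share a single fixed target.
    return ["/embed"] * len(queries)
```

Because the split is made at the dataset level, the flag applies to every query in a dataset at once; the model then learns from these targets when to emit an augmentation and when to emit '/embed'.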