Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
2025-10-13
Summary
This paper introduces a new challenge in working with advanced AI models called Multimodal Large Language Models (MLLMs), which can understand not just text, but also things like images and videos. It focuses on finding the best way to 'ask' these models questions using a combination of text and visual cues to get the most accurate answers.
What's the problem?
Currently, a lot of effort goes into figuring out the best way to phrase questions, or 'prompts,' for AI models to get good results. This process, called prompt optimization, works well for text-based AI, but it hasn't been adapted for MLLMs. Because MLLMs can process multiple types of information, optimizing only the text part of a prompt isn't enough: we also need to optimize the visual prompts, and figure out how text and visuals work together to get the best response. Finding the best combination of text and images to use as prompts by hand is difficult.
What's the solution?
The researchers developed a system called the Multimodal Prompt Optimizer (MPO). This system automatically finds the best combination of text and visual prompts. It does this in two main ways: first, it adjusts both the text and visual prompts jointly while making sure they still relate to each other (what the paper calls alignment-preserving updates). Second, it uses what it learned from previous attempts to guide its search for even better prompts, kind of like learning from past mistakes to improve future questions. Specifically, it uses a Bayesian selection strategy: scores from earlier prompt evaluations serve as prior knowledge that tells the system which candidate prompts are most promising to try next.
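To make the "earlier evaluations as priors" idea concrete, here is a minimal illustrative sketch, not the paper's exact algorithm. It assumes each candidate is a (text prompt, visual prompt) pair whose pass/fail evaluations update a Beta posterior, and uses Thompson sampling to pick the next candidate to try; the class and function names are invented for this example.

```python
import random

class PromptCandidate:
    """A hypothetical (text, visual) prompt pair with a Beta posterior
    over its success rate, updated from earlier evaluations."""

    def __init__(self, text_prompt, visual_prompt):
        self.text_prompt = text_prompt
        self.visual_prompt = visual_prompt  # e.g. an edited image; a placeholder here
        self.alpha = 1.0  # 1 + observed successes (uniform prior)
        self.beta = 1.0   # 1 + observed failures

    def record(self, correct):
        # Fold one evaluation outcome into the posterior.
        if correct:
            self.alpha += 1
        else:
            self.beta += 1

def select_candidate(candidates):
    # Thompson sampling: draw a plausible success rate from each
    # candidate's Beta posterior and pick the highest draw. Candidates
    # that scored well earlier are favored, but untested ones still
    # get a chance to be explored.
    return max(candidates, key=lambda c: random.betavariate(c.alpha, c.beta))

candidates = [
    PromptCandidate("Describe the image step by step.", "image_v1"),
    PromptCandidate("List the key objects, then answer.", "image_v2"),
]
# Pretend the second candidate did well in earlier evaluation rounds.
for _ in range(8):
    candidates[1].record(correct=True)
candidates[0].record(correct=False)

chosen = select_candidate(candidates)
```

Because the second candidate's posterior concentrates around a high success rate, the sampler will usually (though not always) choose it next, which is exactly the explore/exploit balance that lets earlier evaluations steer the search.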
Why it matters?
This work is important because it unlocks the full potential of MLLMs. By automatically optimizing prompts that include both text and images (or videos, or even molecular structures!), we can get much more accurate and useful results from these powerful AI models. This is a crucial step towards making these models truly versatile and applicable to a wider range of real-world problems.
Abstract
Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.