Multimodal Policy Internalization for Conversational Agents
Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
2025-10-14
Summary
This paper introduces a new method for teaching AI assistants, like ChatGPT, to consistently follow complex instructions, especially those involving both text and images. The idea is to get these assistants to 'learn' the rules once, rather than being reminded of them on every request.
What's the problem?
Current AI assistants rely on very detailed instructions handed to them every time they perform a task. These instructions, called 'policies,' can become incredibly long and complicated, especially when they govern images or multi-step behaviors. This makes it hard for the AI to follow the rules faithfully, and it adds a fixed computational cost because the AI must reprocess the full policy with every request. Existing methods for compressing prompts or aligning models to policies haven't addressed these complex, multimodal (text and image) scenarios.
What's the solution?
The researchers introduce a new task called Multimodal Policy Internalization (MPI) and a three-stage training framework, TriMPI, to tackle it. First, continual pretraining injects the policy knowledge directly into the model's parameters. Next, supervised finetuning teaches the model to apply the policy on concrete examples. Finally, PolicyRollout, a GRPO-style reinforcement learning method, augments the rollout groups with policy-aware responses, letting the model explore different answers while staying grounded in the rules. They also built two new datasets for this training: one with synthetic scenarios and another with real-world decision-making and tool-using tasks.
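The core idea of the final stage can be sketched in a toy example. This is our own illustrative Python, not the paper's code: the names (`policy_rollout_group`, the mock `generate` and `reward` functions, the toy policy text) are hypothetical. It shows a GRPO-style rollout group where some responses are sampled with the policy text in context, so the reward signal stays grounded in the policy even though the trained model is meant to answer without seeing it.

```python
import random
import statistics

# Hypothetical policy text for a toy decision task (illustrative only).
POLICY_TEXT = "Rule: answer 'yes' only when the image contains a product."

def generate(prompt, with_policy):
    """Mock generator: policy-aware sampling follows the rule more often."""
    p_correct = 0.9 if with_policy else 0.5
    return "yes" if random.random() < p_correct else "no"

def reward(response, gold):
    """Binary reward: 1 if the response matches the gold answer."""
    return 1.0 if response == gold else 0.0

def policy_rollout_group(prompt, gold, n_plain=4, n_policy_aware=2):
    """Build one GRPO-style group: plain rollouts plus policy-aware ones,
    then compute group-relative advantages over the combined group."""
    rollouts = [generate(prompt, with_policy=False) for _ in range(n_plain)]
    rollouts += [generate(POLICY_TEXT + "\n" + prompt, with_policy=True)
                 for _ in range(n_policy_aware)]
    rewards = [reward(r, gold) for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    advantages = [(r - mean) / std for r in rewards]
    return list(zip(rollouts, advantages))

random.seed(0)
group = policy_rollout_group("Does this image show a product?", gold="yes")
for resp, adv in group:
    print(f"{resp}: advantage={adv:+.2f}")
```

In this sketch, rule-following responses earn above-average reward and thus positive advantage, so the gradient update pushes the policy-free model toward the behavior the policy-aware rollouts demonstrate.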
Why it matters?
This work is important because it allows AI assistants to be more reliable and efficient. By internalizing the rules, they don't need to be constantly given lengthy instructions, which saves computing power and makes them faster. This is especially crucial as AI assistants become more complex and are used in more diverse applications, like helping with visual tasks or managing multiple tools. The datasets and training methods they provide will also help other researchers build even better AI assistants in the future.
Abstract
Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.