
MONKEY: Masking ON KEY-Value Activation Adapter for Personalization

James Baker

2025-10-13


Summary

This paper focuses on improving how we personalize AI image generation, specifically with diffusion models. These models let you create images with a specific person or object in them, but sometimes they focus too much on *just* that person or object and ignore what you actually want the background or overall scene to look like.

What's the problem?

When you try to add a specific subject to an image generated by an AI, a common issue is that the AI gets stuck just copying the subject and doesn't pay attention to the rest of your instructions. A popular technique called IP-Adapter tries to fix this by automatically figuring out which parts of the image are the subject and which are the background, but it can still struggle to balance the subject with the overall scene described in the text prompt.

What's the solution?

The researchers noticed that IP-Adapter already produces a 'mask' that separates the subject from the background. They use this mask in a clever way: on a second pass, they apply it to the image tokens, restricting the reference image's influence to the subject region alone. This frees the AI to shape the *rest* of the image to match the text prompt, like adding a specific location or background. Essentially, it tells the AI, 'We're happy with the subject, now make the surroundings fit the description!'
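The masked second pass can be sketched in a few lines. This is an illustrative toy in plain NumPy, not the authors' implementation: the function names (`subject_mask`, `second_pass_attention`) and the mask heuristic are assumptions for the sketch. The idea shown is that a binary subject mask is derived from how strongly each latent position attends to the reference-image tokens, and on the second pass the image tokens' contribution is zeroed outside that mask, so background positions are shaped by the text tokens alone.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: q is (n_query, d), k/v are (n_tokens, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def subject_mask(q, k_img, threshold=0.5):
    """Toy heuristic: mark latent positions that attend strongly to the
    reference-image tokens as 'subject' (1.0), the rest as background (0.0)."""
    strength = (q @ k_img.T / np.sqrt(q.shape[-1])).mean(axis=-1)
    strength = (strength - strength.min()) / (np.ptp(strength) + 1e-8)
    return (strength > threshold).astype(q.dtype)  # shape (n_query,)

def second_pass_attention(q, k_txt, v_txt, k_img, v_img, mask):
    """Second pass: image tokens only influence masked (subject) positions;
    background positions receive the text-attention output alone."""
    txt_out = cross_attention(q, k_txt, v_txt)
    img_out = cross_attention(q, k_img, v_img)
    return txt_out + mask[:, None] * img_out

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=(8, d))                              # 8 latent positions
k_txt, v_txt = rng.normal(size=(5, d)), rng.normal(size=(5, d))  # text tokens
k_img, v_img = rng.normal(size=(3, d)), rng.normal(size=(3, d))  # image tokens

mask = subject_mask(q, k_img)
out = second_pass_attention(q, k_txt, v_txt, k_img, v_img, mask)
# Background rows (mask == 0) match pure text attention, by construction.
assert np.allclose(out[mask == 0], cross_attention(q, k_txt, v_txt)[mask == 0])
```

In the real method the mask comes from the IP-Adapter's own attention maps inside the diffusion U-Net rather than from this toy heuristic, but the masking arithmetic on the second pass has the same shape: subject pixels keep the image-token contribution, background pixels do not.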

Why it matters?

This is important because it makes personalized image generation much more useful. Now, you can reliably put a specific person into a scene and be confident that the scene will actually look like what you asked for, instead of just getting a picture of the person pasted onto a random background. It improves the alignment between the image, the subject, and the text prompt, giving users more control and better results.

Abstract

Personalizing diffusion models allows users to generate new images that incorporate a given subject, offering more control than a text prompt alone. These models often suffer from simply recreating the subject image while ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter, automatically generates masks during inference that effectively segment the subject from the background. We propose to use this automatically generated mask on a second pass to mask the image tokens, restricting them to the subject rather than the background and allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while closely matching the prompt. We compare our method to several other test-time personalization methods and find that it achieves high prompt and source-image alignment.