M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen

2026-01-06

Summary

This paper investigates how well we can actually remove specific concepts from images created by AI, focusing on making sure those concepts *stay* removed even when someone tries to bring them back through different methods. It introduces a new way to test these removal techniques and a new technique to make them more reliable.

What's the problem?

AI image generators are amazing, but they can sometimes create things we don't want them to, like harmful images or content that violates copyright. Researchers are trying to 'erase' specific concepts from these models, but most methods only focus on the text prompt you give the AI. The problem is that there are other ways to influence the image, like tweaking the internal 'code' the AI uses to build the image, and these alternative inputs can easily bypass the concept erasure, bringing the unwanted concept back. Essentially, current methods aren't robust enough because they don't consider all the ways someone could try to recreate the unwanted concept.

What's the solution?

The researchers created a new testing framework called M-ErasureBench that checks how well concept erasure works across three different ways of interacting with the AI: through text, through the AI's internal 'understanding' of concepts (embeddings), and through the raw data the AI uses to create images (inverted latents). They found existing methods were good with text but failed with the other two. To fix this, they developed a new module called IRECE. IRECE works by finding where the unwanted concept is 'hidden' within the AI's internal data and then subtly changing that data during image creation to prevent the concept from reappearing. It's like a safeguard that works during the image generation process.
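To make the idea concrete, here is a minimal sketch of the kind of operation IRECE performs: use a cross-attention map to locate where the target concept lives in the latent, then add noise only there during denoising. This is an illustrative toy, not the paper's implementation; the function names, threshold, and noise scale are all assumptions.

```python
import numpy as np

def concept_mask(attn_map, threshold=0.5):
    """Binarize a cross-attention map for the target-concept token.

    `attn_map` is assumed to be an (H, W) array of attention weights
    between the target token and spatial latent positions.
    """
    norm = (attn_map - attn_map.min()) / (np.ptp(attn_map) + 1e-8)
    return norm > threshold

def perturb_latents(latents, attn_map, noise_scale=0.3, threshold=0.5, rng=None):
    """IRECE-style step (illustrative): inject noise only where the
    cross-attention map localizes the erased concept, leaving the rest
    of the latent untouched so overall visual quality is preserved."""
    rng = rng or np.random.default_rng(0)
    mask = concept_mask(attn_map, threshold)            # (H, W) boolean
    noise = rng.normal(0.0, noise_scale, latents.shape)
    return np.where(mask[None, :, :], latents + noise, latents)

# Toy example: 4-channel 8x8 latent, concept localized in the top-left quadrant
latents = np.zeros((4, 8, 8))
attn = np.zeros((8, 8))
attn[:4, :4] = 1.0
out = perturb_latents(latents, attn)
```

In a real diffusion pipeline this perturbation would be applied at each (or selected) denoising steps, so the concept cannot re-form even from an inverted latent.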

Why it matters?

This work is important because it shows that simply erasing concepts from text prompts isn't enough to guarantee safety and control in AI image generation. By identifying these weaknesses and offering a solution like IRECE, the researchers are helping to build more reliable and trustworthy AI systems that are less likely to generate harmful or unwanted content. The new testing framework also provides a standard way to evaluate and improve these safety measures in the future.

Abstract

Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
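The abstract's headline metric, Concept Reproduction Rate (CRR), is simply the fraction of generations in which the supposedly erased concept re-emerges. A minimal sketch, assuming an arbitrary boolean concept detector (the paper's exact detector is not specified here; a CLIP-based classifier is a common choice for this kind of check):

```python
import numpy as np

def concept_reproduction_rate(images, detector):
    """CRR: fraction of generated images in which the erased concept
    is detected. `detector` is any callable returning True when the
    target concept is present in an image."""
    if not images:
        return 0.0
    hits = sum(1 for img in images if detector(img))
    return hits / len(images)

# Toy usage with a mock detector that flags bright images
imgs = [np.full((8, 8), v) for v in (0.2, 0.7, 0.9, 0.1)]
crr = concept_reproduction_rate(imgs, lambda im: im.mean() > 0.5)
```

Under this reading, the reported numbers mean that white-box latent inversion recovers the erased concept in over 90% of generations for existing methods, and that IRECE cuts that rate by up to 40 percentage-point-scale reductions while keeping image quality intact.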