Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
2025-08-20

Summary
This paper introduces ColorCtrl, a new way to change colors in images and videos from text instructions that is more precise and consistent than previous methods.
What's the problem?
It is hard to change just the colors in an image or video without disturbing other things, like object shapes or how light behaves. Existing methods also struggle to control the exact color change and to keep the edited parts looking consistent with the rest of the image or video.
What's the solution?
ColorCtrl builds on a Multi-Modal Diffusion Transformer (MM-DiT), a type of AI model that processes images and text together. By adjusting how the model attends to different parts of an image (its attention maps) and the features it mixes together (its value tokens), the method separates structure from color. This lets it recolor accurately from a text prompt, control how strong each attribute change is at the level of individual words, and touch only the regions the prompt mentions.
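To make the attention-map/value-token idea concrete, here is a minimal sketch of structure-preserving attention injection in an MM-DiT-style joint attention block: an edit branch reuses the reconstruction branch's attention maps so layout and geometry stay fixed, while its value tokens carry the new appearance. The function name, arguments, and the `value_blend` knob are hypothetical illustrations under these assumptions, not ColorCtrl's actual interface or its exact injection rules.

```python
# Illustrative sketch only: structure via shared attention maps, color via value tokens.
import torch


def joint_attention_with_injection(
    q_edit, k_edit, v_edit,      # projections from the edit branch (target prompt)
    q_src, k_src, v_src,         # projections from the reconstruction branch (source prompt)
    inject_attn: bool = True,    # reuse source attention maps to keep structure/geometry
    value_blend: float = 0.0,    # 0 -> pure edit values (full color change), 1 -> pure source values
):
    """All tensors are (batch, heads, tokens, dim); text and image tokens are concatenated."""
    scale = q_edit.shape[-1] ** -0.5

    # Attention maps encode "where each token looks", i.e. structure and layout.
    attn_src = torch.softmax(q_src @ k_src.transpose(-2, -1) * scale, dim=-1)
    attn_edit = torch.softmax(q_edit @ k_edit.transpose(-2, -1) * scale, dim=-1)
    attn = attn_src if inject_attn else attn_edit

    # Value tokens carry appearance; blending them modulates how strong the edit is.
    v = (1.0 - value_blend) * v_edit + value_blend * v_src

    return attn @ v
```

In this toy version, keeping `inject_attn=True` pins down geometry while the prompt-driven value tokens change the color, and sweeping `value_blend` gives a crude intensity dial; the paper's method is more targeted about which layers, timesteps, and tokens are manipulated.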
Why it matters?
It gives artists and creators an easier, more reliable way to edit the colors of images and videos exactly as intended. The results look more natural and consistent, surpassing some professional tools, and the approach carries over to different AI models for both images and videos.
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
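As a companion to the abstract's mention of word-level control of attribute intensity, the sketch below shows one plausible way such control could be realized: rescaling how strongly image tokens attend to a single attribute word and renormalizing the attention rows. The function, the single-token targeting, and the renormalization step are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch: scale the attention paid to one prompt word to dial an attribute up or down.
import torch


def rescale_word_attention(attn, word_index: int, strength: float):
    """attn: (batch, heads, query_tokens, key_tokens) post-softmax attention map.

    Multiplies the column of the chosen prompt token by `strength` (e.g. 0.5 to
    weaken, 2.0 to amplify the word's influence) and renormalizes each row so
    the attention weights still sum to one.
    """
    attn = attn.clone()
    attn[..., word_index] = attn[..., word_index] * strength
    return attn / attn.sum(dim=-1, keepdim=True)
```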