SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, Daniel Cohen-Or
2025-10-07
Summary
This paper introduces a new way to edit images created by AI, giving users more precise control over *how* the image changes based on text instructions.
What's the problem?
Current AI image editing tools rely on text prompts, but prompts aren't very specific. If you try to change one thing about an image, other things might change unexpectedly, and it's hard to control *how much* of a change you're making. This makes it difficult to edit images in a way that feels natural and predictable.
What's the solution?
The researchers developed a method that directly tweaks the way the AI 'understands' the text instructions, working at the level of individual words, or 'tokens'. They use a tool called a Sparse Autoencoder to find the parts of the AI's internal representation that control specific image attributes. By carefully adjusting these parts, they can change one aspect of the image without disturbing others, and they can smoothly control the strength of the edit. Importantly, this method doesn't require changing the core AI image generator itself, so it can be used with many different AI systems.
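The core idea can be sketched in a few lines: encode a token embedding into the SAE's sparse latent space, nudge the latent unit tied to the target attribute by a continuous amount, and decode back. The sketch below is a minimal illustration with random stand-in weights and a hypothetical `attr_dim` index, not the paper's trained SAE or its exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 768-d token embedding, a 4096-d sparse latent space.
d_embed, d_latent = 768, 4096

# Stand-ins for a trained SAE's weights; in the real method these are learned.
W_enc = rng.standard_normal((d_embed, d_latent)) * 0.02
b_enc = np.zeros(d_latent)
W_dec = rng.standard_normal((d_latent, d_embed)) * 0.02

def sae_encode(e):
    # A ReLU keeps the latent sparse: most dimensions are exactly zero.
    return np.maximum(e @ W_enc + b_enc, 0.0)

def sae_decode(z):
    return z @ W_dec

def edit_token(e, attr_dim, strength):
    """Shift one token embedding along a single SAE latent dimension.

    attr_dim: index of the latent unit tied to the target attribute
              (hypothetical here; found by inspecting the trained SAE).
    strength: continuous edit strength; 0.0 leaves the token unchanged
              up to SAE reconstruction error.
    """
    z = sae_encode(e)
    z[attr_dim] += strength
    return sae_decode(z)

token = rng.standard_normal(d_embed)
edited = edit_token(token, attr_dim=123, strength=2.5)
assert edited.shape == token.shape
```

Because the edit happens entirely on the text-embedding side, the diffusion backbone receives an ordinary embedding and needs no modification, which is what makes the approach model-agnostic.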
Why it matters?
This research is important because it makes AI image editing much more user-friendly and powerful. It allows for more precise and intuitive control, meaning people can easily create exactly the images they envision without a lot of trial and error. This could have a big impact on fields like graphic design, art, and visual effects.
Abstract
Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.