Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, Lionel M. Ni, Jingren Zhou, Junyang Lin, Chenfei Wu

2025-12-18

Summary

This paper introduces a new way for AI to edit images that works much more like how professional designers use tools such as Photoshop. It presents a system that can break a single image down into separate, editable layers.

What's the problem?

Current AI image editing tools often struggle to make changes consistently. When you try to edit one part of an image, it can unintentionally mess up other parts because the image is treated as one solid piece of information. Think of it like trying to erase something on a drawing without smudging the rest – it's hard! Professional design software avoids this by using layers, where each element is on its own separate sheet, allowing for isolated edits.

What's the solution?

The researchers developed a system called Qwen-Image-Layered, which uses artificial intelligence to automatically separate a single image into multiple layers, similar to how Photoshop organizes a design. It learns to recognize the different parts of an image and put each one on its own transparent 'sheet'. To make this work, they built special components that can handle a varying number of layers per image, along with a multi-stage training process that turns a pretrained image generator into a layer decomposer. And to give the AI enough material to learn from, they created a dataset of layered images extracted from real Photoshop files.
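
Once an image is split into RGBA layers, recombining them and making isolated edits is just standard alpha compositing (the Porter-Duff 'over' operator). The NumPy sketch below is illustrative only, not the paper's code: it composites a toy two-layer scene, moves one layer, and checks that pixels untouched by either version of that layer come out identical.

```python
import numpy as np

def composite_over(layers):
    """Alpha-composite a back-to-front list of RGBA layers (H, W, 4),
    values in [0, 1], using the standard Porter-Duff 'over' operator."""
    canvas = np.zeros_like(layers[0][..., :3])
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        canvas = rgb * alpha + canvas * (1.0 - alpha)
    return canvas

h, w = 64, 64
background = np.ones((h, w, 4))                 # opaque white sheet
sticker = np.zeros((h, w, 4))                   # transparent sheet ...
sticker[16:48, 16:48] = [1.0, 0.0, 0.0, 1.0]    # ... with a red square

original = composite_over([background, sticker])

# Edit only the sticker layer: shift it 10 pixels to the right.
sticker_moved = np.roll(sticker, shift=10, axis=1)
edited = composite_over([background, sticker_moved])

# Columns the square never covered are identical in both renders:
assert np.allclose(original[:, :10], edited[:, :10])
```

An AI editor that operates on a single flat raster offers no such guarantee, which is exactly the consistency gap the paper targets.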

Why it matters?

This research is important because it paves the way for more powerful and user-friendly AI image editing tools. By allowing edits to be made on individual layers, it ensures changes are consistent and predictable, making it easier to achieve the desired results without unwanted side effects. This could significantly improve how people create and modify images using AI.

Abstract

Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at https://github.com/QwenLM/Qwen-Image-Layered.
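
On the architecture side, component (1), the RGBA-VAE, needs RGB inputs and RGBA layers to share one latent space, which at minimum means both must enter the encoder in a single format. One common way to do that, shown below purely as an assumption (the paper's actual RGBA-VAE design may differ), is to lift RGB images to RGBA with an all-opaque alpha channel before encoding:

```python
import numpy as np

def to_rgba(image):
    """Lift an (H, W, 3) RGB array to (H, W, 4) RGBA by appending a fully
    opaque alpha channel, so a single encoder can ingest both formats.
    Illustrative assumption only; not a confirmed detail of the paper."""
    if image.shape[-1] == 4:
        return image
    alpha = np.ones(image.shape[:-1] + (1,), dtype=image.dtype)
    return np.concatenate([image, alpha], axis=-1)
```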