Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
2025-12-01
Summary
This paper introduces Z-Image, a new image generation model that aims to compete with the best available options, like Nano Banana Pro and Seedream 4.0, while being much smaller and cheaper to run.
What's the problem?
Currently, the best image generation models are either kept proprietary by companies or are so huge that most people can't use them effectively. Open-source alternatives exist, but they require massive amounts of computing power and expensive hardware just to run, let alone fine-tune or improve. This makes it difficult for researchers and hobbyists to participate in developing this technology.
What's the solution?
The researchers created Z-Image, a model with only 6 billion parameters, far smaller than the 20-80 billion parameters of other leading open-source models. They achieved this by carefully optimizing every step of the process, from curating the data used to train the model to the design of the model itself (a Scalable Single-Stream Diffusion Transformer, or S3-DiT). They also developed a few-step distillation method that quickly improves the model's speed and quality after initial training, yielding Z-Image-Turbo. As a result, Z-Image can generate images quickly, even on consumer graphics cards with less than 16GB of VRAM.
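To see why "few-step" generation is cheaper, it helps to remember that diffusion-style models produce an image by integrating a learned trajectory over many small denoising steps. The toy sketch below uses a generic Euler integrator over a made-up velocity field (it is not Z-Image's network or its distillation code); a distilled model is trained so that a handful of steps land near the result of the long trajectory, so per-image cost shrinks roughly with the step count.

```python
def sample(velocity, steps, x0=1.0):
    """Toy Euler integration of a diffusion-style ODE from t=1 down to t=0.
    `velocity` is a stand-in for the learned denoising network."""
    x, t = x0, 1.0
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, t)  # one "denoising" step
        t -= dt
    return x

# Illustrative field x' = -x, whose exact endpoint is x0 * e^(-1) ~= 0.368.
v = lambda x, t: -x
full = sample(v, 50)   # base model: many small steps, high cost
turbo = sample(v, 4)   # distilled model: few steps, a fraction of the cost
```

The distillation objective (not shown) would push the 4-step trajectory toward the 50-step result, trading a little fidelity for a large speedup.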
Why it matters?
Z-Image is important because it shows that you don't necessarily need enormous amounts of computing power and money to create high-quality image generation models. By making a powerful model that's accessible to more people, the researchers hope to encourage further innovation and development in this field, and ultimately democratize access to this exciting technology. They've even released all the code and model details so others can build upon their work.
Abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0, and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
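The abstract names a single-stream architecture without detailing it. In general, "single-stream" diffusion transformers concatenate text and image tokens into one sequence processed by shared attention blocks, rather than routing each modality through its own branch. The NumPy sketch below illustrates only that generic joint-attention idea; all shapes, token counts, and weights are illustrative assumptions, not S3-DiT's actual configuration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def attention(x, w_q, w_k, w_v):
    """Plain single-head scaled dot-product attention over one token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 16  # illustrative hidden size
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

text = rng.standard_normal((7, d))    # 7 text tokens (made-up count)
image = rng.standard_normal((64, d))  # 64 image patch tokens (made-up count)

# Single-stream: one concatenated sequence, one shared set of weights,
# so text and image tokens attend to each other in every block.
joint = np.concatenate([text, image], axis=0)
out = attention(joint, w_q, w_k, w_v)
```

A dual-stream design would instead keep `text` and `image` in separate branches with their own weights and exchange information only at designated cross-attention points; the single-stream layout shares all parameters across both modalities.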