DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang

2026-02-13

Summary

This paper introduces DeepGen 1.0, a new artificial intelligence model that can both create images from text and edit existing images based on text instructions. It's designed to be much smaller and more efficient than other similar models while still performing just as well, or even better.

What's the problem?

Current AI models that can generate and edit images are extremely large, often exceeding 10 billion parameters, which makes them expensive to train and run and puts them out of reach for many researchers and developers. Smaller models, on the other hand, often struggle to understand complex instructions or make precise edits to images, limiting their usefulness.

What's the solution?

The researchers developed DeepGen 1.0, a 5-billion-parameter model that is significantly smaller than many competitors. To keep such a compact model capable, they introduce a technique called Stacked Channel Bridging (SCB), which pulls features from several layers of the vision-language model and fuses them with learnable 'think tokens', so the image-generation backbone receives richer, reasoning-aware guidance about how the text relates to the image (see the sketch below). They also used a three-stage training process: first aligning the understanding and generation components on large collections of image-text pairs and editing examples, then fine-tuning jointly on a high-quality mix of generation, editing, and reasoning tasks, and finally applying reinforcement learning with a mixture of reward signals (MR-GRPO) to improve the quality and accuracy of the generated images.
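To make the SCB idea more concrete, here is a minimal PyTorch sketch of how features from several VLM layers might be fused with learnable 'think tokens' to condition a generative (DiT) backbone. The dimensions, number of tapped layers, module names, and fusion mechanism are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Sketch of the Stacked Channel Bridging idea: hierarchical features from
    multiple VLM layers are stacked, fused with learnable 'think tokens', and
    projected into the conditioning space of the generative backbone.
    All sizes and the fusion layer are hypothetical choices."""

    def __init__(self, vlm_dim=2048, dit_dim=1536, num_tapped_layers=4, num_think_tokens=32):
        super().__init__()
        # Learnable 'think tokens' intended to carry reasoning-oriented guidance.
        self.think_tokens = nn.Parameter(torch.randn(num_think_tokens, vlm_dim) * 0.02)
        # One projection per tapped VLM layer so all features share a common space.
        self.layer_proj = nn.ModuleList(
            nn.Linear(vlm_dim, vlm_dim) for _ in range(num_tapped_layers)
        )
        # Fuse stacked features and think tokens, then map to the DiT's width.
        self.fuse = nn.TransformerEncoderLayer(d_model=vlm_dim, nhead=8, batch_first=True)
        self.to_dit = nn.Linear(vlm_dim, dit_dim)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: list of [batch, seq, vlm_dim] tensors,
        # one per tapped VLM layer (e.g. shallow, middle, deep).
        feats = [proj(h) for proj, h in zip(self.layer_proj, vlm_hidden_states)]
        stacked = torch.cat(feats, dim=1)                       # [B, layers*seq, D]
        think = self.think_tokens.expand(stacked.size(0), -1, -1)
        fused = self.fuse(torch.cat([think, stacked], dim=1))   # joint attention
        return self.to_dit(fused)                               # conditioning for the DiT
```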

Why it matters?

DeepGen 1.0 is important because it demonstrates that powerful image generation and editing AI doesn't necessarily require a massive model. By creating a smaller, more efficient model that performs well, the researchers are making this technology more accessible to a wider range of people and lowering the barrier to entry for multimodal research. They've also released the training code, model weights, and datasets, allowing others to build on their work.

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
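As a rough illustration of the 'mixture of rewards' ingredient in MR-GRPO, the sketch below combines several reward signals with weights and computes group-relative advantages in the spirit of GRPO. The reward names, weights, and normalization here are assumptions made for illustration only; the paper's actual reward functions, supervision signals, and update rule may differ.

```python
import torch

def mixed_reward_advantages(rewards_per_fn, weights):
    """Combine several reward functions with weights, then compute a
    group-relative advantage over samples drawn for the same prompt
    (GRPO-style). Reward names and weighting are hypothetical.

    rewards_per_fn: dict mapping reward name -> tensor of shape [group_size]
    weights:        dict mapping reward name -> scalar weight
    """
    total = sum(w * rewards_per_fn[name] for name, w in weights.items())
    # Standardize within the sampled group so advantages are relative.
    return (total - total.mean()) / (total.std() + 1e-6)

# Hypothetical usage: three reward signals scored for a group of 8 samples.
group_rewards = {
    "aesthetic":     torch.rand(8),  # perceived image quality
    "text_align":    torch.rand(8),  # prompt-image alignment
    "edit_fidelity": torch.rand(8),  # faithfulness of the requested edit
}
adv = mixed_reward_advantages(
    group_rewards, {"aesthetic": 0.3, "text_align": 0.5, "edit_fidelity": 0.2}
)
```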