SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Dongting Hu, Aarush Gupta, Magzhan Gabidolla, Arpit Sahni, Huseyin Coskun, Yanyu Li, Yerlan Idelbayev, Ahsan Mahmood, Aleksei Lebedev, Dishani Lahiri, Anujraaj Goyal, Ju Hu, Mingming Gong, Sergey Tulyakov, Anil Kag

2026-01-14

Summary

This paper focuses on making advanced image generation technology, specifically diffusion transformers, usable on phones and other devices with limited computing power.

What's the problem?

Current diffusion transformers are very good at creating images, but they demand so much processing power and memory that they are too big and slow to run directly on mobile phones or small embedded systems. In short, they are too resource-intensive for on-device use.

What's the solution?

The researchers developed a new system with three main parts. First, they designed a smaller, more efficient diffusion transformer whose attention mechanism adaptively balances the big picture (global context) with fine details (local features). Second, they used an "elastic" training method that trains one large supernetwork containing sub-models of different sizes, so a single trained model can scale itself up or down to match different hardware. Finally, they used a distillation technique, called Knowledge-Guided Distribution Matching Distillation, to transfer knowledge from larger, slower models into the smaller one, letting it generate high-quality images in as few as four steps.
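To make the first idea concrete, here is a minimal NumPy sketch of global-local sparse attention: each query attends to a small local window (fine details) plus a strided set of "global" tokens (big-picture context). This is an illustrative toy, not the paper's actual architecture; the function name, `window`, and `stride` parameters are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(q, k, v, window=4, stride=4):
    """Each query attends to its local window plus strided global tokens,
    instead of all n tokens -- roughly O(n * (window + n/stride)) work."""
    n, d = q.shape
    out = np.zeros_like(v)
    global_idx = np.arange(0, n, stride)  # coarse global context tokens
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        idx = np.unique(np.concatenate([np.arange(lo, hi), global_idx]))
        scores = q[i] @ k[idx].T / np.sqrt(d)  # scaled dot-product scores
        out[i] = softmax(scores) @ v[idx]      # weighted sum of selected values
    return out
```

Note that if the window covers the whole sequence, this reduces exactly to dense attention; shrinking `window` and growing `stride` trades quality for compute, which is the kind of knob an edge-oriented design tunes.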

Why it matters?

This work is important because it opens the door to having powerful image generation capabilities directly on our phones and other devices without needing to send data to the cloud. This means faster image creation, increased privacy, and the ability to use these tools even without an internet connection.

Abstract

Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we propose an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity and low-latency generation (e.g., 4-step) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
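The elastic-training idea in the abstract, one supernetwork whose sub-DiTs share weights, can be sketched with a single layer: the sub-networks of different capacities are prefixes of the same weight matrix, so one set of parameters serves every hardware target. This is a hedged illustration of weight-shared elastic width in general, not SnapGen++'s implementation; `ElasticLinear` and `width_ratio` are names invented for the sketch.

```python
import numpy as np

class ElasticLinear:
    """One weight matrix serves several sub-network widths by column slicing."""

    def __init__(self, d_in, d_out, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

    def forward(self, x, width_ratio=1.0):
        # A smaller sub-DiT uses only a prefix of the output channels,
        # so the full model and its sub-models share the same parameters.
        k = max(1, int(self.w.shape[1] * width_ratio))
        return x @ self.w[:, :k]
```

During elastic training, each step would sample a `width_ratio` and backpropagate through the corresponding slice; at deployment time, a phone picks the largest width its latency budget allows, with no retraining.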