
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye, Song Fei, Lei Zhu

2025-11-25

Summary

This paper focuses on improving the quality of AI-generated images, specifically high-resolution (4K) images across a wide range of aspect ratios (shapes). It introduces a new system called UltraFlux that produces more realistic and detailed 4K images than existing open-source methods.

What's the problem?

Current AI image generators, such as diffusion transformers, work well at standard resolutions (around 1K) but struggle when asked to produce very large, detailed 4K images. The problems aren't caused by a single component; they involve how the model encodes the position of things in the image, how it compresses the image data, and how it learns during training. Fixing each of these separately doesn't fully solve the issue, because the failures are tightly coupled and need to be addressed together.

What's the solution?

The researchers took a combined approach, improving both the training data and the model itself. They built MultiAspect-4K-1M, a dataset of one million 4K images covering a controlled mix of aspect ratios, with detailed bilingual captions. On top of it they trained UltraFlux, a Flux-based model with several key improvements: a positional encoding that better handles large images of different shapes (Resonance 2D RoPE with YaRN), a simple post-training scheme that improves how the VAE compresses and reconstructs 4K images, a rebalanced training objective that spreads learning effort across noise levels and frequency detail, and a staged curriculum that concentrates high-aesthetic supervision on the noisiest training steps. Together, these changes produce stable and detailed 4K images.
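
The paper doesn't publish its exact loss here, but the general idea behind the SNR-aware Huber wavelet objective (a robust Huber penalty applied to coarse and fine frequency bands, reweighted by each timestep's signal-to-noise ratio) can be sketched in a few lines of PyTorch. The `freq_split` helper, the `delta`, `high_weight`, and `snr_cap` values, and the Min-SNR-style clamp are all illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def freq_split(x):
    """Cheap low/high frequency split of an image batch (B, C, H, W).

    A stand-in for a proper wavelet transform (assumption): the low band
    is a 2x downsample (coarse structure), the high band is the residual
    detail. Assumes H and W are even.
    """
    low = F.avg_pool2d(x, kernel_size=2)
    high = x - F.interpolate(low, scale_factor=2, mode="nearest")
    return low, high

def snr_aware_huber_wavelet_loss(pred, target, snr, delta=1.0,
                                 high_weight=1.0, snr_cap=5.0):
    """Huber loss over frequency bands, rebalanced by per-sample SNR.

    pred, target: model prediction and training target, shape (B, C, H, W).
    snr: signal-to-noise ratio of each sample's diffusion timestep, shape (B,).
    delta, high_weight, snr_cap are illustrative hyperparameters.
    """
    p_low, p_high = freq_split(pred)
    t_low, t_high = freq_split(target)

    loss_low = F.huber_loss(p_low, t_low, delta=delta,
                            reduction="none").mean(dim=(1, 2, 3))
    loss_high = F.huber_loss(p_high, t_high, delta=delta,
                             reduction="none").mean(dim=(1, 2, 3))

    # Min-SNR-style weight: cap the contribution of low-noise (high-SNR)
    # timesteps so high-noise steps keep a meaningful gradient share.
    weight = torch.clamp(snr, max=snr_cap) / snr
    return (weight * (loss_low + high_weight * loss_high)).mean()
```

The clamp follows the common Min-SNR weighting trick; the paper's actual rebalancing across timesteps and frequency bands may well differ in its details.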

Why it matters?

This work is important because it pushes the boundaries of what's possible with AI image generation. Being able to reliably create high-quality 4K images opens up possibilities for various applications, like creating realistic visuals for movies, games, and design. UltraFlux performs as well as, or even better than, some commercial AI image generators, but is open-source, meaning it's freely available for others to use and build upon.

Abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
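
For readers who want something concrete for component (i), below is a minimal sketch of plain axial 2D RoPE with YaRN-style frequency scaling, assuming a square training window and illustrative constants (`train_hw`, `beta_fast`, `beta_slow` are guesses, not the paper's settings). The paper's Resonance variant further adapts frequencies to the training window and aspect ratio; that refinement, along with YaRN's attention temperature term, is omitted here:

```python
import torch

def yarn_scaled_freqs(n_freqs, train_len, target_len,
                      base=10000.0, beta_fast=32.0, beta_slow=1.0):
    """YaRN-style NTK-by-parts scaling of RoPE frequencies for one axis.

    High-frequency channels (wavelengths much shorter than the training
    window) are extrapolated unchanged; low-frequency channels are
    interpolated by the extension factor; wavelengths in between blend.
    """
    scale = max(target_len / train_len, 1.0)
    inv_freq = 1.0 / (base ** (torch.arange(n_freqs).float() / n_freqs))
    wavelen = 2 * torch.pi / inv_freq
    # ramp: 0 -> pure extrapolation (keep freq), 1 -> pure interpolation
    lo, hi = train_len / beta_fast, train_len / beta_slow
    ramp = ((wavelen - lo) / (hi - lo)).clamp(0.0, 1.0)
    return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp

def rope_2d_tables(h, w, head_dim, train_hw=(64, 64)):
    """Axial 2D RoPE: half the rotary channels encode the row index and
    half the column index, each axis with its own scaled frequencies."""
    fy = yarn_scaled_freqs(head_dim // 4, train_hw[0], h)
    fx = yarn_scaled_freqs(head_dim // 4, train_hw[1], w)
    ang_y = torch.outer(torch.arange(h).float(), fy)   # (h, head_dim//4)
    ang_x = torch.outer(torch.arange(w).float(), fx)   # (w, head_dim//4)
    ang = torch.cat([ang_y[:, None, :].expand(h, w, -1),
                     ang_x[None, :, :].expand(h, w, -1)], dim=-1)
    return ang.cos(), ang.sin()                        # each (h, w, head_dim//2)
```

The returned cos/sin tables would rotate pairs of query/key channels exactly as in standard RoPE; at a 4K target resolution the token grid is several times larger than the training window, which is where the per-frequency interpolation ramp does its work.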