Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng

2025-12-05

Summary

This paper introduces a new way to improve how images are generated using Latent Diffusion Models (LDMs), a type of AI that creates images from text or other inputs.

What's the problem?

Current LDMs generate images all at once, trying to create both the overall meaning and the fine details simultaneously. However, it's more natural (for our brains, and likely for AI) to first establish the big picture – the 'what' of the image – before filling in specifics like texture. Existing methods that inject helpful information from other AI models don't respect this natural order, which can make image generation less efficient and less detailed.

What's the solution?

The researchers developed a method called Semantic-First Diffusion (SFD). This approach separates the image generation into two steps. First, it creates a 'semantic latent' which represents the core meaning of the image using a pre-trained visual encoder. Then, it creates a 'texture latent' which handles the fine details. Crucially, SFD generates the semantic part *before* the texture part, giving the texture generation a clearer idea of what it should be adding detail to. They achieve this by using different 'noise schedules' – essentially controlling how quickly each part is refined – so the semantic part is finished first.
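The key mechanism above – denoising two latents asynchronously so semantics finish first – can be sketched as two noise schedules separated by a temporal offset. This is a minimal illustrative sketch, not the paper's implementation: the toy linear schedule, function names, and the specific offset value are all assumptions for demonstration.

```python
import numpy as np

def linear_alpha(t):
    """Toy linear noise schedule: signal level alpha is 1 (clean image)
    at t=0 and 0 (pure noise) at t=1. Real diffusion models use more
    sophisticated schedules, but the ordering argument is the same."""
    return 1.0 - np.clip(t, 0.0, 1.0)

def asynchronous_timesteps(t, offset=0.2):
    """Give the semantic latent a head start: at the same sampling step t,
    the semantic latent sits at an earlier (less noisy) effective timestep
    than the texture latent, so semantics are resolved first."""
    t_semantic = np.clip(t - offset, 0.0, 1.0)  # leads by `offset`
    t_texture = np.clip(t, 0.0, 1.0)
    return t_semantic, t_texture

# Midway through sampling, the semantic latent is already cleaner than
# the texture latent, so it can act as a high-level anchor for texture.
t_sem, t_tex = asynchronous_timesteps(0.5, offset=0.2)
print(linear_alpha(t_sem), linear_alpha(t_tex))  # semantic signal > texture signal
```

Running the snippet shows the semantic latent carrying more signal than the texture latent at every shared step, which is the 'semantics lead the way' property the paper exploits.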

Why it matters?

SFD significantly improves the quality of generated images, achieving state-of-the-art results on standard benchmarks like ImageNet (FID 1.06 with LightningDiT-XL). It also trains much more efficiently, converging up to 100 times faster than the original DiT. Furthermore, it can be combined with existing image generation techniques such as ReDi and VA-VAE to improve them, showing that prioritizing semantic understanding is a valuable approach for creating more realistic and efficient AI-generated images.

Abstract

Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.