
REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou

2025-12-19


Summary

This paper introduces a new way to improve how well AI models create images, focusing on making the images more semantically accurate and on reducing the training time needed to reach high quality.

What's the problem?

Current image-generating AI models, called latent diffusion models, are really good at making images, but they are slow to learn what things *mean*. They don't fully use the detailed understanding of images that other AI models, called Vision Foundation Models, already have. Existing attempts to bring in this understanding either use only a small slice of the information these models offer or attach it in a way that isn't fully integrated into the image creation process, so training stays slow and image quality is lower than it could be.

What's the solution?

The researchers developed a framework called REGLUE that jointly models three things inside one diffusion model: the compressed image data the generator normally works with (the VAE latents), detailed local understanding of the image from a Vision Foundation Model (roughly, what each small patch of the image represents), and a single global summary of the whole image. A lightweight 'compressor' network takes the rich, multi-layer features from the Vision Foundation Model and squeezes them into a compact form the image generator can handle. They also add an alignment step that nudges the generator's internal understanding to match the Vision Foundation Model's understanding.
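To make the idea concrete, here is a rough PyTorch-style sketch of the compressor-plus-entanglement step described above. All layer sizes, module names, and shapes are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Toy convolutional compressor: fuses multi-layer VFM patch features
    into a low-dimensional, spatially structured map (hypothetical shapes)."""
    def __init__(self, vfm_dim=768, num_layers=4, out_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(vfm_dim * num_layers, 256, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(256, out_dim, kernel_size=1),
        )

    def forward(self, vfm_feats):
        # vfm_feats: list of (B, vfm_dim, H, W) maps from several VFM layers
        x = torch.cat(vfm_feats, dim=1)   # stack layers along the channel axis
        return self.net(x)                # (B, out_dim, H, W)

# Entanglement idea: the compressed semantics are concatenated with the VAE
# latents and denoised jointly by the diffusion backbone (toy shapes below).
B, H, W = 2, 32, 32
vae_latents = torch.randn(B, 4, H, W)                      # VAE image latents
vfm_feats = [torch.randn(B, 768, H, W) for _ in range(4)]  # multi-layer VFM features
semantics = SemanticCompressor()(vfm_feats)                # (B, 8, H, W)
joint_input = torch.cat([vae_latents, semantics], dim=1)   # (B, 12, H, W) joint state
```

The design choice sketched here is that the semantics are generated jointly with the image latents inside the diffusion process, rather than being used only as an external training signal.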

Why it matters?

This work is important because it lets AI models produce higher-quality images with less training by making better use of the visual knowledge that already exists in Vision Foundation Models. By integrating both the fine-grained and the whole-image understanding from these models directly into the generation process, REGLUE is a step toward more realistic and semantically accurate AI-generated images, and it gets there more efficiently than previous methods.

Abstract

Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .
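The "external alignment loss" mentioned in the abstract is in the spirit of representation-alignment objectives such as REPA. The minimal sketch below shows one common way such a term can be written; all helper names and dimensions are purely illustrative, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(backbone_feats, vfm_feats, proj):
    """Illustrative alignment term: pull projected internal diffusion features
    toward frozen VFM patch features via cosine similarity.
    backbone_feats: (B, N, D_model) tokens from an intermediate backbone block
    vfm_feats:      (B, N, D_vfm) frozen VFM patch features (no gradient)
    proj:           small projection mapping D_model -> D_vfm
    """
    pred = proj(backbone_feats)
    cos = F.cosine_similarity(pred, vfm_feats.detach(), dim=-1)  # (B, N)
    return (1.0 - cos).mean()

# Example usage with toy shapes (all names and sizes are assumptions)
B, N, D_model, D_vfm = 2, 256, 768, 1024
proj = torch.nn.Linear(D_model, D_vfm)
loss = alignment_loss(torch.randn(B, N, D_model), torch.randn(B, N, D_vfm), proj)
```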