
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang

2025-12-30


Summary

This paper focuses on improving the 'text encoder', the component of text-to-image and text-to-video AI systems that reads a text description and translates it into a representation the visual generator can use.
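To make the encoder's role concrete, here is a minimal toy sketch of what a text encoder produces. All names here (the tiny vocabulary, embedding table, and `encode` function) are hypothetical stand-ins, not the paper's actual encoder, which is a large language model; the point is only the shape of the output a diffusion model consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding table standing in for a real
# pretrained text encoder (the paper's encoders are large language models).
VOCAB = {"a": 0, "cat": 1, "on": 2, "red": 3, "sofa": 4}
EMBED_DIM = 8
embedding_table = rng.normal(size=(len(VOCAB), EMBED_DIM))

def encode(prompt: str) -> np.ndarray:
    """Map a prompt to a (seq_len, dim) array of token embeddings.

    A diffusion model would consume these vectors as conditioning,
    typically through cross-attention inside its denoising network.
    """
    token_ids = [VOCAB[w] for w in prompt.lower().split()]
    return embedding_table[token_ids]

emb = encode("a red cat on a sofa")
print(emb.shape)  # (6, 8): one embedding vector per token
```

The generator never sees the raw text, only these vectors, which is why the quality of the encoder so strongly shapes the quality of the generated image or video.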

What's the problem?

Developing good text encoders is hard because it's difficult to know if an encoder is actually any good *before* building a whole image or video generator with it, which takes a lot of time and computing power. Also, taking existing language models (like those used for chatbots) and adapting them to create visuals isn't straightforward; they need to learn to represent information in a way that's useful for image creation.

What's the solution?

The researchers introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. As part of it, they built a dedicated benchmark, TED-6K, that lets them quickly and accurately evaluate how well an encoder understands text *without* training a full image generator. They also developed a two-stage training method for text encoders: first adapting a multimodal large language model for visual understanding, and then applying a layer-wise weighting step that selects the features most useful for image generation.

Why it matters?

This work is important because it provides a much faster and more reliable way to build better text-to-image and text-to-video AI systems. By being able to quickly assess and improve the text encoder, developers can create AI that generates images and videos that are more accurate, detailed, and aligned with the original text descriptions, ultimately leading to more powerful and creative AI tools.

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about 750 times faster. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
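The abstract's central evaluation claim is that scores on the cheap TED-6K benchmark rank encoders the same way an expensive end-to-end generation run would. That kind of claim is typically checked with a rank correlation; the sketch below computes Spearman's rho on made-up scores for five hypothetical encoders (the numbers and the `spearman` helper are illustrative assumptions, not the paper's data or code).

```python
import numpy as np

# Hypothetical scores for five candidate text encoders: their cheap TED-6K
# benchmark score versus a much costlier downstream generation metric.
ted6k_scores = np.array([0.61, 0.74, 0.58, 0.80, 0.69])
downstream_scores = np.array([0.42, 0.55, 0.40, 0.63, 0.50])

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation (no ties assumed), implemented directly."""
    rx = x.argsort().argsort().astype(float)  # rank of each score
    ry = y.argsort().argsort().astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

rho = spearman(ted6k_scores, downstream_scores)
print(rho)  # 1.0 for this toy data: the two rankings agree perfectly
```

A rho near 1 on real measurements would justify using the benchmark as a proxy, which is what makes the reported ~750x evaluation speedup actionable.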