Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions
Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady
2025-11-11
Summary
This paper introduces a new text-to-image model called FIBO that offers much more precise control over generated images by conditioning on long, detailed text descriptions.
What's the problem?
Current text-to-image models are really good at making images from short descriptions, but they struggle when you try to be very specific or give them a lot of detail in the text prompt. They tend to 'fill in the blanks' based on what most people would expect, rather than exactly what you asked for, making it hard for professionals who need precise results. Essentially, there's a disconnect between how much detail you can put *into* the text and how much control you have over the final image.
What's the solution?
The researchers trained FIBO using very long and detailed descriptions of images, where each image was labeled with a consistent set of specific features. To handle these long descriptions efficiently, they developed a new technique called DimFusion, which cleverly combines information from a language model without making the process too slow. They also created a new way to test how well these models actually follow instructions, called TaBR, which checks if an image can be accurately recreated just from its detailed description.
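The paper only sketches DimFusion at a high level here, but the core idea (fusing intermediate hidden states from a lightweight LLM into the image tokens without lengthening the image token sequence) can be illustrated with a minimal cross-attention sketch. This is an assumption-laden toy, not the paper's implementation: the projection matrices are random, and `dimfusion_cross_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dimfusion_cross_attention(image_tokens, llm_hidden, d_model, rng=None):
    """Toy fusion: image tokens attend to intermediate LLM hidden states.

    The output has the same number of tokens as `image_tokens`, so the
    long caption never inflates the sequence the diffusion backbone sees.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_llm = llm_hidden.shape[-1]
    # Hypothetical learned projections, stubbed with random weights.
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_llm, d_model)) / np.sqrt(d_llm)
    Wv = rng.standard_normal((d_llm, d_model)) / np.sqrt(d_llm)
    q = image_tokens @ Wq          # (n_img, d_model)
    k = llm_hidden @ Wk            # (n_txt, d_model)
    v = llm_hidden @ Wv            # (n_txt, d_model)
    attn = softmax(q @ k.T / np.sqrt(d_model))   # (n_img, n_txt)
    return image_tokens + attn @ v               # residual; length unchanged

img = np.random.default_rng(1).standard_normal((64, 128))    # 64 image tokens
txt = np.random.default_rng(2).standard_normal((1000, 256))  # long-caption states
out = dimfusion_cross_attention(img, txt, d_model=128)
assert out.shape == img.shape  # caption length never widens the image sequence
```

The key property the sketch demonstrates is that the caption can be arbitrarily long (1,000 text states here) while the fused output keeps the image-token shape.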
Why does it matter?
This work is important because it moves text-to-image models closer to being truly useful tools for professionals like designers and artists. By giving users much more control over the image creation process, it opens up possibilities for creating very specific and customized visuals, and the new evaluation method provides a better way to measure and improve these models.
Abstract
Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
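The TaBR protocol described above can be sketched as a small evaluation loop: caption each real image, regenerate an image from the caption alone, and score how close the reconstruction is to the original. The sketch below is a minimal illustration under stated assumptions; `tabr_score` and its callables are hypothetical names, and in real use the stand-ins would be a VLM captioner, a generator such as FIBO, and an image-embedding model.

```python
import numpy as np

def tabr_score(images, caption_fn, generate_fn, embed_fn):
    """Text-as-a-Bottleneck Reconstruction, sketched.

    The caption is the only channel between the real image and its
    reconstruction, so the score reflects how much visual information
    the caption (and the generator's adherence to it) preserves.
    """
    scores = []
    for img in images:
        caption = caption_fn(img)       # image -> long structured caption
        recon = generate_fn(caption)    # caption -> regenerated image
        a, b = embed_fn(img), embed_fn(recon)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        scores.append(cos)
    return float(np.mean(scores))

# Toy stand-ins: a lossless "caption" round-trips the image exactly,
# so the reconstruction score should be (near) perfect.
rng = np.random.default_rng(0)
imgs = [rng.standard_normal(16) for _ in range(4)]
score = tabr_score(
    imgs,
    caption_fn=lambda im: im.tobytes(),      # lossless bottleneck (toy)
    generate_fn=lambda c: np.frombuffer(c),  # perfect "generator" (toy)
    embed_fn=lambda im: im,                  # identity "embedding" (toy)
)
assert score > 0.99  # lossless bottleneck -> near-perfect reconstruction
```

With real components, a lossy caption or a generator that ignores details would pull the score down, which is exactly the controllability and expressiveness gap TaBR is designed to measure.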