The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
2025-08-07
Summary
This paper examines how transformer-based text-to-image diffusion models generate artworks from text prompts. Using cross-attention heatmaps, which show where in the image the model attends for each word, it studies how well these models keep content and style separate in the images they generate.
What's the problem?
When these AI models generate an image from text, it is not always clear how they balance content (the subject described) against style (the artistic look). That balance can affect the quality and creativity of the art produced.
What's the solution?
The authors analyze the models using cross-attention heatmaps, which visualize how each part of the text prompt influences each region of the image. These heatmaps help reveal how the models interpret artistic prompts and how they separate content from style.
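To make the idea of a cross-attention heatmap concrete, here is a minimal sketch in Python with NumPy. It assumes a single attention head in which image patches act as queries and prompt tokens act as keys. The function name, the array shapes, and the random inputs are illustrative assumptions and are not taken from the paper's code. In practice such maps are typically aggregated over attention heads, layers, and denoising steps.

```python
import numpy as np


def cross_attention_heatmaps(image_queries, text_keys, grid_shape):
    """Return one spatial heatmap per prompt token (hypothetical helper).

    image_queries: array of shape (num_patches, dim), one row per image patch
    text_keys:     array of shape (num_tokens, dim), one row per prompt token
    grid_shape:    (height, width) with height * width == num_patches
    """
    dim = image_queries.shape[-1]
    # Scaled dot-product scores between every patch and every token.
    scores = image_queries @ text_keys.T / np.sqrt(dim)
    # Softmax over tokens, so each patch distributes its attention across the prompt.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Reshape each token's column of weights back onto the patch grid.
    height, width = grid_shape
    return weights.T.reshape(text_keys.shape[0], height, width)


# Toy example: random vectors stand in for a real model's activations.
rng = np.random.default_rng(0)
tokens = ["cow", "Rembrandt"]            # one content word, one style word
queries = rng.normal(size=(16, 8))       # 4 x 4 grid of patches, dim 8
keys = rng.normal(size=(len(tokens), 8))
heatmaps = cross_attention_heatmaps(queries, keys, (4, 4))
for token, heatmap in zip(tokens, heatmaps):
    print(token, heatmap.round(2))
```

Comparing the heatmap of a content word with that of a style word is one way to see whether the two influence the same image regions or different ones.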
Why does it matter?
Understanding how AI models handle artistic prompts can help improve their design, making them better tools for artists and creators who want AI-generated images to match their vision more closely.
Abstract
Transformer-based text-to-image diffusion models show varying degrees of content-style separation in generated artworks, as revealed by cross-attention heatmaps.