Transformer-based text-to-image diffusion models show varying degrees of content-style separation in generated artworks, as revealed by cross-attention heatmaps.

This paper examines how transformer-based text-to-image models interpret artistic prompts when generating artworks. Using heatmaps that show where a model focuses for each word, it studies how well these models keep content and style separate in the images they generate.

The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Summary

What's the problem?

What's the solution?

Why does it matter?

Abstract