ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau

2025-02-07

Summary

This paper introduces ConceptAttention, a new method that helps us understand how AI models process images and text together. It creates clear visual maps showing where specific words or concepts appear in an image, without needing any extra training.

What's the problem?

AI models, especially those that generate images from text, are often like 'black boxes'—we can see the input and output but don't understand how they make decisions. This lack of transparency makes it hard to trust or control these systems, especially in important applications.

What's the solution?

The researchers developed ConceptAttention, which reuses the attention layers in diffusion transformers (a type of AI model) to create detailed maps showing how words relate to parts of an image. By repurposing the model's existing parameters and applying a linear projection in the output space of the attention layers, the method produces sharper and more accurate maps than previous approaches based on cross-attention. It even achieves state-of-the-art results on zero-shot image segmentation, where the AI identifies parts of an image without ever being trained on that specific task.
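The core idea above can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation: the function name, shapes, and the toy data are assumptions. It assumes you already have the attention-layer *output* embeddings for the image patches and for each concept token (e.g. "dog", "sky"), and shows the linear projection that turns them into a per-concept saliency map.

```python
import numpy as np

def concept_saliency(image_outputs, concept_outputs):
    """Hypothetical sketch of the ConceptAttention-style projection.

    image_outputs:   (num_patches, d)  attention-layer outputs for image tokens.
    concept_outputs: (num_concepts, d) attention-layer outputs for concept tokens.
    Returns:         (num_concepts, num_patches) saliency scores in [0, 1].
    """
    # Linear projection in the attention output space: the dot product
    # between each concept embedding and each image-patch embedding.
    scores = concept_outputs @ image_outputs.T              # (C, P)
    # Softmax across concepts assigns each patch a probability per concept,
    # which can be reshaped into a 2D saliency map per concept.
    scores = scores - scores.max(axis=0, keepdims=True)     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return probs

# Toy example: 2 concepts, 4 image patches, 8-dimensional embeddings.
rng = np.random.default_rng(0)
sal = concept_saliency(rng.normal(size=(4, 8)), rng.normal(size=(2, 8)))
print(sal.shape)  # (2, 4): one saliency value per concept per patch
```

For zero-shot segmentation, each patch would simply be assigned to its highest-scoring concept. The key point from the paper is *where* these embeddings come from: projecting in the attention layers' output space yields much sharper maps than reading off the usual cross-attention weights.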

Why it matters?

This research is important because it makes AI systems more understandable and trustworthy by showing how they connect words to images. It also demonstrates that these models can be used for other tasks, like segmenting images, without requiring extra training. This could lead to safer and more transparent AI applications in fields like medicine, robotics, and content analysis.

Abstract

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.