What the DAAM: Interpreting Stable Diffusion Using Cross Attention (2210.04885v5)

Published 10 Oct 2022 in cs.CV and cs.CL

Abstract: Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.

Citations (134)

Summary

  • The paper introduces DAAM, revealing how cross-attention layers map word influences to image regions in Stable Diffusion.
  • It aggregates attention scores across U-Net layers to achieve competitive mIoU scores on segmentation benchmarks.
  • The analysis of syntactic dependencies and feature entanglement offers practical insights for enhancing generative AI interpretability.

Interpreting Stable Diffusion with DAAM

The paper "What the DAAM: Interpreting Stable Diffusion Using Cross Attention" offers a structured approach to understanding large-scale diffusion models, particularly focusing on the text-to-image synthesis process. The authors introduce a method called Diffusion Attentive Attribution Maps (DAAM), leveraging cross-attention layers within the U-Net architecture of diffusion networks to generate pixel-level attribution maps.

Understanding the decision-making processes of generative models, particularly in text-to-image synthesis, is crucial for advancing AI interpretability. With this research, the authors aim to illuminate how individual prompt words influence the generated image, using Stable Diffusion, whose public release makes such an analysis possible.

Methodology and Evaluation

DAAM generates attribution maps by aggregating cross-attention scores across layers and time steps, tracing the impact of each input word on the generated imagery to create a visual interpretation framework. Its correctness is evaluated on semantic segmentation tasks using the COCO-Gen and Unreal-Gen datasets, with comparisons against both supervised and unsupervised segmentation methods.
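
Conceptually, the aggregation looks like the following minimal sketch. It illustrates the idea rather than the authors' implementation: the (heads, pixels, tokens) attention-map shape, square spatial grids, and min-max normalization are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def aggregate_heat_map(attn_maps, token_idx, out_size=64):
    """Aggregate cross-attention scores into one heat map for a single token.

    attn_maps: list of tensors, one per (layer, time step), each shaped
               (heads, h*w, num_tokens), with h = w varying by layer.
    token_idx: index of the prompt token to attribute.
    out_size:  side length of the common upscaled grid (latent resolution).
    """
    accum = torch.zeros(out_size, out_size)
    for attn in attn_maps:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)
        # Attention column for this token, averaged over heads.
        m = attn[:, :, token_idx].mean(dim=0).reshape(1, 1, side, side)
        # Bilinearly upscale the low-resolution map to the common grid.
        m = F.interpolate(m, size=(out_size, out_size), mode='bilinear',
                          align_corners=False)
        accum += m.squeeze()
    # Min-max normalize to [0, 1] for visualization and thresholding.
    accum -= accum.min()
    return accum / accum.max().clamp_min(1e-8)
```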

The results show that DAAM achieves mean intersection over union (mIoU) scores competitive with established segmentation models. With the binarization threshold set to τ = 0.4, DAAM performs robustly, indicating its effectiveness in this novel context and establishing it as a strong open-vocabulary baseline for such tasks.
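
Turning a heat map into a segmentation prediction is a matter of thresholding, after which IoU can be computed against a ground-truth mask. A minimal sketch, with normalized torch heat maps and boolean masks assumed as inputs:

```python
import torch

def binarize(heat_map: torch.Tensor, tau: float = 0.4) -> torch.Tensor:
    """Binarize a normalized heat map; the paper reports tau = 0.4."""
    return heat_map >= tau

def iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Intersection over union between two boolean masks."""
    union = (pred | gt).sum().item()
    return (pred & gt).sum().item() / union if union > 0 else 0.0

# mIoU is then the mean IoU over all evaluated (heat map, mask) pairs:
# miou = sum(iou(binarize(h), m) for h, m in pairs) / len(pairs)
```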

Insights and Analyses

The paper extends the evaluation to all parts of speech through human-annotated validity checks, demonstrating DAAM's applicability beyond nouns. For parts of speech such as verbs and adjectives, DAAM maps were rated "fair" to "good," reinforcing their semantic significance.

A distinctive feature of this research is its syntactic analysis, which maps textual dependencies to spatial relationships. The study of head–dependent DAAM map interactions spans ten common syntactic relations, yielding insights into the visuolinguistic patterns that diffusion models encode. The results highlight relations in which either the head or the dependent word's influence prevails, as in subject–verb constructions, adding depth to our understanding of pixel-level interactions.
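
One simple way to quantify such head-versus-dependent dominance is to ask, within the region attributed to either word, where the head's map is stronger. The statistic below is illustrative, not the paper's exact measure:

```python
def head_dominance(head_map, dep_map, tau: float = 0.4) -> float:
    """Fraction of the jointly attributed region where the head word's
    (torch) heat map is stronger than the dependent word's."""
    union = (head_map >= tau) | (dep_map >= tau)
    if union.sum() == 0:
        return 0.5  # neither word is attributed anywhere; call it a tie
    head_stronger = (head_map > dep_map) & union
    return (head_stronger.sum() / union.sum()).item()
```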

Moreover, the examination of cohyponym entanglement is particularly noteworthy. It shows that semantically similar words in the same prompt (e.g., "giraffe" and "zebra") yield less distinct objects and degraded generation quality, a phenomenon the authors attribute to feature entanglement. Additionally, descriptive adjectives are observed to attend over entire images rather than just the objects they modify, a finding that demonstrates DAAM's ability to tease apart complicated attribute distributions.
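
A corresponding entanglement probe can score how much two cohyponyms' attribution maps overlap; high overlap suggests their features were generated in the same region. Again, this is an illustrative diagnostic consistent with the paper's analysis rather than its exact procedure:

```python
def entanglement(map_a, map_b, tau: float = 0.4) -> float:
    """IoU between two words' binarized (torch) heat maps; a high value for
    cohyponyms such as 'giraffe' and 'zebra' signals entangled features."""
    a, b = map_a >= tau, map_b >= tau
    union = (a | b).sum().item()
    return (a & b).sum().item() / union if union > 0 else 0.0
```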

Implications and Future Directions

This work has substantial implications for refining image-generation techniques. Understanding how syntax and semantics manifest at the pixel level can lead to better feature disentanglement, enhancing the control and coherence of diffusion-model outputs. Furthermore, the insights drawn from DAAM could inform improvements in unsupervised parsing techniques and compositionality in AI systems.

Potential future directions include further exploration of syntactic–geometric probes in diffusion models, akin to the probing strategies used for language models such as BERT. Extending DAAM to more nuanced syntactic structures could uncover deeper linguistic capabilities or limitations within generative models, which is crucial for advancing AI interpretability and reliability in creative, autonomous applications.

In summary, "What the DAAM" embarks on a detailed exploration of visuolinguistic mappings in diffusion models, offering a crucial lens for understanding AI-generated art and imagery. This research demonstrably pushes the boundaries in the pursuit of interpreting AI models, applying rigorous methods to uncover latent dynamics in text-to-image generation.
