
Emergence and Evolution of Interpretable Concepts in Diffusion Models (2504.15473v1)

Published 21 Apr 2025 in cs.CV, cs.LG, and eess.IV

Abstract: Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from noise through a process called reverse diffusion. Understanding the dynamics of the reverse diffusion process is crucial in steering the generation and achieving high sample quality. However, the inner workings of diffusion models are still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic Interpretability (MI) techniques, such as Sparse Autoencoders (SAEs), aim at uncovering the operating principles of models through granular analysis of their internal representations. These MI techniques have been successful in understanding and steering the behavior of LLMs at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we show that the discovered concepts have a causal effect on the model output and can be leveraged to steer the generative process. We design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in the early stages of diffusion, image composition can be effectively controlled, (2) in the middle stages, image composition is finalized but stylistic interventions remain effective, and (3) in the final stages, only minor textural details are subject to change.

Summary

  • The paper demonstrates that a coherent image layout emerges at early diffusion stages, with initial activations predicting final object positions (IoU ≈ 0.26).
  • It employs Sparse Autoencoders to decompose SDv1.4 activations at different timesteps, revealing distinct roles for coarse layout, refined composition, and style.
  • Causal interventions confirm that targeted modifications yield time-adaptive editing: spatial changes effectively steer composition early while global tweaks influence style mid-process.

This paper explores the internal workings of text-to-image diffusion models, specifically focusing on how interpretable visual concepts emerge and evolve throughout the reverse diffusion process. The authors utilize Sparse Autoencoders (SAEs) to analyze the activations of Stable Diffusion v1.4 (SDv1.4).

Key Goals and Questions:

  • Determine the level of image representation present in the early "chaotic" stages of generation.
  • Understand how visual representations evolve over different stages (early, middle, final) of the diffusion process.
  • Investigate whether concepts discovered via SAEs can be used to interpretably steer the image generation process.
  • Analyze how the effectiveness of such steering interventions changes depending on the diffusion timestep.

Methodology:

  1. Sparse Autoencoders (SAEs): The paper employs k-sparse autoencoders (using a TopK activation) to decompose the internal activations of SDv1.4 (specifically, the residual updates from its cross-attention transformer blocks) into a sparse, overcomplete basis of interpretable features or "concepts". Separate SAEs are trained for different timesteps (t=1.0 for the early, t=0.5 for the middle, and t=0.0 for the final stage), different U-Net blocks (down, mid, up), and for both text-conditioned ('cond') and null-text ('uncond') activations. Illustrative code sketches of the main methodological steps follow this list.
  2. Activation Collection: Activations from SDv1.4's cross-attention blocks (down_blocks.2.attentions.1, mid_block.attentions.0, up_blocks.1.attentions.0) were collected at the specified timesteps using 1.5 million prompts from the LAION-COCO dataset.
  3. Concept Interpretation: A novel, scalable, vision-only pipeline is proposed to assign semantic labels to the learned SAE concepts, avoiding potential LLM biases.
    • Images are generated, and SAE activations are cached.
    • A pre-trained vision pipeline (RAM, Grounding DINO, SAM) generates object segmentation masks and labels for the images.
    • A "concept dictionary" is built by matching SAE activation maps (based on Intersection over Union - IoU) with the object masks. Each concept ID (CID) is associated with a list of object labels it frequently overlaps with.
    • Concepts are embedded using the mean Word2Vec embedding of their associated object labels.
  4. Predicting Image Composition: The concept dictionary is used to predict the final image layout from SAE features at any given timestep. This involves mapping spatial locations to conceptual embeddings based on activated concepts and comparing these to target object embeddings from the prompt via cosine similarity.
  5. Causal Interventions: To test the causal role of concepts, two types of interventions are performed by modifying the activations based on the SAE's concept vectors (f_c):
    • Spatially Targeted: To control layout, activations corresponding to an object concept are amplified in a target image region and suppressed elsewhere.
    • Global: To control style, activations corresponding to a style concept are added globally across the image.
    • Intervention strengths (β) are normalized based on the activation norms at each location.
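
Below is a minimal sketch of a k-sparse (TopK) autoencoder in the spirit of step 1. The dimensions, expansion factor, and training details are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSAE(nn.Module):
    """k-sparse autoencoder: keep only the k largest concept activations per location."""

    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)   # activations -> concept pre-activations
        self.decoder = nn.Linear(n_concepts, d_model)   # columns of decoder.weight are concept directions f_c
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = F.relu(self.encoder(x))
        topk = torch.topk(z, self.k, dim=-1)
        # Zero out everything except the top-k concepts (TopK sparsity).
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))


# Training sketch: reconstruct cached cross-attention residual updates.
sae = TopKSAE(d_model=1280, n_concepts=16384, k=32)      # sizes are assumptions
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(4096, 1280)                          # stand-in for cached SDv1.4 activations
loss = F.mse_loss(sae(batch), batch)
loss.backward()
opt.step()
```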
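
The activation collection of step 2 can be approximated with PyTorch forward hooks on the three cross-attention blocks named above. The model ID, the handling of the block output, and the residual-update computation are assumptions about the diffusers API and may need adjustment for a given library version.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

cache = {}
block_names = [
    "down_blocks.2.attentions.1",
    "mid_block.attentions.0",
    "up_blocks.1.attentions.0",
]


def make_hook(name):
    def hook(module, inputs, output):
        hidden_in = inputs[0]                                   # hidden states entering the block
        hidden_out = output.sample if hasattr(output, "sample") else output[0]
        # The paper analyzes the block's residual update, i.e. output minus input.
        cache.setdefault(name, []).append((hidden_out - hidden_in).detach().cpu())
    return hook


for name, module in pipe.unet.named_modules():
    if name in block_names:
        module.register_forward_hook(make_hook(name))

# One generation fills `cache` with per-step residual updates for each block;
# with classifier-free guidance each batch contains both 'uncond' and 'cond' halves.
_ = pipe("a church with a red door", num_inference_steps=50)
```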
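
Steps 3 and 4 reduce to two operations: matching concepts to segmentation masks by IoU to build the dictionary, and scoring each spatial location by cosine similarity between its concept embedding and the prompt's object embedding. The thresholds, the one-concept-per-location simplification, and the embedding lookup below are assumptions; the paper uses RAM, Grounding DINO, and SAM for masks and Word2Vec for label embeddings.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0


def build_concept_dictionary(concept_maps, object_masks, object_labels, iou_thresh=0.3):
    """Associate each concept ID with the object labels its activation map overlaps with."""
    dictionary = {}
    for cid, cmap in concept_maps.items():                  # cmap: binarized H x W activation map
        for mask, label in zip(object_masks, object_labels):
            if iou(cmap, mask) >= iou_thresh:
                dictionary.setdefault(cid, []).append(label)
    return dictionary


def predict_object_mask(active_cids, concept_embeddings, target_embedding, sim_thresh=0.5):
    """Score each location by cosine similarity between its concept embedding
    (mean Word2Vec embedding of the concept's labels) and the target object's embedding."""
    h, w = active_cids.shape                                # one concept ID per location (simplification)
    scores = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            e = concept_embeddings[active_cids[i, j]]
            scores[i, j] = e @ target_embedding / (
                np.linalg.norm(e) * np.linalg.norm(target_embedding) + 1e-8
            )
    return scores > sim_thresh                              # predicted mask, evaluated against SAM masks by IoU
```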
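
The two intervention types of step 5 can be sketched as follows: a spatially targeted edit adds the concept direction f_c inside a target region and subtracts it elsewhere, while a global (style) edit adds it everywhere, in both cases scaling the strength β by the per-location activation norm. The exact normalization and hook placement are assumptions rather than the paper's exact procedure.

```python
import torch


def spatial_intervention(acts: torch.Tensor,      # (H*W, d) cross-attention residual update
                         f_c: torch.Tensor,       # (d,) unit-norm concept direction from the SAE decoder
                         region: torch.Tensor,    # (H*W,) 1 inside the target region, 0 outside
                         beta: float) -> torch.Tensor:
    norms = acts.norm(dim=-1, keepdim=True)       # per-location activation norm
    sign = region.unsqueeze(-1) * 2.0 - 1.0       # +1 inside the region, -1 outside
    return acts + beta * sign * norms * f_c       # amplify the concept inside, suppress it outside


def global_intervention(acts: torch.Tensor, f_c: torch.Tensor, beta: float) -> torch.Tensor:
    norms = acts.norm(dim=-1, keepdim=True)
    return acts + beta * norms * f_c              # add the (style) concept direction everywhere
```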

Key Findings:

  1. Early Emergence of Composition: Surprisingly, the coarse layout and composition of the final image emerge extremely early, even before the first full reverse diffusion step is completed (t=1.0). The spatial distribution of activated concepts at this stage can predict the final locations of objects (IoU ≈ 0.26) even when the model's predicted image output (x̃_0) shows only noise.
  2. Evolution of Representations:
    • Early Stage (t=1.0): Primarily defines coarse composition. Concepts provide rough segmentation.
    • Middle Stage (t=0.5): Composition is largely finalized. Segmentation predictions saturate in accuracy. More abstract concepts like style emerge. Concepts become more granular (e.g., differentiating 'church' from 'door' on the church).
    • Final Stage (t=0.0): Composition is highly refined. Only minor textural details change.
  3. Concept Types: Activations correspond to local semantics (object parts), global semantics (style, ambiance), and context-free patterns (e.g., corners, potentially artifacts). up_block activations generally provide better spatial accuracy. cond features are better for predicting composition than uncond features.
  4. Time-Dependent Intervention Effectiveness:
    • Early Stage (t ≈ 0.6-1.0): Effective for controlling image composition via spatially targeted interventions. Global (style) interventions unexpectedly alter composition instead of style.
    • Middle Stage (t ≈ 0.2-0.6): Effective for controlling image style via global interventions without changing composition. Spatially targeted interventions fail to alter layout (it's already fixed) and cause distortions.
    • Final Stage (t ≈ 0.0-0.2): Both spatial and global interventions are largely ineffective, resulting only in minor textural changes.

Conclusions:

The paper successfully demonstrates that SAEs can uncover interpretable concepts within diffusion models and track their evolution. It reveals a distinct timeline for the emergence of visual properties: coarse composition solidifies very early, followed by stylistic elements in the middle stages, while the final stages focus on refinement. This understanding enables time-adaptive editing: manipulating layout early and style mid-process. The work suggests future directions in developing editing techniques tailored to these evolving representations, and in applying similar analyses to diffusion transformers, which could mitigate the intervention leakage that occurs through skip connections.
