Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension

Published 16 Feb 2023 in cs.CL, cs.CV, and cs.LG | (2302.09301v1)

Abstract: Prompting has become an important mechanism by which users can more effectively interact with many flavors of foundation model. Indeed, the last several years have shown that well-honed prompts can sometimes unlock emergent capabilities within such models. While there has been a substantial amount of empirical exploration of prompting within the community, relatively few works have studied prompting at a mathematical level. In this work we aim to take a first step towards understanding basic geometric properties induced by prompts in Stable Diffusion, focusing on the intrinsic dimension of internal representations within the model. We find that choice of prompt has a substantial impact on the intrinsic dimension of representations at both layers of the model which we explored, but that the nature of this impact depends on the layer being considered. For example, in certain bottleneck layers of the model, intrinsic dimension of representations is correlated with prompt perplexity (measured using a surrogate model), while this correlation is not apparent in the latent layers. Our evidence suggests that intrinsic dimension could be a useful tool for future studies of the impact of different prompts on text-to-image models.

Summary

The paper explores how text prompts influence the representation manifolds within Stable Diffusion by analyzing their intrinsic dimension at different layers.
Key findings show that prompt choice and denoising steps significantly affect intrinsic dimension, with bottleneck layers exhibiting a U-shaped dimension curve during denoising.
A correlation is found between prompt perplexity and the intrinsic dimension of bottleneck representations, suggesting higher-perplexity prompts lead to higher-dimensional internal representations.

This paper, "Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension," investigates how text prompts influence the internal representations within the Stable Diffusion text-to-image model. The authors aim to quantify the impact of prompting on the model's behavior by analyzing the intrinsic dimension of hidden activations at different layers.

The core idea is to treat the distributions of hidden activations as manifolds in high-dimensional space. By estimating the intrinsic dimension of these manifolds, the authors seek to understand how prompts shape the geometry of the model's internal representations. They focus on two specific locations within Stable Diffusion: the latent space (before the VAE decoder) and a bottleneck layer within the UNet component.

The paper outlines a mathematical framework where Stable Diffusion is viewed as a composition of layers, each transforming an input distribution. The "representation manifold" at a given layer is the result of pushing forward the product of noise and prompt distributions through the preceding layers. For a fixed prompt, this yields a submanifold. Since Stable Diffusion employs an iterative denoising process, each layer produces a sequence of manifolds indexed by the denoising step. The authors then study the intrinsic dimension of these manifolds for various prompts, layers, and denoising steps.
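The pushforward construction described above can be written compactly. The notation below is an assumption chosen for illustration and may differ from the paper's: $F_{\ell,t}$ stands for the map from a noise sample and a prompt to the activation at layer $\ell$ and denoising step $t$, $\mu_Z$ and $\mu_P$ are the noise and prompt distributions, and $n_\ell$ is the ambient dimension of layer $\ell$. Taking the manifold to be the support of the pushforward distribution is also an interpretive assumption.

```latex
\[
  \mathcal{D}_{\ell,t} = \left(F_{\ell,t}\right)_{\#}\left(\mu_{Z} \otimes \mu_{P}\right),
  \qquad
  \mathcal{D}_{\ell,t}(p) = \left(F_{\ell,t}(\cdot,\, p)\right)_{\#}\, \mu_{Z},
\]
\[
  M_{\ell,t}(p) = \operatorname{supp}\, \mathcal{D}_{\ell,t}(p) \subseteq \mathbb{R}^{n_\ell}.
\]
```

Under this reading, fixing a prompt $p$ selects one submanifold $M_{\ell,t}(p)$ per layer and denoising step, and the quantity studied is its intrinsic dimension as $p$, $\ell$, and $t$ vary.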

The experiments reveal several key findings:

Prompt Impact: Different prompts significantly affect the intrinsic dimension of the representation manifolds. Even semantically similar prompts can lead to noticeable variations in dimension.
Denoising Dynamics: The intrinsic dimension of the latent representation manifolds decreases monotonically during denoising. The intrinsic dimension of the bottleneck representation manifolds instead follows a U-shaped curve, first decreasing and then increasing. The authors provide evidence that the later increase can be partially attributed to the time step embedding vector within the UNet, which injects higher-frequency signals: even when the input to the UNet is frozen, the intrinsic dimension still climbs at later denoising steps.
Perplexity Correlation: A correlation is observed between the perplexity of a prompt (measured using a surrogate LLM) and the intrinsic dimension of the bottleneck representation manifold. Prompts with higher perplexity (i.e., less likely or more unusual prompts) tend to result in higher-dimensional manifolds. This is interpreted as out-of-distribution prompts creating noisier, higher-dimensional internal representations. No such correlation was found for the latent representations.
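For reference, perplexity is conventionally the exponential of the negative mean log-probability per token. The sketch below shows only that arithmetic. It assumes the per-token natural-log probabilities have already been obtained from some surrogate language model; the summary does not say which model or tokenization the authors used, and the function name and sample values here are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)). The log-probabilities are assumed to come
    from a surrogate language model that is not specified here.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up values for illustration only (not data from the paper).
common_prompt = [math.log(0.25)] * 8   # every token assigned probability 0.25
unusual_prompt = [math.log(0.01)] * 8  # every token assigned probability 0.01

print(perplexity(common_prompt), perplexity(unusual_prompt))
```

A prompt whose tokens the surrogate model finds less likely gets a higher perplexity, which is the sense in which high-perplexity prompts are described above as unusual.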

The authors use two intrinsic dimension estimators: Maximum Likelihood Estimation (MLE) and Two-Nearest Neighbors (TwoNN). They note that the two methods yield more consistent results for latent representations than for bottleneck representations, possibly indicating a more complex geometry in the latter.
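To make the two estimators concrete, here is a minimal NumPy sketch of both, run on synthetic data rather than on Stable Diffusion activations. It is not the authors' implementation. The TwoNN function uses the closed-form maximum-likelihood variant (number of points divided by the sum of log ratios of second to first neighbor distances) instead of a line fit, the MLE function follows the Levina-Bickel form with a hypothetical default of k = 20 neighbors, and all function names are chosen for illustration. Both assume no duplicate points, since a zero nearest-neighbor distance would break the ratios.

```python
import numpy as np


def sorted_neighbor_distances(x, k):
    """Distances from each row of x to its k nearest other rows, ascending."""
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, np.inf)    # a point is not its own neighbor
    return np.sqrt(np.sort(sq_dists, axis=1)[:, :k])


def twonn_dimension(x):
    """TwoNN estimate (maximum-likelihood variant; assumes distinct points)."""
    r = sorted_neighbor_distances(x, 2)
    mu = r[:, 1] / r[:, 0]                # ratio of 2nd to 1st neighbor distance
    return float(len(mu) / np.sum(np.log(mu)))


def mle_dimension(x, k=20):
    """Levina-Bickel style MLE, averaged over points (k is an assumed default)."""
    r = sorted_neighbor_distances(x, k)
    log_ratios = np.log(r[:, -1:] / r[:, :-1])  # log(T_k / T_j) for j = 1..k-1
    per_point = (k - 1) / np.sum(log_ratios, axis=1)
    return float(np.mean(per_point))


# Synthetic check: a flat 2-D sheet embedded isometrically in 10-D space.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # 10x2, orthonormal columns
points = rng.uniform(size=(1000, 2)) @ basis.T

print("TwoNN:", twonn_dimension(points))
print("MLE:  ", mle_dimension(points))
```

On this flat sheet both estimates should land near 2 even though the ambient dimension is 10. On curved or noisy data the two can diverge, which is consistent with the observation above that they agree less on bottleneck representations.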

The conclusion emphasizes that the choice of prompt has a substantial impact on the intrinsic dimension of internal representations, especially at bottleneck layers, and that this intrinsic dimension is correlated with the perplexity of the prompt. The authors suggest that this work is a step toward a better geometric understanding of learning in text-to-image models.

The paper also includes an appendix with details of the experimental setup and the intrinsic dimension estimators, along with a brief discussion of how the findings relate to neural scaling laws and the intrinsic dimension of the data.