Probability Density Geodesics in Image Diffusion Latent Space (2504.06675v2)

Published 9 Apr 2025 in cs.CV

Abstract: Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.

Summary

  • The paper presents a novel framework that computes latent space geodesics by minimizing a probability-density weighted path length.
  • It leverages a Riemannian manifold approach with score distillation and gradient descent to solve the derived Euler-Lagrange geodesic equation.
  • The method enables smooth image interpolation and extrapolation with competitive performance, all without task-specific training.

This paper, "Probability Density Geodesics in Image Diffusion Latent Space" (2504.06675), introduces a method for computing geodesics (shortest paths) within the latent space of pre-trained image diffusion models like Stable Diffusion. The core idea is to define a distance metric where paths traversing high-probability density regions (corresponding to realistic images) are considered "shorter" than equivalent paths through low-probability regions. This allows for the generation of smooth and plausible image sequences between points in the latent space.

Key Concepts and Theory:

  1. Riemannian Manifold: The diffusion model's latent space is treated as a Riemannian manifold.
  2. Probability-Weighted Metric: A spatially-varying inner product is defined whose induced norm is inversely proportional to the probability density $p(\gamma)$ at a point $\gamma$ in the latent space: $\|\dot{\gamma}\|_K = \|\dot{\gamma}\|_2 / p(\gamma)$.
  3. Path Length: The length of a path $\gamma(t)$ from $t=a$ to $t=b$ is $S[\gamma] = \int_a^b L(t, \gamma, \dot{\gamma})\, dt$, where the Lagrangian is $L = \|\dot{\gamma}(t)\| / p(\gamma(t))$. Minimizing this length yields the geodesic.
  4. Geodesic Equation (Euler-Lagrange): Minimizing the path length leads to a second-order ordinary differential equation (ODE) that governs the geodesic path: $\ddot{\gamma} + \|\dot{\gamma}\|^2 (I - \hat{\dot{\gamma}}\hat{\dot{\gamma}}^T) \nabla \log p(\gamma) = 0$. This equation links the path's acceleration $\ddot{\gamma}$ to the score function $\nabla \log p(\gamma)$, which diffusion models estimate.
  5. Functional Derivative: The paper derives the functional derivative $\frac{\delta S}{\delta \gamma}$ of the path-length functional. This gradient is essential for numerically optimizing a path towards a geodesic using gradient descent (a minimal numerical sketch follows this list): $\frac{\delta S}{\delta \gamma} = \frac{-1}{p(\gamma)\|\dot{\gamma}\|} \left( (I - \hat{\dot{\gamma}}\hat{\dot{\gamma}}^T) \nabla \log p(\gamma) + \frac{\ddot{\gamma}}{\|\dot{\gamma}\|^2} \right)$.
  6. Analysis Tools: Methods are provided to compute:
    • Relative Log Probability: $\log \tilde{p}_a(\gamma(b)) = \int_a^b \dot{\gamma}(t)^T \nabla \log p(\gamma(t))\, dt$, approximated numerically (e.g., trapezoidal rule).
    • Geodesic Distance: $\tilde{d}_a(b) = \int_a^b \frac{\|\dot{\gamma}(t)\|}{\tilde{p}_a(\gamma(t))}\, dt$, also approximated numerically.
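
The minimal PyTorch sketch below illustrates how these quantities can be discretized in practice. It is not the authors' code: `score_fn` (a callable returning $\nabla \log p$ for a batch of points) is assumed given, the path is a `(T, D)` tensor of flattened latents with uniform parameter spacing `dt`, and the positive prefactor $1/(p(\gamma)\|\dot{\gamma}\|)$ of the functional derivative is folded into the step size.

```python
# Minimal sketch (not the paper's code): discretized functional-derivative
# direction and a trapezoidal relative log probability for a path of latents.
import torch

def functional_derivative(path, score_fn, dt):
    """Descent direction ~ delta S / delta gamma at the interior points of a (T, D) path.
    The positive prefactor 1/(p(gamma) * ||gamma_dot||) is omitted (absorbed into the step size)."""
    vel = (path[2:] - path[:-2]) / (2 * dt)                  # central-difference velocity
    acc = (path[2:] - 2 * path[1:-1] + path[:-2]) / dt**2    # central-difference acceleration
    score = score_fn(path[1:-1])                             # ∇ log p(γ) at interior points
    v_hat = vel / vel.norm(dim=-1, keepdim=True)
    proj_score = score - (score * v_hat).sum(-1, keepdim=True) * v_hat  # (I - v̂ v̂ᵀ) ∇ log p
    return -(proj_score + acc / vel.norm(dim=-1, keepdim=True) ** 2)

def relative_log_prob(path, score_fn, dt):
    """Trapezoidal approximation of ∫ γ̇(t)ᵀ ∇ log p(γ(t)) dt along the path."""
    vel = torch.gradient(path, spacing=dt, dim=0)[0]         # finite-difference velocities
    integrand = (vel * score_fn(path)).sum(-1)               # γ̇ᵀ ∇ log p at each sample
    return torch.trapezoid(integrand, dx=dt)
```

In the paper, velocities and accelerations are obtained from a natural cubic spline fit to control points on great-circle arcs (see the implementation details below) rather than from simple finite differences; the sketch uses finite differences only for brevity.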

Implementation Details:

  1. Model: Uses the latent space of a pre-trained Stable Diffusion v2.1-base model ($4 \times 64 \times 64$ dimensions). Images are mapped to/from this space using the VAE encoder $\mathcal{E}$ and decoder $\mathcal{D}$.
  2. Score Estimation: Instead of directly using the diffusion model's noise prediction $\epsilon_\theta$, the method employs a noise-free score distillation estimate $\phi(x \mid z, \tau)$ (see the equations below) to approximate the score function $\nabla \log p(x \mid z, \tau)$. This is found to be more robust, especially for out-of-distribution points encountered during path optimization. It leverages classifier-free guidance and, optionally, negative prompts. A hedged code sketch of this estimate appears after this list.
     $\nabla\log p(x \mid z, \tau) \approx \beta\, \phi(x \mid z, \tau)$, where
     $\phi(x \mid z, \tau) = \mathbb{E}_{\tau' \in \mathcal{R}_{\tau}}\, w(\tau') \left( \sigma\, d(x_{\tau'} \mid z) - d(x_{\tau'} \mid z_{\text{neg}}) \right)$
     and $d(x_{\tau} \mid z) = \epsilon_\theta(x_{\tau} \mid \varnothing) - \epsilon_\theta(x_{\tau} \mid z)$.
  3. Operating at Noise Level $\tau$: All geodesic computations and score estimations are performed at a fixed, non-zero diffusion timestep $\tau$ (e.g., $\tau=600$). This smooths the probability landscape and improves score estimates. Initial/final latents are obtained using DDIM forward/backward inversion.
  4. Conditional Density: Handles text conditioning $z$ (CLIP embeddings). For paths between $z_0$ and $z_1$, it uses linear interpolation: $\zeta(t) = (1 - t) z_0 + t z_1$. Text embeddings $z_0, z_1$ are obtained via text inversion, potentially with fine-tuning.
  5. High-Dimensional Optimization:
    • Sphere Constraint: Latent vectors are assumed to lie near a sphere. Path points are reprojected onto the sphere after each update, and gradients/updates are projected onto the tangent space.
    • Path Parameterization: The path $\gamma$ is represented by discrete control points connected by piecewise spherical linear interpolation (great-circle arcs). Velocities $\dot{\gamma}$ and accelerations $\ddot{\gamma}$ needed for the functional derivative are estimated by fitting a natural cubic spline to the control points.
    • BVP Solver (Algorithm 1 - Interpolation):
      • Takes start/end images and prompts.
      • Encodes images/prompts to latent-space endpoints $x_0, x_1$ and embeddings $z_0, z_1$.
      • Initializes the path as a great circle between $x_0$ and $x_1$.
      • Optimizes control points using gradient descent with the sphere-projected functional derivative and the score estimate $\phi$ (a hedged sketch of this loop appears after this list).
      • Employs a coarse-to-fine strategy, increasing the number of control points during optimization (e.g., 1, 3, 7, 15 points).
      • Decodes the final path points back to images.
    • IVP Solver (Algorithm 2 - Extrapolation):
      • Takes a starting image/prompt and a target prompt.
      • Encodes the start state ($x_0, z_0$) and the target prompt ($z_1$).
      • Determines an initial velocity $\dot{x}_0$ (e.g., pointing towards the target prompt distribution).
      • Integrates the geodesic ODE projected onto the sphere's tangent space using RK4: $\ddot{\gamma} = -\|\dot{\gamma}\|^2 ( I - \hat{\gamma} \hat{\gamma}^T ) ( I - \hat{\dot{\gamma}} \hat{\dot{\gamma}}^T ) \nabla \log p(\gamma)$ (a hedged RK4 sketch appears after this list).
      • Decodes path points to images.
  6. Hyperparameters: Key parameters include the score scaling $\beta$, the operating diffusion timestep $\tau$, the timestep range $\Delta\tau$ for score averaging, and the CFG scale $\sigma$. Ablations show their impact on path directness vs. probability alignment and image quality. Default values: $\tau=600$, $\Delta\tau=100$, $\beta=0.002$, $\sigma=1$.
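
To make items 2 and 5 above concrete, the following sketches show, under stated assumptions, how the pieces could be wired together with a diffusers-style Stable Diffusion UNet. First, a hedged sketch of the score-distillation estimate from item 2: the score is approximated by averaging classifier-free-guidance noise differences over a window of timesteps around $\tau$, with uniform weights $w(\tau')$ assumed. Function names, the treatment of $x$ as the noisy latent at each sampled timestep, and the sampling of $\mathcal{R}_\tau$ are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the noise-free score-distillation estimate (item 2 above).
# Assumes a diffusers UNet2DConditionModel and pre-computed CLIP text embeddings.
import torch

@torch.no_grad()
def score_estimate(unet, x, z_cond, z_uncond, z_neg, tau, delta_tau,
                   sigma=1.0, beta=0.002, n_samples=4):
    """Approximate ∇ log p(x | z, τ) ≈ β φ(x | z, τ) by averaging CFG noise
    differences over timesteps τ' sampled around τ (uniform weights w(τ') assumed)."""
    taus = torch.randint(tau - delta_tau, tau + delta_tau + 1, (n_samples,))
    acc = torch.zeros_like(x)
    for t in taus:
        t_batch = torch.full((x.shape[0],), int(t), device=x.device, dtype=torch.long)
        # x is treated as the noisy latent at each sampled timestep (a simplification).
        eps_uncond = unet(x, t_batch, encoder_hidden_states=z_uncond).sample
        eps_cond = unet(x, t_batch, encoder_hidden_states=z_cond).sample
        eps_neg = unet(x, t_batch, encoder_hidden_states=z_neg).sample
        d_cond = eps_uncond - eps_cond            # d(x_τ' | z)
        d_neg = eps_uncond - eps_neg              # d(x_τ' | z_neg)
        acc += sigma * d_cond - d_neg
    return beta * acc / n_samples
```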
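
Next, a hedged sketch of the BVP (interpolation) loop from item 5: a great-circle path between the endpoints is refined coarse-to-fine by projected gradient descent on the sphere, using a `grad_fn` such as the `functional_derivative` sketch given earlier. Latents are assumed flattened to 1-D vectors; step sizes and iteration counts are illustrative, not the paper's settings.

```python
# Hedged sketch of the coarse-to-fine BVP optimization on the sphere (item 5 above).
import torch

def slerp(x0, x1, t):
    """Spherical linear interpolation between two flattened latents of equal norm."""
    x0n, x1n = x0 / x0.norm(), x1 / x1.norm()
    omega = torch.arccos((x0n * x1n).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * omega) * x0 + torch.sin(t * omega) * x1) / torch.sin(omega)

def upsample(path):
    """Insert a midpoint on the great-circle arc between each pair of path points."""
    out = []
    for a, b in zip(path[:-1], path[1:]):
        out += [a, slerp(a, b, 0.5)]
    return torch.stack(out + [path[-1]])

def optimize_path(x0, x1, grad_fn, n_levels=4, steps=50, lr=0.1):
    radius = x0.norm()                                 # latents assumed to lie near a sphere
    path = torch.stack([x0, slerp(x0, x1, 0.5), x1])   # start with 1 interior control point
    for level in range(n_levels):                      # 1, 3, 7, 15 interior points
        for _ in range(steps):
            ctrl = path[1:-1]
            g = grad_fn(path)                          # δS/δγ at the interior points
            g_tan = g - (g * ctrl).sum(-1, keepdim=True) * ctrl / radius**2  # tangent projection
            ctrl = ctrl - lr * g_tan
            ctrl = radius * ctrl / ctrl.norm(dim=-1, keepdim=True)           # reproject to sphere
            path = torch.cat([x0[None], ctrl, x1[None]])
        if level < n_levels - 1:
            path = upsample(path)                      # coarse-to-fine refinement
    return path
```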
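
Finally, a hedged sketch of the IVP (extrapolation) integration from item 5: the sphere-projected geodesic ODE $\ddot{\gamma} = -\|\dot{\gamma}\|^2 (I - \hat{\gamma}\hat{\gamma}^T)(I - \hat{\dot{\gamma}}\hat{\dot{\gamma}}^T) \nabla \log p(\gamma)$ is integrated with classic RK4 over the state $(\gamma, \dot{\gamma})$. The `score_fn` is the same assumed score estimate as above; the step count and step size are illustrative.

```python
# Hedged sketch of RK4 integration of the sphere-projected geodesic ODE (item 5 above).
import torch

def geodesic_acc(x, v, score_fn, radius):
    """γ̈ = -||γ̇||² (I - γ̂ γ̂ᵀ)(I - v̂ v̂ᵀ) ∇ log p(γ) for flattened 1-D latents."""
    s = score_fn(x[None])[0]
    v_hat = v / v.norm()
    s = s - (s @ v_hat) * v_hat                    # remove the component along the velocity
    x_hat = x / radius
    s = s - (s @ x_hat) * x_hat                    # project onto the sphere's tangent space
    return -v.norm() ** 2 * s

def integrate_ivp(x0, v0, score_fn, n_steps=30, dt=1.0 / 30):
    """Classic RK4 on the first-order system (γ, γ̇), starting from (x0, v0)."""
    radius = x0.norm()
    x, v, path = x0.clone(), v0.clone(), [x0.clone()]
    for _ in range(n_steps):
        k1x, k1v = v, geodesic_acc(x, v, score_fn, radius)
        k2x = v + 0.5 * dt * k1v
        k2v = geodesic_acc(x + 0.5 * dt * k1x, k2x, score_fn, radius)
        k3x = v + 0.5 * dt * k2v
        k3v = geodesic_acc(x + 0.5 * dt * k2x, k3x, score_fn, radius)
        k4x = v + dt * k3v
        k4v = geodesic_acc(x + dt * k3x, k4x, score_fn, radius)
        x = x + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v = v + dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        x = radius * x / x.norm()                  # keep the latent on the sphere
        path.append(x.clone())
    return torch.stack(path)
```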

Applications and Results:

  1. Video Analysis: Synthetic video clips were found to approximately follow geodesics in the Stable Diffusion latent space, suggesting the learned space captures natural motion to some extent.
  2. Image Interpolation (BVP): The method provides a training-free approach for image morphing. Quantitative results on datasets like MorphBench show it achieves competitive performance compared to state-of-the-art methods (including those requiring fine-tuning), particularly in terms of path directness (PPL) and smoothness (PDV). Qualitative results show smooth transitions for object animations and metamorphoses.
  3. Image Extrapolation (IVP): The method can generate image sequences continuing from a starting image based on a target prompt. Qualitative results demonstrate generating plausible future frames, though finding a good initial velocity is noted as a challenge.

Practical Considerations:

  • Training-Free: The core method works with a pre-trained diffusion model without requiring task-specific training or fine-tuning (though text inversion fine-tuning is used).
  • Computational Cost: Relies on repeated calls to the diffusion model's UNet (for score estimation) during path optimization. The paper uses techniques like coarse-to-fine optimization to manage this.
  • Limitations:
    • Standard metrics (FID, PPL, etc.) don't fully capture interpolation quality.
    • Can struggle with large semantic gaps or camera motions between endpoints (potential local optima issue).
    • Relies on score distillation as a proxy for the true score.
    • Finding reliable initial velocities for extrapolation is difficult.

In summary, the paper provides a theoretically grounded and practically implemented framework for navigating the latent space of diffusion models along paths of maximum probability density. It offers competitive training-free solutions for image interpolation and a novel approach for extrapolation, demonstrated using Stable Diffusion. The provided algorithms and analysis tools can be valuable for developers working on generative image sequence tasks or studying the structure of diffusion latent spaces.
