Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
This presentation explores a fundamental geometric bottleneck in diffusion transformers operating on latent spaces from representation encoders like DINOv2. Rather than requiring expensive model scaling, the authors demonstrate that standard architectures fail due to geometric interference—Euclidean flow matching violates the hyperspherical manifold structure of normalized features. By introducing Riemannian Flow Matching with Jacobi Regularization, they enable efficient convergence and state-of-the-art image generation with standard transformer architectures, fundamentally reframing the problem from capacity constraints to geometric alignment.
What if the reason your diffusion model fails to learn isn't about having too few parameters, but about traveling through the wrong geometric space? This paper reveals that standard diffusion transformers collapse on representation encoder latents not due to capacity limits, but because they violate the fundamental geometry where semantic information actually lives.
Building on that geometric insight, let's examine why standard approaches fail.
When researchers deployed diffusion transformers on frozen representation encoder spaces, they encountered systematic convergence failures. The prevailing explanation pointed to model capacity, but this paper identifies a deeper geometric cause: standard Euclidean objectives force models to learn trajectories that violate the intrinsic manifold structure.
This geometric violation has a precise mathematical signature.
This figure provides the smoking-gun evidence. When the authors trained models of varying widths using standard Euclidean objectives, narrow models collapsed and semantic learning stalled. But when they masked out the radial component and optimized only the angular loss, even models narrower than the data dimensionality converged perfectly. This decisive experiment proves the bottleneck isn't capacity—it's the geometric conflict between Euclidean objectives and hyperspherical structure.
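The radial/angular split behind this experiment can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name and the choice of one-minus-cosine as the angular penalty are assumptions for exposition.

```python
import numpy as np

def angular_loss(pred, target):
    """Illustrative angular-only objective: compare directions only,
    masking out the radial (norm) component of the prediction."""
    pred_dir = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_dir = target / np.linalg.norm(target, axis=-1, keepdims=True)
    # 1 - cosine similarity penalizes angular error and nothing else.
    return np.mean(1.0 - np.sum(pred_dir * tgt_dir, axis=-1))

pred = np.array([[2.0, 0.0]])      # correct direction, wrong norm
target = np.array([[1.0, 0.0]])
print(angular_loss(pred, target))  # 0.0: the radial error is invisible to this loss
```

A standard Euclidean MSE would penalize the norm mismatch above; masking it out is what let the narrow models in the experiment converge.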
The geometric mismatch becomes clear when we compare these spaces. Representation encoders produce features that live on a tight hypersphere due to normalization, with all semantic information encoded as angles. Standard diffusion models, however, assume Euclidean structure and create linear paths that slice through the sphere's interior—a region completely devoid of valid semantic vectors.
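The off-manifold behavior of straight-line paths is easy to verify numerically. The sketch below (illustrative, with an arbitrary feature dimension of 768) compares the midpoint of a Euclidean linear interpolation against spherical linear interpolation (slerp) between two unit-norm features:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Two random unit vectors standing in for normalized encoder features.
x0, x1 = unit(rng.normal(size=768)), unit(rng.normal(size=768))

# Euclidean straight line: its midpoint cuts through the sphere's interior.
mid_lerp = 0.5 * x0 + 0.5 * x1
print(np.linalg.norm(mid_lerp))   # < 1: off the manifold

# Spherical linear interpolation: the midpoint stays on the unit sphere.
theta = np.arccos(np.clip(x0 @ x1, -1.0, 1.0))
mid_slerp = (np.sin(0.5 * theta) * x0 + np.sin(0.5 * theta) * x1) / np.sin(theta)
print(np.linalg.norm(mid_slerp))  # ~1: on the manifold
```

In high dimensions random unit vectors are nearly orthogonal, so the linear midpoint's norm drops toward 1/√2, deep inside the region with no valid semantic vectors.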
To resolve this, the authors reformulate diffusion on the true manifold.
The proposed solution, Riemannian Flow Matching with Jacobi Regularization, elegantly addresses both issues. By using spherical linear interpolation, all probability flow trajectories remain on the manifold, eliminating norm collapse entirely. The Jacobi regularization then accounts for how curvature causes errors to propagate differently along geodesics, weighting the optimization to focus where geometric constraints are tightest.
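The geodesic interpolant at the core of this construction can be sketched as follows. This is a simplified NumPy illustration of slerp and its time derivative (the target vector field for flow matching on the sphere); it omits the Jacobi regularization weighting, whose exact form is given in the paper, and the function names are illustrative.

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic interpolation on the unit hypersphere."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

def geodesic_velocity(x0, x1, t):
    """Time derivative of the slerp path: a sketch of the regression
    target for flow matching on the sphere (not the paper's exact form)."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    return theta * (-np.cos((1 - t) * theta) * x0 + np.cos(t * theta) * x1) / np.sin(theta)

rng = np.random.default_rng(1)
x0 = rng.normal(size=8); x0 /= np.linalg.norm(x0)
x1 = rng.normal(size=8); x1 /= np.linalg.norm(x1)

t = 0.3
xt = slerp(x0, x1, t)               # point on the manifold at time t
vt = geodesic_velocity(x0, x1, t)   # target velocity at that point
# The velocity lies in the tangent space: orthogonal to the current point.
print(np.dot(xt, vt))               # ~0
```

Because every interpolated point has unit norm and every target velocity is tangent to the sphere, a model regressing onto this field never needs to represent off-manifold states, which is precisely what eliminates the norm collapse described above.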
This geometric alignment delivers remarkable empirical gains.
The results are striking across the board. Standard-width Diffusion Transformers, which previously failed to converge, now achieve state-of-the-art image generation quality when trained with Riemannian Flow Matching. The method generalizes across different representation encoders and architectural variants, consistently outperforming both Euclidean baselines and previous approaches that relied on expensive model scaling.
The qualitative results showcase the semantic richness and visual fidelity achieved by models trained with the geometric approach. These generations from a DiT-XL model demonstrate accurate class-conditional synthesis with high diversity, all achieved with a standard architecture that would have collapsed under traditional Euclidean training.
This work fundamentally reframes latent diffusion as a geometric problem, proving that respecting manifold structure—not increasing model capacity—is the key to unlocking efficient, high-fidelity generation. To dive deeper into how geometric alignment transforms generative modeling, visit EmergentMind.com.