
REPA-E: End-to-End Tuning in Diffusion Models

Updated 14 April 2026
  • REPA-E is a training paradigm for latent diffusion transformers that jointly tunes the VAE tokenizer and diffusion model using a representation-alignment loss.
  • It substantially accelerates convergence, achieving competitive generative quality in as few as 80 epochs compared to traditional approaches requiring over 1400 epochs.
  • Innovative components like BatchNorm, projection heads, and stopgrad strategies prevent latent collapse and ensure robust, high-quality generation.

REPA-E (Representation Alignment for End-to-End Tuning) is a training paradigm for latent diffusion transformers that enables joint optimization of the Variational Autoencoder (VAE) tokenizer and the diffusion model by leveraging a representation-alignment loss rather than the standard diffusion loss. The REPA-E strategy circumvents failure modes commonly encountered in end-to-end (E2E) tuning while substantially accelerating convergence and boosting generative quality. It generalizes the REPresentation Alignment (REPA) framework by backpropagating alignment-based objectives into both the VAE and diffusion model, providing a principled means to unlock E2E trainability in large-scale generative pipelines (Leng et al., 14 Apr 2025, Wang et al., 25 Jan 2026).

1. Origins and Motivation

Latent diffusion pipelines traditionally adopt a two-stage approach: first, a VAE is separately trained to convert pixel data into a continuous latent space through reconstruction and KL regularization; then a diffusion model is trained with the VAE's encoder frozen, operating exclusively on these latents. Empirical attempts to propagate standard diffusion losses directly into the VAE encoder during E2E tuning cause latent collapse: the encoder minimizes denoising difficulty by destroying informative content, drastically degrading Fréchet Inception Distance (FID) and other generative benchmarks. Table 1 from (Leng et al., 14 Apr 2025) summarizes this breakdown, showing latent spatial variance reduced by orders of magnitude and FID deteriorating from 7.9 (fixed-VAE, REPA) to 62.3 (naïve diffusion-loss E2E).

REPA-E addresses this by replacing the naïve loss with a representation-alignment objective, aligning intermediate diffusion transformer representations with those of a pretrained visual encoder. This approach enables gradients to flow through both model components without incentivizing latent destruction.

2. REPA-E Objective and Mathematical Formulation

The REPA-E loss is a composite objective, combining traditional diffusion training, representation alignment, and explicit VAE regularization:

$$\mathcal L_{\rm REPA\!-\!E}(\theta,\phi,\omega) = \underbrace{\mathcal L_{\rm diff}\bigl(\theta\;;\;\mathrm{stopgrad}(\phi)\bigr)}_{\text{diffusion loss only updates }\theta} \;+\;\lambda\,\mathcal L_{\rm REPA}(\theta,\phi,\omega) \;+\;\eta\,\mathcal L_{\rm REG}(\phi)$$

where:

  • $\theta$: diffusion model parameters
  • $\phi$: VAE parameters
  • $\omega$: parameters of the linear or MLP projection head for representation alignment

Terms:

  • $\mathcal L_{\rm diff}$: velocity or $\epsilon$-prediction loss, gradients stopped w.r.t. $\phi$
  • $\mathcal L_{\rm REPA}$: patchwise cosine similarity between a projected diffusion feature and a fixed pretrained encoder feature on the clean image,

$$\mathcal L_{\rm REPA} = -\,\mathbb E_{x,\epsilon,t}\Bigl[\tfrac1N\sum_{i=1}^N \mathrm{sim}\bigl(y_i,\;h_\omega(h_t)_i\bigr)\Bigr]$$

with $y=f(x)$ (external vision encoder, e.g., DINOv2); $h_t$ are diffusion transformer hidden states.

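  • $\mathcal L_{\rm REG}$: VAE regularization, typically a combination of reconstruction (MSE, LPIPS), GAN, and KL losses
  • $\lambda, \eta$: hyperparameters controlling the trade-offs between the alignment and regularization terms (specific values are given in (Leng et al., 14 Apr 2025))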
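
To make the alignment term concrete, here is a minimal sketch of the negative mean patchwise cosine similarity for a single sample. It assumes features are plain Python lists of patch vectors; the names `cosine` and `repa_loss` are illustrative and are not taken from the paper's code, and the expectation over images, noise, and timesteps is left out.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)

def repa_loss(targets, projected):
    """Negative mean patchwise cosine similarity for one sample.

    targets:   N patch features y_i from the frozen external encoder
    projected: N projected diffusion features h_omega(h_t)_i
    """
    if len(targets) != len(projected):
        raise ValueError("both inputs need the same number of patches")
    sims = [cosine(y, h) for y, h in zip(targets, projected)]
    return -sum(sims) / len(sims)

# Hypothetical toy features: two patches, two dimensions each.
y = [[1.0, 0.0], [0.0, 2.0]]
print(repa_loss(y, y))  # -1.0, perfectly aligned features
```

The loss reaches its minimum of -1 when every projected patch feature points in the same direction as its target, regardless of magnitude.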

The key insight is the “stopgrad” on the VAE encoder in the diffusion loss, directing alignment-based updates to shape VAE latents without risking collapse.
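
Reading the gradients off the composite objective makes the routing explicit. This is a direct consequence of the formula as stated in this section (the stopgrad removes the $\phi$-dependence of the diffusion term, $\omega$ appears only in the alignment term, and $\mathcal L_{\rm REG}$ depends only on $\phi$); it is not a derivation quoted from the paper:

$$
\begin{aligned}
\nabla_\theta \mathcal L_{\rm REPA\!-\!E} &= \nabla_\theta \mathcal L_{\rm diff} + \lambda\,\nabla_\theta \mathcal L_{\rm REPA},\\
\nabla_\phi \mathcal L_{\rm REPA\!-\!E} &= \lambda\,\nabla_\phi \mathcal L_{\rm REPA} + \eta\,\nabla_\phi \mathcal L_{\rm REG},\\
\nabla_\omega \mathcal L_{\rm REPA\!-\!E} &= \lambda\,\nabla_\omega \mathcal L_{\rm REPA}.
\end{aligned}
$$

The VAE therefore never receives a signal that rewards making denoising easier; it is shaped only by alignment and by its own regularization.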

3. Architectural and Implementation Developments

The REPA-E framework incorporates several specialized modifications:

  • BatchNorm Layer: Inserted between the VAE output and diffusion input, normalizing latents without affine parameters and using differentiable running moments. This avoids non-differentiable dataset-wide statistics while preserving gradient flow into the VAE parameters $\phi$.
  • Projection Head: A small, often single-layer (or deeper in VAE-REPA (Wang et al., 25 Jan 2026)) MLP mapping diffusion transformer activations to the target encoder's dimensionality.
  • Alignment Depth: Loss applied at an intermediate transformer block (e.g., block 8 for SiT-XL/2). Ablations show early or mid-layer alignment yields optimal FID; too deep harms texture retention.
  • VAE Regularization: Combination of MSE, LPIPS, GAN, and KL losses attached to the VAE decoder.
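
The BatchNorm component can be illustrated with a small sketch of affine-free normalization. This is a simplified assumption of how such a layer behaves, not the authors' implementation: it normalizes each channel with the current batch statistics and tracks running moments with a momentum update, and it flattens away spatial dimensions. The function name and the `momentum` and `eps` defaults are hypothetical.

```python
def batchnorm_latents(batch, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Normalize each latent channel with batch statistics, no scale or shift.

    batch: list of samples, each a list of C channel values.
    Returns (normalized_batch, new_running_mean, new_running_var).
    """
    n = len(batch)
    channels = len(batch[0])
    # Per-channel batch mean and (biased) variance.
    mean = [sum(s[c] for s in batch) / n for c in range(channels)]
    var = [sum((s[c] - mean[c]) ** 2 for s in batch) / n for c in range(channels)]
    # Affine-free normalization: there is no learned gamma or beta.
    normalized = [
        [(s[c] - mean[c]) / (var[c] + eps) ** 0.5 for c in range(channels)]
        for s in batch
    ]
    # Exponential moving average of the moments, for use at inference time.
    new_mean = [(1 - momentum) * rm + momentum * m for rm, m in zip(running_mean, mean)]
    new_var = [(1 - momentum) * rv + momentum * v for rv, v in zip(running_var, var)]
    return normalized, new_mean, new_var

# Hypothetical batch of two samples with two latent channels.
out, rm, rv = batchnorm_latents([[0.0, 10.0], [2.0, 30.0]], [0.0, 0.0], [1.0, 1.0])
```

Because the normalization is computed from the batch itself, it stays differentiable with respect to the encoder output, which is the property the list above relies on.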

Pseudocode for a typical training step is given in (Leng et al., 14 Apr 2025).
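
As a rough illustration of how the three loss terms combine in one update, the sketch below applies a plain SGD step with REPA-E style gradient routing. It is not the paper's pseudocode: it assumes per-loss gradients have already been computed by an autodiff framework, and the function name, the dictionary layout, and all numeric values are hypothetical.

```python
def repa_e_step(params, grads, lam, eta, lr):
    """One SGD update with REPA-E style gradient routing.

    params: {"theta": [...], "phi": [...], "omega": [...]}
    grads:  per-loss gradients, e.g. grads["diff"]["theta"], assumed to be
            computed elsewhere.
    """
    def combine(group, weighted_terms):
        total = [0.0] * len(params[group])
        for weight, loss_name in weighted_terms:
            for i, g in enumerate(grads[loss_name][group]):
                total[i] += weight * g
        return total

    routed = {
        # Diffusion model: diffusion loss plus weighted alignment loss.
        "theta": combine("theta", [(1.0, "diff"), (lam, "repa")]),
        # VAE: alignment and regularization only. The diffusion-loss
        # gradient w.r.t. phi is never read, which mimics the stopgrad.
        "phi": combine("phi", [(lam, "repa"), (eta, "reg")]),
        # Projection head: alignment loss only.
        "omega": combine("omega", [(lam, "repa")]),
    }
    return {
        group: [p - lr * g for p, g in zip(params[group], routed[group])]
        for group in params
    }

# Hypothetical scalar parameters and gradients.
params = {"theta": [1.0], "phi": [1.0], "omega": [1.0]}
grads = {
    "diff": {"theta": [1.0], "phi": [100.0]},  # the phi entry is ignored
    "repa": {"theta": [2.0], "phi": [2.0], "omega": [2.0]},
    "reg": {"phi": [4.0]},
}
new_params = repa_e_step(params, grads, lam=0.5, eta=1.0, lr=0.1)
```

Even though the toy diffusion loss has a large gradient with respect to `phi`, the VAE update is unaffected by it, which is the behavior the stopgrad is meant to guarantee.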

4. Empirical Performance and Experimental Results

Comprehensive benchmarks on ImageNet 256×256 reveal that REPA-E achieves:

  • Training Speed: Up to 45× faster convergence over vanilla two-stage; typically, REPA-E reaches gFID=4.07 in 80 epochs, while vanilla requires 1400 epochs and REPA (external encoder but fixed VAE) needs 800 epochs for gFID=5.9 (Leng et al., 14 Apr 2025).
  • Generative Quality:
    • FID (w/ guidance): 1.26 (REPA-E at 800 epochs)
    • FID (w/o guidance): 1.83
    • Consistent reductions in FID and gains in linear-probe accuracy across model scales and encoder choices
  • Model Robustness: REPA-E provides 25–44% FID reduction with various VAE architectures (SD-VAE, IN-VAE, VA-VAE) and is compatible with multiple external encoders (DINOv2, CLIP-L, I-JEPA-H). The approach is robust to choices of alignment weight $\lambda$ in [0.25, 1.0].
  • Ablation Insights: Ablating VAE regularization or stopgrad induces sharp performance degradation (gFID rises from 16.3 to 444.1). BatchNorm is essential for stable convergence (Leng et al., 14 Apr 2025).

Table: Sample FID and training speed results (condensed from (Leng et al., 14 Apr 2025)).

| Training Recipe | Training epochs | Final FID (no CFG) | Notes |
|---|---|---|---|
| Vanilla (LDM) | 1400 | 7.90 | Two-stage, fixed VAE |
| REPA | 800 | 5.90 | External encoder, fixed VAE |
| REPA-E | 80 | 4.07 | E2E trainable, external encoder |

5. VAE-REPA: Alignment Without External Encoders

VAE-REPA (also termed REPA-E in (Wang et al., 25 Jan 2026)) eliminates the reliance on any external pretrained encoder. Instead, it utilizes pre-trained VAE latents as the target features for alignment. This “built-in” supervision exploits the fact that SD-VAEs inherently encode fine texture, structure, and coarse semantics, as validated by PCA visualizations (see (Wang et al., 25 Jan 2026)).

Key features:

  • Featurization: Pre-computed VAE latents replace external encoder outputs; no additional parameters.
  • Loss Structure: Alignment uses an elementwise smooth-$\ell_1$ loss between projected transformer activations and VAE latents.
  • Computational Overhead: Adds only 4% GFLOPs and an 18 M parameter projection head (for SiT-XL/2), compared to 20–70% GFLOPs increase for external-encoder-based REPA.
  • Performance: Up to 4–7× faster convergence, FID gains up to 4.5, and matches or exceeds state-of-the-art acceleration methods such as SRA, with zero external dependencies (Wang et al., 25 Jan 2026).
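
The elementwise alignment loss in the list above can be sketched as follows, assuming it takes the standard smooth-$\ell_1$ form (quadratic near zero, linear beyond a threshold). The threshold `beta` and the function name are assumptions; the source does not state which variant or threshold VAE-REPA uses.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean elementwise smooth-L1 loss between two equal-length sequences."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            # Quadratic region: behaves like a scaled squared error.
            total += 0.5 * diff * diff / beta
        else:
            # Linear region: behaves like absolute error, less sensitive to outliers.
            total += diff - 0.5 * beta
    return total / len(pred)

# Hypothetical projected activations versus target latents.
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))  # 1.3125
```

Compared with the cosine objective used against external encoders, an elementwise loss also penalizes magnitude mismatches, which fits a target that lives in the VAE's own latent space.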

6. Impact on Latent Geometry and Downstream Generation

End-to-end training with REPA-E induces profound restructuring of VAE latent spaces:

  • Latent Conditioning: PCA analysis pre- and post-training shows raw VAE latents are either over-smoothed or noisy; REPA-E attenuates noise and revitalizes captured detail, producing well-conditioned and semantically organized codes (Leng et al., 14 Apr 2025).
  • Restored Variance: After REPA-E, spatial variance in latents is on par with healthy pretrained VAEs, preventing collapse.
  • Representation Quality: CKNNA alignment scores increase (from ~0.42 to ~0.55), correlating tightly with FID improvement.
  • Qualitative Gains: Early training samples show sharp structure and semantics absent in baseline or collapsed-latent variants.
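
A simple way to watch for latent collapse during training is to track the spatial variance of the latents. The sketch below assumes "spatial variance" means the variance across spatial positions within each channel, averaged over channels; that definition, the function name, and the toy values are assumptions rather than the paper's exact diagnostic.

```python
from statistics import pvariance

def mean_spatial_variance(latent):
    """Average over channels of the variance across spatial positions.

    latent: list of C channels, each a flat list of H*W values.
    """
    return sum(pvariance(channel) for channel in latent) / len(latent)

# Hypothetical 2-channel latents with 4 spatial positions each.
healthy = [[-1.0, 1.0, -1.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
collapsed = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]]

print(mean_spatial_variance(healthy))    # 1.0
print(mean_spatial_variance(collapsed))  # 0.0
```

A value that drops by orders of magnitude relative to the pretrained VAE would indicate the kind of collapse described for naïve end-to-end training.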

7. Limitations and Future Directions

Despite its broad improvements, REPA-E's successes are so far confined to latent diffusion transformers. It remains underexplored in pixel-space diffusion, in audio and language modalities, and in domains lacking robust VAE architectures. Future research directions include joint VAE–diffusion finetuning without external encoders, multi-layer (deep) alignment, and extension to video and specialized data domains (Wang et al., 25 Jan 2026).

Further, there is as yet no comprehensive theoretical explanation for the effectiveness of representation alignment as a generative regularizer, nor a fixed recipe for tuning the loss weights and alignment depth across highly heterogeneous pipelines. A plausible implication is that integrating advanced VAE architectures or richer latent spaces (e.g., normalizing flows) with alignment objectives could further enhance modeling capacity.

References

  • "REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers" (Leng et al., 14 Apr 2025)
  • "VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training" (Wang et al., 25 Jan 2026)
  • "Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think" (Yu et al., 2024)
