
End-to-End Representation Alignment (REPA-E)

Updated 10 December 2025
  • End-to-End Representation Alignment (REPA-E) is a method that explicitly aligns latent representations to enable joint tuning of tokenizers and downstream models across modalities.
  • It employs a mix of perceptual, adversarial, and contrastive alignment losses to stabilize gradient flow and achieve faster convergence with improved performance metrics.
  • The approach effectively mitigates latent collapse while preserving semantic structure, enhancing applications in image generation, speech recognition, and multi-modal transfer.

End-to-End Representation Alignment (REPA-E) refers to a class of methods that enable effective, stable, and high-quality joint optimization of encoder and downstream generative or prediction models via explicit alignment of latent representations. Conceptually, REPA-E approaches achieve expressive, cross-modally faithful deep architectures by replacing naive end-to-end tuning losses, which often cause collapse or degraded performance, with principled representation alignment objectives such as perceptual similarity, adversarial invariance, or shared token-space structure. In contrast to conventional two-stage or hard-alignment paradigms, REPA-E retains both end-to-end gradient flow and the semantic structure of representations across modalities and abstraction levels, leading to measurable improvements in generation, recognition, and multi-modal transfer tasks (Leng et al., 14 Apr 2025, Zhang et al., 2023, Wang et al., 2021).

1. Motivations and Historical Context

Traditional deep generative architectures, particularly latent diffusion models (LDMs) and variational autoencoders (VAEs), adopt a two-stage training paradigm: first the tokenizer (VAE) is trained with reconstruction or GAN-based losses, then its parameters are frozen and the downstream generative model (e.g., a diffusion transformer) is trained on its latent codes via the standard denoising loss. Attempts to perform end-to-end optimization by back-propagating generative losses directly into the tokenizer have consistently resulted in latent space collapse: a drastic reduction in spatial variance, induction of trivial biases in the denoiser, and poor generation metrics (e.g., exploding FID, loss of semantic fidelity).

Parallel developments in multi-modal learning (speech–text and vision–language models), recognition, and planning demonstrated similar pitfalls: hard segmentation alignments and naive cross-modal objectives degrade representation quality and impede transferability (Zhang et al., 2023, Wang et al., 2021). REPA-E emerged from the need to preserve cross-domain structure, semantic separability, and transferable token-space alignment throughout gradient flow, while enabling true end-to-end tuning of all architecture components.

2. Core Principles and Loss Functions

Central to REPA-E is the explicit alignment of learned representations using auxiliary objectives that are agnostic to original modal boundaries, but sensitive to semantic and perceptual structure. The most broadly used instantiations are:

  • Perceptual Alignment Loss (REPA): Measures similarity (cosine, CKNNA) between hidden-layer features from the generative model (e.g., diffusion transformer patch embeddings) and a frozen, high-fidelity perceptual encoder (e.g., DINOv2); a minimal sketch of this loss follows the list below. The REPA loss over $N$ patch embeddings is:

$$\mathcal{L}_{\mathrm{REPA}}(\theta, \phi, \omega) = -\,\mathbb{E}_{\mathbf{x},\epsilon,t}\left[\frac{1}{N} \sum_{n=1}^{N} \mathrm{sim}\bigl(\mathbf{y}^{[n]},\, h_\omega(\mathbf{h}_t^{[n]})\bigr)\right]$$

where $\mathbf{y}^{[n]}$ are target perceptual features and $h_\omega$ is a learned projection head (Leng et al., 14 Apr 2025).

  • Adversarial Modality Alignment: Encoders face off against a discriminator, which predicts modality label (e.g., speech vs. text). The encoder seeks to "fool" the discriminator, producing modality-invariant representations while retaining semantic specificity (Zhang et al., 2023).
  • Embedding Aligner and Modality Switch: Speech and text encoders are projected via a shared linear mapping, enforced by cross-modal losses and random switch training, ensuring tight clustering and interchangeability in latent space (Wang et al., 2021).
  • Contrastive Instance–Language Alignment: In multi-agent scenes, instance tokens are aligned—via contrastive loss—with language embeddings of generated scene descriptions, enforcing language-guided focus for planning (Song et al., 17 Mar 2025).
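
As a concrete illustration of the perceptual variant noted in the first bullet, the following is a minimal PyTorch sketch of the REPA loss. This is our own rendering, not the reference implementation; the shapes and the assumption that the frozen encoder's features are already matched to the same $N$ tokens are illustrative:

import torch
import torch.nn.functional as F

def repa_loss(h_t: torch.Tensor, y: torch.Tensor, proj: torch.nn.Module) -> torch.Tensor:
    # h_t:  diffusion hidden states, shape (B, N, D_h) -- N patch tokens
    # y:    frozen perceptual features (e.g., DINOv2), shape (B, N, D_y)
    # proj: learned projection head h_omega mapping D_h -> D_y
    h_proj = proj(h_t)                            # (B, N, D_y)
    sim = F.cosine_similarity(h_proj, y, dim=-1)  # per-token similarity, (B, N)
    return -sim.mean()                            # minimizing this aligns the features

In practice proj is typically a small MLP, and the frozen encoder runs under torch.no_grad() so no gradients flow into it.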

The full REPA-E objective, exemplified for VAE–diffusion training, combines: diffusion loss (stop-grad into VAE), REPA perceptual alignment loss (end-to-end), and VAE regularization (LPIPS, GAN, KL), weighted as:

$$\mathcal{L}(\theta, \phi, \omega) = \mathcal{L}_{\mathrm{DIFF}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{REPA}}(\theta, \phi, \omega) + \eta\,\mathcal{L}_{\mathrm{REG}}(\phi)$$

with empirical weights $\lambda_{\mathrm{REPA},\theta}=0.5$, $\lambda_{\mathrm{REPA},\phi}=1.5$, and $\eta=1.0$ (Leng et al., 14 Apr 2025); a sketch of this per-parameter-group weighting follows.
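
Because $\lambda$ takes different values for the diffusion-model parameters $\theta$ and the VAE parameters $\phi$, the weighting cannot be applied as a single scalar on the loss. One way to realize it is sketched below under our own naming (l_diff, l_repa, l_reg are the already-computed scalar losses; diff_params and vae_params are the two parameter lists); the reference code may do this differently:

import torch

# Weights reported by Leng et al., 14 Apr 2025.
LAM_THETA, LAM_PHI, ETA = 0.5, 1.5, 1.0

def repae_step(optimizer, l_diff, l_repa, l_reg, diff_params, vae_params):
    optimizer.zero_grad()
    # Base terms: diffusion loss (assumed already stop-gradded with
    # respect to the VAE) plus the VAE regularizer.
    (l_diff + ETA * l_reg).backward(retain_graph=True)
    # REPA term: take its gradients once, then accumulate them with a
    # group-specific weight before stepping.
    params = list(diff_params) + list(vae_params)
    scales = [LAM_THETA] * len(diff_params) + [LAM_PHI] * len(vae_params)
    grads = torch.autograd.grad(l_repa, params, allow_unused=True)
    for p, g, s in zip(params, grads, scales):
        if g is not None:
            p.grad = s * g if p.grad is None else p.grad + s * g
    optimizer.step()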

3. Algorithmic Design and Training Procedures

REPA-E implementations are typically characterized by the following algorithmic features:

  • End-to-End Gradient Flow With Alignment Loss: Unlike naive back-propagation of the main predictive/generative loss, only the alignment loss propagates gradients into both the tokenizer (VAE) and the downstream model; the primary generative loss is stop-gradded at the tokenizer boundary.
  • Token Normalization and Regularization: BatchNorm is inserted post-VAE to stabilize the latent distribution. Reconstruction regularizers (MSE, LPIPS, GAN, KL) ensure the VAE's output does not lose signal fidelity.
  • Staged Training Schedule: Initial pretraining of tokenizer/perceptual modules, followed by joint optimization under the composite REPA-E objective.
  • Discriminator and Switch Training (for modality cases): Alternating updates between the discriminator and the encoders, using continuous "mix-up" labels to avoid collapse and maintain well-defined representation gradients (Zhang et al., 2023); a sketch appears after the pseudocode below.

A representative high-level pseudocode for REPA-E-style end-to-end training is:

for each training step:
    # VAE encode: sample z from the posterior q(z|x)
    z, mu, logvar = VAE.encode(x)
    z_norm = BatchNorm(z)

    # Diffusion forward pass
    z_t = alpha_t * z_norm + sigma_t * eps
    h_t = Diffusion(z_t, t)
    eps_pred = head(h_t)
    L_diff = mean_square_error(eps_pred, eps)      # stop-grad: updates diffusion params only

    # REPA alignment (flows end-to-end into both VAE and diffusion model)
    y = perceptual_encoder(x)                      # frozen, e.g. DINOv2
    h_proj = align_head(h_t)
    L_repa = -mean_cosine_similarity(y, h_proj)

    # VAE regularization: reconstruction terms plus KL of the posterior
    x_rec = VAE.decode(z)
    L_reg = mse(x, x_rec) + lpips(x, x_rec) + gan(x, x_rec) + kl(mu, logvar)

    # Total loss (lambda differs per parameter group: 0.5 diffusion, 1.5 VAE)
    L = L_diff + lam * L_repa + eta * L_reg
    back-propagate L and step the optimizer to update all parameters
(Leng et al., 14 Apr 2025)
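
For the modality-alignment variants, the discriminator-based update listed above can be sketched as follows. This is our minimal rendering of the idea, not the reference implementation; we assume a sigmoid-output discriminator D over pre-pooled utterance embeddings, and the soft 0.5 targets in the encoder step are one simple way to realize the continuous mix-up-style labels:

import torch
import torch.nn.functional as F

def discriminator_step(D, opt_d, h_speech, h_text):
    # D(h) outputs p(modality = speech) in (0, 1); detach the encodings
    # so only the discriminator updates here.
    p_s, p_t = D(h_speech.detach()), D(h_text.detach())
    loss_d = F.binary_cross_entropy(p_s, torch.ones_like(p_s)) \
           + F.binary_cross_entropy(p_t, torch.zeros_like(p_t))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_d.item()

def encoder_adversarial_loss(D, h_speech, h_text):
    # Encoders are rewarded when D cannot tell modalities apart; pushing
    # both predictions toward 0.5 gives modality-invariant gradients
    # without the collapse risk of a hard label flip.
    p_s, p_t = D(h_speech), D(h_text)
    return F.binary_cross_entropy(p_s, torch.full_like(p_s, 0.5)) \
         + F.binary_cross_entropy(p_t, torch.full_like(p_t, 0.5))

These two updates alternate with the task losses, matching the schedule described in the list above.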

4. Empirical Performance and Analysis

REPA-E achieves substantial gains in both training speed and downstream performance metrics, notably:

  • ImageNet 256×256:
    • REPA-E achieves gFID=1.83 (no CFG) and gFID=1.26 (CFG scale 4), outperforming vanilla REPA and earlier LDM recipes. Training converges in ~400k steps (17–45× faster than vanilla REPA/two-stage) (Leng et al., 14 Apr 2025).
  • Latent Space Quality:
    • End-to-end REPA-E increases latent spatial variance, reduces total variation, and yields semantically organized color gradients in PCA visualization, compared to collapsed or noisy vanilla VAE latents.
  • Speech and Multi-Modal Recognition:
    • Modality-switch REPA-E reduces ASR WER by 14–19% and increases SLU F1 by 2.5–2.8 points absolute, tightly fusing text and speech latent spaces (Wang et al., 2021).
    • Soft alignment (adversarial REPA-E) outperforms hard pairwise alignment in multi-task speech translation, improving BLEU while retaining modality fidelity and multitask stability (Zhang et al., 2023).
  • Generalization:
    • REPA-E benefits propagate across scales, architectures, and perceptual encoders (CLIP, DINOv2, I-JEPA), yielding consistent FID reductions, faster training, and improved structure (Leng et al., 14 Apr 2025).

A plausible implication is that alignment-based objectives generalize robustly across modalities and architecture types, countering the collapse or scatter observed under naive joint optimization.

5. Design Decisions, Ablations, and Insights

Key ablation studies reveal the necessity of REPA-E components:

  • Direct Diffusion Loss Into VAE: Back-propagating the diffusion loss directly results in collapse: latent spatial variance plummets (~17 → 0.02), generative FID explodes (~444), and the latents become smooth and non-informative (a diagnostic sketch follows this list).
  • Removing Stop-Gradient or Alignment Loss: Degrades performance dramatically; alignment to external perceptual models stabilizes structure.
  • BatchNorm and Mix-Up Regularization: Omitting either degrades gFID and slows convergence.
  • Modality Weighting (λ): Controls alignment strength; excessive weighting causes collapse of modality-specific signals, insufficient weighting yields poor alignment. Empirically, λ≈3.5 balances trade-offs in speech–text tasks (Zhang et al., 2023).
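
The collapse diagnostic cited above is straightforward to instrument. Below is a minimal sketch of one such statistic (our construction; the papers' exact measurement may differ): per-channel variance over spatial positions of the latent map, averaged over the batch:

import torch

def latent_spatial_variance(z: torch.Tensor) -> float:
    # z: latent batch of shape (B, C, H, W). Variance is taken over the
    # H*W spatial positions per channel, then averaged; a drop toward
    # zero (e.g., ~17 -> ~0.02 under naive end-to-end tuning) signals
    # that the latents have lost spatial structure.
    b, c, h, w = z.shape
    var = z.reshape(b, c, h * w).var(dim=-1)   # (B, C)
    return var.mean().item()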

The data indicate that representation alignment losses provide stable, semantically meaningful gradients for both tokenizer and generator/predictor models, unlocking the ability to fully jointly tune all deep components.

6. Applications and Impact Across Domains

REPA-E methodologies have seen adoption in:

  • Image Generation: Improved VAE and LDM training, state-of-the-art FID and sample quality, fast convergence (Leng et al., 14 Apr 2025).
  • Speech Recognition and Multimodal Translation: Cross-modal latent alignment yields robust multitask performance and training stability (Zhang et al., 2023, Wang et al., 2021).
  • Autonomous Driving: Alignment of visual tokens and language representations for scene understanding, planning, and agent–map topology modeling (Song et al., 17 Mar 2025).

For multi-modal tasks, REPA-E enables reusable, transferable latent spaces, avoids catastrophic forgetting, and supports generalization to new modalities (vision, speech, text, scene graphs).

7. General Guidelines, Limitations, and Future Directions

REPA-E prescribes:

  • Use alignment objectives (perceptual, adversarial, or contrastive) rather than direct predictive/generative gradients for joint tuning.
  • Insert normalization layers (e.g., BatchNorm) post-tokenizer for latent stability.
  • Apply regularized reconstruction losses to preserve signal fidelity in tokenizer outputs.
  • Evaluate alignment strength (λ) empirically; excessive values risk collapsing modality-specific signals, while values that are too low yield weak alignment (see the illustrative configuration after this list).
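
These guidelines can be collected into an illustrative configuration. The numeric values come from the cited papers where stated; the key names are our own and purely for illustration:

REPAE_CONFIG = {
    "alignment_loss": "perceptual",      # or "adversarial" / "contrastive"
    "lambda_repa_diffusion": 0.5,        # Leng et al., 14 Apr 2025
    "lambda_repa_vae": 1.5,
    "eta_reg": 1.0,
    "post_tokenizer_norm": "batchnorm",  # stabilize the latent distribution
    "recon_losses": ["mse", "lpips", "gan", "kl"],
    "lambda_modality": 3.5,              # speech-text tasks (Zhang et al., 2023)
}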

While REPA-E robustly avoids latent space collapse and supports stable end-to-end optimization, one limitation is the requirement for high-fidelity perceptual or semantic targets (e.g., frozen encoder, stable discriminator). Future exploration may consider dynamic adaptation of alignment targets or self-supervised alternatives.

In summary, End-to-End Representation Alignment (REPA-E) replaces naive model joint tuning with alignment-based objectives, unlocking stable, fast, and high-fidelity optimization for both generative and recognition architectures across diverse modalities and tasks (Leng et al., 14 Apr 2025, Zhang et al., 2023, Wang et al., 2021, Song et al., 17 Mar 2025).
