
iREPA: Enhanced Spatial Alignment in Diffusion

Updated 14 April 2026
  • iREPA is an enhanced representation alignment method that replaces MLP projections with a convolutional layer and applies spatial normalization to accentuate local structure.
  • It accelerates convergence and improves image quality, achieving up to a 14-point FID reduction compared to baseline REPA in diffusion models.
  • The method challenges traditional global semantic alignment by demonstrating that prioritizing spatial correlations leads to more faithful and efficient image synthesis.

iREPA (Improved Representation Alignment for Perceptual Alignment) is an enhancement to Representation Alignment (REPA) for diffusion-transformer training in image generation. iREPA accentuates the transfer of spatial structure—rather than global semantic information—by replacing the standard multilayer perceptron (MLP) projection layer with a convolutional layer and introducing spatial normalization for the external (teacher) representation. These modifications yield improved convergence speed and generation quality in diffusion models across diverse vision encoders, model types, and training variants (Singh et al., 11 Dec 2025).

1. Background: Representation Alignment (REPA) in Diffusion Models

Representation Alignment (REPA) was introduced by Yu et al. (2024) to accelerate and stabilize diffusion-transformer training by distilling features from a strong, pretrained vision encoder into intermediate layers of the diffusion model. The method aims to directly inject perceptual-semantic priors by penalizing the ℓ₂ distance between projected diffusion features and the teacher encoder features at each training step. Specifically, for an image $x \sim p_{\text{data}}$, the $\ell$-th layer feature map of the diffusion transformer at noise level $t$ is $f^\ell_t(x) \in \mathbb{R}^{n \times H \times W}$, and the patch-token feature map from a frozen encoder $E$ is $h(x) \in \mathbb{R}^{d \times H \times W}$. A projection head $g_\phi: \mathbb{R}^{n \times H \times W} \rightarrow \mathbb{R}^{d \times H \times W}$ (typically a 3-layer MLP applied per location) maps the student features for comparison. The REPA objective is:

$$L_{\text{REPA}}(\theta, \phi) = \mathbb{E}_{x, t} \left\| g_\phi(f^\ell_t(x; \theta)) - h(x) \right\|_2^2.$$
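
As a rough illustration of this objective, the sketch below evaluates the squared ℓ₂ distance for a single image and noise level, using NumPy and random stand-in tensors. The function names, the hidden width, and the ReLU activation are assumptions made for the example; the text only says the head is a 3-layer MLP applied per location. The expectation over $x$ and $t$ would be approximated by averaging over a training batch.

```python
import numpy as np

def mlp_per_location(f, weights, biases):
    """Apply the same MLP to every spatial location of f, shape (n, H, W)."""
    n, H, W = f.shape
    tokens = f.reshape(n, H * W).T               # (H*W, n): one row per patch location
    for i, (w, b) in enumerate(zip(weights, biases)):
        tokens = tokens @ w + b
        if i < len(weights) - 1:                 # nonlinearity between layers only
            tokens = np.maximum(tokens, 0.0)     # ReLU is an assumption, not from the text
    return tokens.T.reshape(-1, H, W)            # (d, H, W)

def repa_loss(f, h, weights, biases):
    """Squared l2 distance between projected student features and teacher features."""
    return float(np.sum((mlp_per_location(f, weights, biases) - h) ** 2))

# Random stand-ins for student features f (n, H, W) and teacher features h (d, H, W).
rng = np.random.default_rng(0)
n, d, hidden, H, W = 8, 6, 16, 4, 4
f = rng.standard_normal((n, H, W))
h = rng.standard_normal((d, H, W))
dims = [n, hidden, hidden, d]                    # three linear layers
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
loss = repa_loss(f, h, weights, biases)
print(f"single-sample REPA loss: {loss:.3f}")
```

Because the MLP sees one location at a time, the projected feature at a patch cannot depend on its neighbors; this is the property that the convolutional head discussed later changes.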

REPA substantially reduces the number of training iterations needed to reach a given generation quality, and improves image synthesis quality, by aligning intermediate representations with those of high-performing vision encoders.

2. The iREPA Method: Modifications to Emphasize Spatial Structure

iREPA introduces two modifications to standard REPA, specifically targeting spatial structure:

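  1. Convolutional Projection: The projection head $g_\phi$, originally implemented as a 3-layer MLP, is replaced with a single $3 \times 3$ convolutional layer with padding 1. For a student feature map $f^\ell_t(x)$ of shape $n \times H \times W$, the head becomes $g_\phi(f^\ell_t(x)) = \mathrm{Conv}_{3 \times 3}(f^\ell_t(x)) \in \mathbb{R}^{d \times H \times W}$. Because each output location mixes a $3 \times 3$ neighborhood of patch tokens, local correlations between neighboring tokens are preserved, giving an inductive bias towards spatial coherence that per-location MLPs lack.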
  2. Spatial Normalization: The frozen encoder feature map $h(x)$ is normalized before the loss to suppress its dominant global component, thus amplifying local contrast. For $h(x) \in \mathbb{R}^{d \times H \times W}$, the global component is scaled by a scalar (learned or fixed) and removed from the feature at every patch location. This transformation shifts focus to relative (patch–patch) differences, accentuating spatial structure over absolute global content.

The iREPA training step thereby consists of spatially normalizing teacher features, applying a convolutional projection to student features, and minimizing the ℓ₂ loss between these representations.
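
The training step just described can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact normalization formula is not reproduced in this text, so the sketch assumes the global component is the per-channel mean over the $H \times W$ patch locations, scaled by a hypothetical scalar `gamma`. Any further rescaling the method may apply is omitted, and all function names are illustrative.

```python
import numpy as np

def spatial_normalize(h, gamma=1.0):
    """Subtract gamma times the mean over patch locations from teacher features.

    h: (d, H, W). Assumption: the global component is the per-channel spatial mean.
    """
    mean = h.mean(axis=(1, 2), keepdims=True)
    return h - gamma * mean

def conv3x3(f, kernel, bias):
    """3x3 convolution, stride 1, zero padding 1.

    f: (n, H, W); kernel: (d, n, 3, 3); bias: (d,). Returns (d, H, W).
    """
    n, H, W = f.shape
    padded = np.pad(f, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + H, dx:dx + W]   # input shifted by (dy-1, dx-1)
            out += np.einsum("dn,nhw->dhw", kernel[:, :, dy, dx], window)
    return out + bias[:, None, None]

def irepa_loss(f, h, kernel, bias, gamma=1.0):
    """Squared l2 distance between conv-projected student and normalized teacher."""
    return float(np.sum((conv3x3(f, kernel, bias) - spatial_normalize(h, gamma)) ** 2))

# Random stand-ins; the +3.0 offset mimics a large shared global component.
rng = np.random.default_rng(0)
n, d, H, W = 8, 6, 4, 4
f = rng.standard_normal((n, H, W))
h = rng.standard_normal((d, H, W)) + 3.0
kernel = rng.standard_normal((d, n, 3, 3)) * 0.1
bias = np.zeros(d)
loss = irepa_loss(f, h, kernel, bias, gamma=1.0)
print(f"single-sample iREPA loss: {loss:.3f}")
```

In a real training loop the kernel and bias would be learned jointly with the diffusion model, and the teacher features would come from the frozen encoder rather than a random generator.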

3. Experimental Setup and Results

Experiments were conducted on ImageNet 256×256 using SiT-B/2, SiT-L/2, and SiT-XL/2 diffusion transformers with the VAE backbone from stabilityai/sd-vae-ft-mse. Twenty-seven vision encoders, including DINOv2, DINOv3, CLIP, WebSSL, Perception Encoder, SAM2, and MoCoV3, served as teachers. Training used AdamW (learning rate 1e-4, batch size 256, up to 400K steps); sampling used the Euler SDE sampler with 250 function evaluations (NFEs). Performance was evaluated using FID, sFID, Inception Score (IS), Precision, Recall, and CMMD.

Key results include:

| Condition | FID (↓) | IS (↑) | Steps |
|---|---|---|---|
| Baseline REPA (DINOv2-B) | 19.06 | 70.3 | 100K |
| iREPA (DINOv2-B) | 16.96 | 77.9 | 100K |
| REPA (400K, w/ guidance) | 1.98 | 157.2 | 400K |
| iREPA (400K, w/ guidance) | 1.93 | 179.3 | 400K |

  • Across various teacher encoders, iREPA achieved FID reductions of 4–14 points at 100K steps.
  • iREPA reached a “good-quality” FID of approximately 20 in 30–40K steps, compared to approximately 60K steps for baseline REPA, a speed-up of roughly 1.5× to 2×.
  • Performance improvements held across REPA variants (REPA-E, MeanFlow+REPA) and pixel-space models (JiT+REPA → JiT+iREPA).
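
For reference, the short calculation below works through the headline numbers quoted above (DINOv2-B teacher at 100K steps, and the step counts to reach an FID of about 20). It uses only figures stated in this section; the variable names are illustrative.

```python
# FID at 100K steps with the DINOv2-B teacher, as quoted in the results.
repa_fid, irepa_fid = 19.06, 16.96
abs_gain = repa_fid - irepa_fid
rel_gain = abs_gain / repa_fid

# Approximate steps needed to reach FID ~20, as quoted in the results.
repa_steps = 60_000
irepa_steps = (30_000, 40_000)
speedups = tuple(repa_steps / s for s in irepa_steps)

print(f"FID gain: {abs_gain:.2f} points ({rel_gain:.1%} relative)")
# prints: FID gain: 2.10 points (11.0% relative)
print(f"Speed-up to FID ~20: {speedups[1]:.1f}x to {speedups[0]:.1f}x")
# prints: Speed-up to FID ~20: 1.5x to 2.0x
```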

4. Analysis: Spatial Structure versus Global Semantics

A central finding is that spatial structure of the target representation, rather than global semantic information (commonly measured by classification accuracy), is critical for generation quality:

  • Linear probing accuracy on ImageNet-1K shows only a weak Pearson correlation with generation FID.
  • Four spatial self-similarity metrics (LDS, CDS, SRSS, RMSC) show a strong negative Pearson correlation with FID.
  • Vision encoders with higher classification accuracies can yield worse FID scores; e.g., PE-Core-G (82.8% LP, FID 32.3) versus SpatialPE-B (53.1% LP, FID 22.0).
  • Adding the global CLS token to patch tokens increases linear probe accuracy from 70.7% to 78.5%, yet degrades FID from 19.2 to 25.4.

This inverts the prevalent assumption that teacher features with stronger global semantics yield better generation performance.
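
One way to build intuition for the CLS-token observation is to look at how a shared global vector changes patch-to-patch similarity. The sketch below is a generic illustration on synthetic data, not an implementation of LDS, CDS, SRSS, or RMSC, which this text does not define. It assumes that the global component is added to every patch token; the helper name and all values are hypothetical.

```python
import numpy as np

def mean_offdiag_cosine(tokens):
    """Mean cosine similarity between distinct patch tokens; tokens: (num_patches, dim)."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T                                   # patch self-similarity matrix
    p = sim.shape[0]
    return float((sim.sum() - np.trace(sim)) / (p * (p - 1)))

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))                   # synthetic patch tokens
global_vec = 5.0 * rng.standard_normal(32)                # synthetic shared (CLS-like) component

before = mean_offdiag_cosine(patches)
after = mean_offdiag_cosine(patches + global_vec)         # same vector added to every patch
print(f"mean patch similarity without global component: {before:.2f}")
print(f"mean patch similarity with global component:    {after:.2f}")
```

With a large shared component, all patches look alike under cosine similarity, so the self-similarity matrix carries little information about which patches belong together. This is consistent with, though it does not prove, the finding that stronger global content can hurt generation.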

5. Ablation Studies

Ablation experiments isolate the contributions of each iREPA component:

  • MLP with spatial normalization: modest FID reduction (19.06 → 18.28).
  • Convolutional projection only: FID 19.06 → 18.52.
  • Full iREPA (convolution + spatial normalization): FID 19.06 → 16.96.

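These results show that each component helps on its own (FID gains of 0.78 and 0.54 points, respectively) and that their combination delivers the strongest improvement: the combined gain of 2.10 points exceeds the 1.32-point sum of the individual gains.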

6. Mechanistic Insights: Importance of Spatial Structure in Diffusion

Diffusion models generate images through iterative denoising, a process requiring the preservation of fine-grained spatial correlations between pixels. Aligning only global, semantic features fails to convey relationships between local patches. The convolutional projection in iREPA enforces that adjacent tokens maintain related features, introducing an inductive bias absent in pointwise MLP heads. Spatial normalization of the teacher features removes the overwhelming global mean, sharpening local contrasts and yielding outputs with more faithful geometry.
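
The difference in inductive bias between the two heads can be checked directly by perturbing one patch and counting how many output locations change. The sketch below uses a single linear map as a stand-in for the per-location MLP (an assumption for brevity; a nonlinear per-location MLP is equally local) and random weights throughout. All names are illustrative.

```python
import numpy as np

def pointwise_head(f, weight):
    """Per-location linear map (equivalent to a 1x1 projection). f: (n, H, W); weight: (d, n)."""
    return np.einsum("dn,nhw->dhw", weight, f)

def conv3x3_head(f, kernel):
    """3x3 convolution with zero padding 1. f: (n, H, W); kernel: (d, n, 3, 3)."""
    n, H, W = f.shape
    padded = np.pad(f, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + H, dx:dx + W]
            out += np.einsum("dn,nhw->dhw", kernel[:, :, dy, dx], window)
    return out

def changed_locations(head, f, y, x):
    """Count output locations affected by perturbing the input at patch (y, x)."""
    bumped = f.copy()
    bumped[:, y, x] += 1.0
    diff = np.abs(head(bumped) - head(f)).sum(axis=0)     # (H, W)
    return int(np.count_nonzero(diff > 1e-12))

rng = np.random.default_rng(0)
n, d, H, W = 4, 5, 6, 6
f = rng.standard_normal((n, H, W))
weight = rng.standard_normal((d, n))
kernel = rng.standard_normal((d, n, 3, 3))

n_point = changed_locations(lambda t: pointwise_head(t, weight), f, 3, 3)
n_conv = changed_locations(lambda t: conv3x3_head(t, kernel), f, 3, 3)
print(n_point, n_conv)   # prints: 1 9
```

The pointwise head confines the perturbation to one location, while the convolution spreads it over the 3×3 neighborhood, so gradients from the alignment loss at one patch also reach the student features of adjacent patches.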

Empirically, spatial self-similarity metrics rise in tandem with decreasing FID, while improvements in classification accuracy do not guarantee lower FID. This suggests that accentuating spatial structure, rather than global semantic precision, is essential for the efficacy of representation alignment in diffusion-based image synthesis.

7. Broader Impact and Implications

The introduction of iREPA challenges the established paradigm that stronger global semantic alignment leads to superior generative modeling. Instead, results indicate that preserving and accentuating spatial structure is the decisive factor, with two simple architectural modifications—the convolutional projection and spatial normalization—yielding substantial advances in convergence speed and image quality. These findings motivate a reevaluation of how external representations are leveraged in generative model training and may influence future architectural choices for perceptual alignment in diffusion models (Singh et al., 11 Dec 2025).
