iREPA: Enhanced Spatial Alignment in Diffusion

Updated 14 April 2026

iREPA is an enhanced representation alignment method that replaces MLP projections with a convolutional layer and applies spatial normalization to accentuate local structure.
It accelerates convergence and improves image quality, achieving up to a 14-point FID reduction compared to baseline REPA in diffusion models.
The method challenges traditional global semantic alignment by demonstrating that prioritizing spatial correlations leads to more faithful and efficient image synthesis.

iREPA (Improved Representation Alignment for Perceptual Alignment) is an enhancement to Representation Alignment (REPA) for diffusion-transformer training in image generation. iREPA accentuates the transfer of spatial structure—rather than global semantic information—by replacing the standard multilayer perceptron (MLP) projection layer with a convolutional layer and introducing spatial normalization for the external (teacher) representation. These modifications yield improved convergence speed and generation quality in diffusion models across diverse vision encoders, model types, and training variants (Singh et al., 11 Dec 2025).

1. Background: Representation Alignment (REPA) in Diffusion Models

Representation Alignment (REPA) was introduced by Yu et al. (2024) to accelerate and stabilize diffusion-transformer training by distilling features from a strong, pretrained vision encoder into intermediate layers of the diffusion model. The method aims to directly inject perceptual-semantic priors by penalizing the ℓ₂ distance between projected diffusion features and the teacher encoder features at each training step. Specifically, for an image $x \sim p_{\text{data}}$, the $\ell$-th layer feature map of the diffusion transformer at noise level $t$ is $f^\ell_t(x) \in \mathbb{R}^{n \times H \times W}$, and the patch-token feature map from a frozen encoder $E$ is $h(x) \in \mathbb{R}^{d \times H \times W}$. A projection head $g_\phi: \mathbb{R}^{n \times H \times W} \rightarrow \mathbb{R}^{d \times H \times W}$ (typically a 3-layer MLP applied per location) maps the student features for comparison. The REPA objective is:

$L_{\text{REPA}}(\theta, \phi) = \mathbb{E}_{x, t} \left\| g_\phi(f^\ell_t(x; \theta)) - h(x) \right\|_2^2.$
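The objective above can be sketched for a single image and noise level as follows. This is a minimal NumPy illustration, not the reference implementation: the function names, the ReLU nonlinearity, the hidden width, and the random inputs are all assumptions made for the example, and the expectation over $x$ and $t$ is replaced by one sample.

```python
import numpy as np

def mlp_projection(f, weights, biases):
    """Apply the same MLP at every spatial location of f (shape n x H x W)."""
    n, H, W = f.shape
    tokens = f.reshape(n, H * W).T               # (H*W, n): one row per location
    for i, (w, b) in enumerate(zip(weights, biases)):
        tokens = tokens @ w + b
        if i < len(weights) - 1:                 # nonlinearity between layers only
            tokens = np.maximum(tokens, 0.0)     # ReLU is an assumption here
    d = tokens.shape[1]
    return tokens.T.reshape(d, H, W)

def repa_loss(f, h, weights, biases):
    """Squared l2 distance between projected student and teacher features."""
    diff = mlp_projection(f, weights, biases) - h
    return float(np.sum(diff ** 2))

# Hypothetical sizes and random stand-ins for student and teacher features.
rng = np.random.default_rng(0)
n, d, hidden, H, W = 8, 6, 16, 4, 4
f = rng.standard_normal((n, H, W))               # student feature map f_t^l(x)
h = rng.standard_normal((d, H, W))               # teacher feature map h(x)
sizes = [n, hidden, hidden, d]                   # three linear layers
weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
loss = repa_loss(f, h, weights, biases)
```

Because the MLP acts on each location independently, permuting the spatial positions of `f` simply permutes the projected output; the head itself carries no notion of which tokens are neighbors.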

REPA substantially reduces the number of training iterations required and improves image synthesis quality by aligning intermediate representations with those of high-performing vision encoders.

2. The iREPA Method: Modifications to Emphasize Spatial Structure

iREPA introduces two modifications to standard REPA, specifically targeting spatial structure:

Convolutional Projection: The projection head $g_\phi$, originally implemented as a 3-layer MLP, is replaced with a single $3 \times 3$ convolutional layer with padding 1. For $f^\ell_t(x)$ of shape $n \times H \times W$: $g_\phi(f^\ell_t(x)) = \mathrm{Conv}_{3 \times 3}(f^\ell_t(x)) \in \mathbb{R}^{d \times H \times W}$. This change ensures that local correlations between neighboring patch tokens are preserved, enforcing an inductive bias toward spatial coherence that is absent in per-location MLPs.
Spatial Normalization: The frozen encoder feature map $h(x)$ is pre-normalized to suppress its dominant global component, thus amplifying local contrast. For $h(x) \in \mathbb{R}^{d \times H \times W}$: $\tilde{h}(x)_{:,i,j} = h(x)_{:,i,j} - \gamma\,\bar{h}(x)$, where $\gamma$ is a learned or fixed scalar, and $\bar{h}(x) = \frac{1}{HW}\sum_{i,j} h(x)_{:,i,j}$ is the mean over spatial locations. This transformation shifts focus to relative (patch–patch) differences, accentuating spatial structure over absolute global content.

The iREPA training step thereby consists of spatially normalizing teacher features, applying a convolutional projection to student features, and minimizing the ℓ₂ loss between these representations.
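The three steps just described can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: spatial normalization is taken to mean subtracting `gamma` times the mean over spatial locations (any further rescaling is omitted), the convolution is written as the cross-correlation used by deep learning frameworks, and all function names are hypothetical.

```python
import numpy as np

def spatial_normalize(h, gamma=1.0):
    """Subtract gamma times the per-channel spatial mean from h (d x H x W)."""
    mean = h.mean(axis=(1, 2), keepdims=True)    # global component, shape (d, 1, 1)
    return h - gamma * mean

def conv3x3(f, kernel, bias):
    """3x3 convolution with zero padding 1. f: (n, H, W); kernel: (d, n, 3, 3)."""
    n, H, W = f.shape
    padded = np.pad(f, ((0, 0), (1, 1), (1, 1)))  # pad only the spatial axes
    out = np.zeros((kernel.shape[0], H, W))
    for di in range(3):
        for dj in range(3):
            window = padded[:, di:di + H, dj:dj + W]          # shifted view, (n, H, W)
            out += np.einsum("dn,nhw->dhw", kernel[:, :, di, dj], window)
    return out + bias[:, None, None]

def irepa_loss(f, h, kernel, bias, gamma=1.0):
    """Squared l2 distance between conv-projected student and normalized teacher."""
    diff = conv3x3(f, kernel, bias) - spatial_normalize(h, gamma)
    return float(np.sum(diff ** 2))

# Hypothetical sizes and random stand-ins for the two feature maps.
rng = np.random.default_rng(0)
n, d, H, W = 8, 6, 4, 4
f = rng.standard_normal((n, H, W))
h = rng.standard_normal((d, H, W)) + 3.0          # teacher with a large shared offset
kernel = 0.1 * rng.standard_normal((d, n, 3, 3))
bias = np.zeros(d)
loss = irepa_loss(f, h, kernel, bias, gamma=1.0)
```

Each output token of `conv3x3` mixes the nine tokens in its neighborhood, which is where the spatial inductive bias comes from; with `gamma=1.0` the normalized teacher has zero mean over locations in every channel.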

3. Experimental Setup and Results

Experiments were conducted on ImageNet 256×256 using SiT-B/2, SiT-L/2, and SiT-XL/2 diffusion transformers with the VAE backbone from stabilityai/sd-vae-ft-mse. Twenty-seven vision encoders, including DINOv2, DINOv3, CLIP, WebSSL, Perception Encoder, SAM2, and MoCoV3, were employed as teachers. Training used AdamW (learning rate 1e-4, batch size 256, up to 400K steps), and the Euler SDE sampler (250 NFEs). Performance was evaluated using FID, sFID, Inception Score (IS), Precision, Recall, and CMMD metrics.
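For readers who want to reproduce a comparable run, the setup above can be collected into a small configuration object. The class and field names below are hypothetical and chosen only for this sketch; the values are the ones quoted in the paragraph, with SiT-XL/2 and DINOv2-B picked as one example combination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the settings quoted in the text."""
    dataset: str = "ImageNet"
    resolution: int = 256
    model: str = "SiT-XL/2"                  # SiT-B/2 and SiT-L/2 were also used
    vae: str = "stabilityai/sd-vae-ft-mse"
    teacher: str = "DINOv2-B"                # one of the 27 teacher encoders
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    batch_size: int = 256
    max_steps: int = 400_000
    sampler: str = "Euler SDE"
    num_function_evaluations: int = 250

config = TrainingConfig()

# Total images processed at the maximum step budget.
images_seen = config.batch_size * config.max_steps
```

At the full 400K-step budget this works out to 102.4 million training samples drawn.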

Key results include:

Condition	FID (↓)	IS (↑)	Steps
Baseline REPA (DINOv2-B)	19.06	70.3	100K
iREPA (DINOv2-B)	16.96	77.9	100K
REPA (400K, w/ guidance)	1.98	157.2	400K
iREPA (400K, w/ guidance)	1.93	179.3	400K

Across various teacher encoders, iREPA achieved FID reductions of 4–14 points at 100K steps.
iREPA reached a "good-quality" FID of approximately 20 in 30–40K steps, compared to approximately 60K steps for baseline REPA, a speed-up of roughly 1.5× to 2×.
Performance improvements held across REPA variants (REPA-E, MeanFlow+REPA) and pixel-space models (JiT+REPA → JiT+iREPA).
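The differences implied by the table and the step counts can be checked with a few lines of arithmetic. The snippet below uses only numbers quoted above; the variable and function names are made up for the example.

```python
# FID and Inception Score values copied from the results table.
repa_100k = {"fid": 19.06, "inception_score": 70.3}
irepa_100k = {"fid": 16.96, "inception_score": 77.9}
repa_400k = {"fid": 1.98, "inception_score": 157.2}
irepa_400k = {"fid": 1.93, "inception_score": 179.3}

def improvement(baseline, improved):
    """FID drop (lower is better) and IS gain (higher is better)."""
    return {
        "fid_drop": round(baseline["fid"] - improved["fid"], 2),
        "is_gain": round(improved["inception_score"] - baseline["inception_score"], 1),
    }

gain_100k = improvement(repa_100k, irepa_100k)   # fid_drop 2.1, is_gain 7.6
gain_400k = improvement(repa_400k, irepa_400k)   # fid_drop 0.05, is_gain 22.1

# Steps needed to reach FID of about 20: roughly 60K for REPA, 30K-40K for iREPA.
speedup_low = 60_000 / 40_000                    # 1.5
speedup_high = 60_000 / 30_000                   # 2.0
```

With DINOv2-B as the teacher, the 100K-step gap is about 2.1 FID points, while at 400K steps with guidance the FID gap narrows to 0.05 and the larger difference shows up in Inception Score.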

4. Analysis: Spatial Structure versus Global Semantics

A central finding is that spatial structure of the target representation, rather than global semantic information (commonly measured by classification accuracy), is critical for generation quality:

Linear probing accuracy on ImageNet-1K correlates only weakly with generation FID, as measured by Pearson correlation.
Four spatial self-similarity metrics (LDS, CDS, SRSS, RMSC) exhibit a strong negative correlation with FID.
Vision encoders with higher classification accuracies can yield worse FID scores; e.g., PE-Core-G (82.8% LP, FID 32.3) versus SpatialPE-B (53.1% LP, FID 22.0).
Adding the global CLS token to patch tokens increases linear probe accuracy from 70.7% to 78.5%, yet degrades FID from 19.2 to 25.4.

This overturns the prevalent assumption that aligning for strong global semantics in teacher features yields better generation performance.
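The two counterexamples listed above can be checked directly, and the same data can be passed through a plain Pearson correlation to show how such a statistic is computed. This is only a sketch: the four (accuracy, FID) pairs are hand-picked counterexamples from two different comparisons, so the resulting coefficient is not the statistic reported over the full set of encoders, and the helper name `pearson` is an assumption of this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (linear-probe accuracy in %, FID) pairs quoted in the text.
pe_core_g = (82.8, 32.3)
spatial_pe_b = (53.1, 22.0)
patch_only = (70.7, 19.2)
patch_plus_cls = (78.5, 25.4)

# In both comparisons the more accurate representation has the worse (higher) FID.
encoder_case = pe_core_g[0] > spatial_pe_b[0] and pe_core_g[1] > spatial_pe_b[1]
cls_case = patch_plus_cls[0] > patch_only[0] and patch_plus_cls[1] > patch_only[1]

# Illustrative only: four hand-picked points, not the full encoder sweep.
points = [pe_core_g, spatial_pe_b, patch_only, patch_plus_cls]
r = pearson([p[0] for p in points], [p[1] for p in points])
```

Applying `pearson` to per-encoder accuracy and FID columns, and then to a spatial-structure metric and FID, is the kind of comparison the analysis in this section rests on.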

5. Ablation Studies

Ablation experiments isolate the contributions of each iREPA component:

MLP with spatial normalization: modest FID reduction (19.06 → 18.28).
Convolutional projection only: FID 19.06 → 18.52.
Full iREPA (convolution + spatial normalization): FID 19.06 → 16.96.

These results demonstrate that both components are individually beneficial, but their combination delivers the strongest improvement.
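A quick calculation on the ablation numbers makes the interaction between the two components explicit. The snippet uses only the FID values listed above; the dictionary keys and variable names are illustrative.

```python
baseline_fid = 19.06
ablation_fid = {
    "spatial_norm_only": 18.28,   # MLP head + spatial normalization
    "conv_only": 18.52,           # convolutional head, no normalization
    "full_irepa": 16.96,          # convolution + spatial normalization
}

# FID reduction of each variant relative to baseline REPA.
gains = {name: round(baseline_fid - fid, 2) for name, fid in ablation_fid.items()}

# If the two components were purely additive, the full gain would equal this sum.
additive_estimate = round(gains["spatial_norm_only"] + gains["conv_only"], 2)   # 1.32
extra_from_combination = round(gains["full_irepa"] - additive_estimate, 2)      # 0.78
```

The individual gains are 0.78 and 0.54 FID points, summing to 1.32, whereas the full method gains 2.10. On these numbers the combination improves FID by about 0.78 points more than the sum of its parts, which suggests the two changes reinforce each other.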

6. Mechanistic Insights: Importance of Spatial Structure in Diffusion

Diffusion models generate images through iterative denoising, a process requiring the preservation of fine-grained spatial correlations between pixels. Aligning only global, semantic features fails to convey relationships between local patches. The convolutional projection in iREPA enforces that adjacent tokens maintain related features, introducing an inductive bias absent in pointwise MLP heads. Spatial normalization of the teacher features removes the overwhelming global mean, sharpening local contrasts and yielding outputs with more faithful geometry.
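The effect of removing the global mean can be illustrated on synthetic data. In the sketch below, every patch token is the sum of a large shared vector and a small patch-specific vector; this construction, the sizes, and the helper name are assumptions made purely for illustration and do not come from any real encoder.

```python
import numpy as np

def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs. tokens: (T, d)."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T                                  # (T, T) similarity matrix
    off_diagonal = sim[~np.eye(len(tokens), dtype=bool)]  # drop self-similarities
    return float(off_diagonal.mean())

rng = np.random.default_rng(0)
T, d = 64, 32                                            # 64 patch tokens, 32 channels
global_component = 5.0 * rng.standard_normal(d)          # shared by every patch
local_component = rng.standard_normal((T, d))            # varies patch to patch
tokens = global_component + local_component

# Before: the shared vector dominates, so all tokens look alike.
before = mean_pairwise_cosine(tokens)

# After: subtracting the mean over tokens leaves only patch-to-patch differences.
centered = tokens - tokens.mean(axis=0, keepdims=True)
after = mean_pairwise_cosine(centered)
```

With the shared component present, nearly every pair of tokens is highly similar and the self-similarity map is almost flat; after centering, the average similarity drops to around zero and the remaining pattern reflects only local variation, which is the structure the alignment loss is meant to transfer.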

Empirically, spatial self-similarity metrics rise in tandem with decreasing FID, as measured by Pearson correlation, while improvements in classification accuracy do not guarantee lower FID. This suggests that accentuating spatial structure, rather than global semantic precision, is essential for the efficacy of representation alignment in diffusion-based image synthesis.

7. Broader Impact and Implications

The introduction of iREPA challenges the established paradigm that stronger global semantic alignment leads to superior generative modeling. Instead, results indicate that preserving and accentuating spatial structure is the decisive factor, with two simple architectural modifications—the convolutional projection and spatial normalization—yielding substantial advances in convergence speed and image quality. These findings motivate a reevaluation of how external representations are leveraged in generative model training and may influence future architectural choices for perceptual alignment in diffusion models (Singh et al., 11 Dec 2025).

References (1)

Singh et al. (2025). What matters for Representation Alignment: Global Information or Spatial Structure?
