Video REPA/CREPA: Alignment Techniques

Updated 14 April 2026

Video REPA and its extensions are methods that align internal video model representations with pretrained encoder features to accelerate convergence and improve output quality.
CREPA introduces cross-frame alignment by weighting similarities between adjacent frames, thereby enforcing temporal coherence and reducing visual flicker.
VideoREPA integrates relational spatial-temporal alignment and physics infusion through token relation distillation, enhancing both semantic adherence and physical commonsense.

Video REPA and its extensions, such as Cross-Frame Representation Alignment (CREPA) and VideoREPA, form a family of regularization and knowledge-transfer techniques that align the internal representations of video generation and modeling systems with those of external, pretrained encoder models. These approaches serve two core objectives within video diffusion models and latent-action world models: (1) accelerating convergence and enhancing quality during fine-tuning, and (2) imparting additional inductive priors, such as semantic coherence across frames or physics understanding, that are difficult to encode using diffusion objectives alone. The methodology has influenced both diffusion-based video synthesis and unsupervised action abstraction for world models.

1. Representation Alignment Foundations

REPresentation Alignment (REPA) was originally introduced for image diffusion models based on the Diffusion Transformer (DiT) architecture. It views these models as denoising autoencoders (DAEs) with an encoder (the early transformer blocks) and a decoder (the later blocks). In REPA, a distillation-style auxiliary loss aligns the noisy intermediate hidden states $h_t$ with richer external representations $\bar y$ derived from a frozen, pretrained image encoder $E(\cdot)$, such as DINOv2. Here, $h_t = g_\theta(x_t, t)$ is projected by a learned MLP $h_\phi$ into the feature space of $E$, and a similarity loss (typically cosine similarity) is applied:

$\mathcal{L}_{\rm REPA} = -\mathbb{E}_{x_0, \epsilon, t}\left[ \mathrm{sim}\bigl(\bar y, h_\phi(h_t)\bigr) \right].$

This loss is combined with the standard diffusion denoising loss as $\mathcal{L} = \mathcal{L}_{\rm score} + \lambda \mathcal{L}_{\rm REPA}$, and yields faster convergence and improved sample fidelity in DiT models. However, when applied naively to video one frame at a time, REPA does not guarantee consistency or coherence across adjacent frames (Hwang et al., 10 Jun 2025).
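The REPA objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random arrays standing in for $\bar y$ and $h_\phi(h_t)$, and the default weight `lam=0.5` are all assumptions made for the example.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (N, D) arrays."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_n * b_n, axis=-1)

def repa_loss(target_feats, projected_hidden):
    """Negative mean cosine similarity between encoder features and
    projected hidden states (one row per token)."""
    return -float(np.mean(cosine_sim(target_feats, projected_hidden)))

def total_loss(score_loss, target_feats, projected_hidden, lam=0.5):
    """L = L_score + lambda * L_REPA; lam=0.5 is an illustrative value."""
    return score_loss + lam * repa_loss(target_feats, projected_hidden)

# Random stand-ins (assumptions): 16 tokens with 32-dimensional features.
rng = np.random.default_rng(0)
y_bar = rng.normal(size=(16, 32))   # plays the role of the frozen encoder output
h_proj = rng.normal(size=(16, 32))  # plays the role of h_phi(h_t)

print(repa_loss(y_bar, h_proj))         # near 0 for unrelated features
print(repa_loss(y_bar, y_bar))          # close to -1 when perfectly aligned
print(total_loss(1.0, y_bar, h_proj))   # combined objective
```

Because the loss is a negated similarity, it is bounded between -1 (perfect alignment) and 1 (opposite directions), so $\lambda$ directly controls how strongly alignment competes with the denoising term.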

2. CREPA: Cross-Frame Representation Alignment for Video Diffusion

CREPA addresses the temporal coherence deficit in vanilla REPA when applied to video diffusion models (VDMs). The key insight is that generative video coherence depends on the global temporal manifold formed by framewise pretrained features. CREPA regularizes not just per-frame but cross-frame alignment:

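Given a video $x_0$ with frames $\{x_0^f\}_{f=1}^F$, a noisy sequence $x_t$, and encoder-derived hidden states $h_t^f$, the CREPA loss encourages each projected hidden state $h_\phi(h_t^f)$ to align not only with its own clean-frame feature $\bar y^f$ but also (weighted by a decaying schedule $e^{-|k-f|/\tau}$) with the features $\bar y^k$ of neighboring frames. Formally,

$\mathcal{L}_{\rm CREPA} = -\mathbb{E}_{x_0, \epsilon, t}\left[ \sum_{f=1}^{F} \left( \mathrm{sim}\bigl(\bar y^f, h_\phi(h_t^f)\bigr) + \sum_{k \in K} e^{-|k-f|/\tau}\, \mathrm{sim}\bigl(\bar y^k, h_\phi(h_t^f)\bigr) \right) \right]$

with $K = \{f-d,\, f+d\}$, where $d$ is the neighbor distance and $\tau$ is a temperature.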
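The cross-frame loss described above can be sketched as follows. This is a hedged illustration under stated assumptions: neighbors sit at offsets $\pm d$, the decay weight is the exponential $e^{-|k-f|/\tau}$, similarities are averaged over tokens within a frame, and neighbors that fall outside the clip are skipped. The function name `crepa_loss` and the defaults `d=1`, `tau=1.0` are hypothetical choices for the example.

```python
import numpy as np

def crepa_loss(targets, projected, d=1, tau=1.0, eps=1e-8):
    """Cross-frame alignment loss for one clip.

    targets:   (F, N, D) frozen-encoder features, one row per token per frame.
    projected: (F, N, D) projected hidden states of the video model.
    d, tau:    neighbor distance and temperature (illustrative defaults).
    """
    num_frames = targets.shape[0]
    t = targets / (np.linalg.norm(targets, axis=-1, keepdims=True) + eps)
    p = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)

    total = 0.0
    for f in range(num_frames):
        # Per-frame term: the same quantity vanilla REPA would use.
        total += np.mean(np.sum(t[f] * p[f], axis=-1))
        # Cross-frame terms: align frame f's hidden state with neighbor features.
        for k in (f - d, f + d):
            if 0 <= k < num_frames:  # assumption: skip out-of-range neighbors
                weight = np.exp(-abs(k - f) / tau)
                total += weight * np.mean(np.sum(t[k] * p[f], axis=-1))
    return -float(total)

# Random stand-ins (assumptions): 4 frames, 6 tokens each, 16-dim features.
rng = np.random.default_rng(0)
clip_targets = rng.normal(size=(4, 6, 16))
clip_hidden = rng.normal(size=(4, 6, 16))
print(crepa_loss(clip_targets, clip_hidden, d=1, tau=1.0))
```

When `d` is at least the clip length no neighbor is in range, and the function reduces to the per-frame REPA sum, which makes the extra cross-frame term easy to isolate in ablations.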

Empirical results using parameter-efficient LoRA fine-tuning on DiT-based video models such as CogVideoX-5B and Hunyuan Video demonstrate that CREPA achieves:

Improved Fréchet Video Distance (FVD) and Inception Score (IS)
Enhanced VBench scores on temporal and semantic metrics
10–15% increases in cross-frame CKNNA similarity
Reduced flicker and shape inconsistency
Superior user study preference rates (>70% in pairwise comparisons)
Superior 3D reconstruction proxies (e.g., PSNR +0.8 dB, SSIM +0.02, LPIPS –0.007)

Performance is robust across a wide selection of datasets (cartoons, 3D, movie scenes, physical interactions). The main hyperparameters are the loss weight $\lambda$, the neighbor distance $d$, and the temperature $\tau$ (Hwang et al., 10 Jun 2025).

3. VideoREPA: Relational Spatio-Temporal Alignment and Physics Infusion

While CREPA focuses on semantic and perceptual cross-frame alignment, VideoREPA targets the injection of physics understanding into powerful text-to-video (T2V) diffusion models. The technique leverages a stronger teacher: a video foundation model (VFM) such as VideoMAEv2, pretrained via self-supervised learning (SSL), which encodes physics knowledge not present in baseline T2V models.

Instead of aligning hidden features directly (“hard alignment”), VideoREPA aligns relational (pairwise) spatial and temporal token similarities between the teacher and the student at a chosen transformer depth, using the Token Relation Distillation (TRD) loss:

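Spatial (pairs of tokens $i, j$ within the same frame $d$ of the teacher features $\bar y$):

$\bar y^{\rm spatial}_{d,i,j} = \frac{\bar y_{d,i} \cdot \bar y_{d,j}}{\|\bar y_{d,i}\|\,\|\bar y_{d,j}\|}$

Temporal (pairs of tokens across different frames $d \neq e$):

$\bar y^{\rm temporal}_{d,i,e,j} = \frac{\bar y_{d,i} \cdot \bar y_{e,j}}{\|\bar y_{d,i}\|\,\|\bar y_{e,j}\|}$

and similarly for the student similarities $h^{\rm spatial}$ and $h^{\rm temporal}$.

The combined TRD loss:

$\mathcal{L}_{\rm TRD} = \frac{1}{N_{\rm s}} \sum_{d,i,j} \bigl| \bar y^{\rm spatial}_{d,i,j} - h^{\rm spatial}_{d,i,j} \bigr| + \frac{1}{N_{\rm t}} \sum_{d \neq e} \sum_{i,j} \bigl| \bar y^{\rm temporal}_{d,i,e,j} - h^{\rm temporal}_{d,i,e,j} \bigr|$

where $N_{\rm s}$ and $N_{\rm t}$ are the numbers of spatial and temporal token pairs.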

This soft relational formulation yields stable fine-tuning of pretrained models (CogVideoX-2B/5B), in contrast to the instability (NaN gradients) observed with per-feature hard alignment. Interpolation is used to match the teacher and student feature-map dimensions. Quantitative results show significant improvements in physical commonsense (PC) and semantic adherence (SA) on the VideoPhy and VideoPhy2 benchmarks, e.g., PC gains of up to +24% and a reduction of the intuitive-physics gap by up to 50% on the OCP task. Qualitative improvements are observed in scenarios such as rolling objects, crane lifting, and fluid pouring (Zhang et al., 29 May 2025).
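The relational idea can be illustrated with a short NumPy sketch. It is not the VideoREPA implementation: it assumes cosine similarity between every pair of tokens, a mean absolute difference between teacher and student similarity tensors, and an equal-weight sum of the spatial and temporal parts. The function names and the random arrays standing in for teacher and student features are hypothetical.

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Scale each token vector to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def token_relations(feats):
    """feats: (F, N, D) -> (F, N, F, N) cosine similarity of all token pairs."""
    z = normalize(feats)
    return np.einsum("dic,ejc->diej", z, z)

def trd_loss(teacher, student):
    """Mean absolute difference between teacher and student token relations.

    Returns (total, spatial, temporal). Requires at least two frames and the
    same (F, N) token grid for both inputs; feature widths may differ because
    only the pairwise relations are compared.
    """
    num_frames = teacher.shape[0]
    diff = np.abs(token_relations(teacher) - token_relations(student))
    # Mask that is True where both tokens of a pair lie in the same frame.
    same_frame = np.eye(num_frames, dtype=bool)[:, None, :, None]
    same_frame = np.broadcast_to(same_frame, diff.shape)
    spatial = diff[same_frame].mean()     # within-frame pairs
    temporal = diff[~same_frame].mean()   # cross-frame pairs
    return float(spatial + temporal), float(spatial), float(temporal)

# Random stand-ins (assumptions): 3 frames, 5 tokens, different feature widths.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(3, 5, 12))
student_feats = rng.normal(size=(3, 5, 7))
print(trd_loss(teacher_feats, student_feats))
```

One consequence of comparing relations rather than features is visible in the sketch: the student may use a different feature width, or a rotated or rescaled feature space, and still incur zero loss as long as its token-to-token similarities match the teacher's. Only the token grid has to agree, which is where the interpolation step comes in.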

4. Extensions to Latent Action Models and Sequence Alignment

Seq$\Delta$-REPA, introduced in Olaf-World, generalizes representation alignment to the control-effect domain for latent-action world modeling. Here, latent actions extracted per step by an inverse-dynamics encoder are aggregated over a video clip and projected to match the net change in the feature space of a frozen SSL video encoder. The full alignment loss is

$E(\cdot)$ 6

where $E(\cdot)$ 7 is the mean feature difference across frames. This sequence-level alignment solves the cross-context, non-identifiability problem of traditional LAMs, resulting in context-invariant control representations. Empirical results show improved Macro-F1 on cross-domain action classification and superior pose and temporal consistency under zero-shot and few-shot adaptation (Jiang et al., 10 Feb 2026).
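The sequence-level alignment described above can be sketched under several loudly stated assumptions, none of which are confirmed by the source: per-step latent actions are aggregated by summation, the projection is a single linear map `proj`, the target is the mean of consecutive frame-feature differences, and the loss is one minus cosine similarity. All names are hypothetical.

```python
import numpy as np

def mean_feature_difference(frame_feats):
    """frame_feats: (T, D) frozen-encoder features, one row per frame.
    Returns the mean of consecutive differences, shape (D,)."""
    return np.diff(frame_feats, axis=0).mean(axis=0)

def seq_alignment_loss(latent_actions, proj, frame_feats, eps=1e-8):
    """1 - cosine similarity between the projected, clip-aggregated latent
    action and the mean feature difference.

    latent_actions: (T-1, A) per-step latent actions.
    proj:           (A, D) hypothetical linear projection into feature space.
    """
    pooled = latent_actions.sum(axis=0) @ proj   # assumption: sum aggregation
    delta = mean_feature_difference(frame_feats)
    cos = pooled @ delta / (np.linalg.norm(pooled) * np.linalg.norm(delta) + eps)
    return float(1.0 - cos)

# Random stand-ins (assumptions): 5 frames, 6-dim features, 3-dim actions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 6))
actions = rng.normal(size=(4, 3))
projection = rng.normal(size=(3, 6))
print(seq_alignment_loss(actions, projection, feats))
```

The mean of consecutive differences equals the net change between the last and first frame divided by the number of steps, so under a cosine objective the two targets point in the same direction and give the same loss.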

5. Practical Fine-Tuning and Implementation

Video REPA/CREPA methods integrate into standard VDM and T2V pipelines as additive regularizers. For parameter-efficient fine-tuning, LoRA adapters are used to reduce memory and tuning cost: only the LoRA weights and the alignment MLPs are updated, and the rest of the model stays frozen. The layer at which the CREPA loss is applied is selected by linear probing (typically encoder block 8 in CogVideoX-5B and block 10 in Hunyuan Video).
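To make the frozen-versus-trainable split concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer. It illustrates the general LoRA idea (a frozen weight plus a trainable low-rank update, with one factor initialized to zero) and is not taken from the CREPA or VideoREPA code. The class name, rank, and scaling values are assumptions.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha/r) * B A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return base + update

    def num_trainable(self):
        """Number of parameters that would receive gradient updates."""
        return self.A.size + self.B.size

# Illustrative sizes (assumptions): a 64 -> 64 projection with rank 4.
rng = np.random.default_rng(1)
base_weight = rng.normal(size=(64, 64))
layer = LoRALinear(base_weight, rank=4)
x = rng.normal(size=(2, 64))
print(np.allclose(layer(x), x @ base_weight.T))  # B starts at zero, so no change yet
print(layer.num_trainable(), base_weight.size)   # trainable vs. frozen parameters
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, so the alignment loss begins shaping the model from its pretrained behavior rather than from a perturbed one.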

In VideoREPA, dimension alignment in practice involves upsampling the student tokens to match the teacher's, and training uses moderate batch sizes (e.g., 32 videos per batch on 8×A100 GPUs) and a short schedule (2000–4000 steps). For CREPA, datasets from diverse domains (cartoons, 3D, movie scenes, physical interactions) are used to validate generalization.

6. Evaluation, Ablations, and Limitations

Empirical evaluation employs both automatic metrics (FVD, IS, VBench, PC/SA) and human studies. Key ablations of the TRD loss show:

Loss/Setting	Semantic Adherence (SA)	Physical Commonsense (PC)
No TRD ($\mathcal{L}_{\rm diff}$ only)	63.6	23.2
TRD, spatial + temporal	64.2	29.7
TRD, spatial only	61.0	27.3
TRD, temporal only	61.0	27.9

Both spatial and temporal relations are required for maximal physics realism. Vanilla REPA (hard alignment) during T2V fine-tuning destabilizes model training (Zhang et al., 29 May 2025).

Limitations include potential oversmoothing during rapid scene switches, inheritance of teacher-model biases, and incomplete coverage of complex object interactions and advanced dynamics (e.g., articulated rigid–soft body transitions). Open areas include applying CREPA/VideoREPA during pretraining, multi-modal (audio/text) alignment, and higher-order (triplet or graph-based) relational objectives.

7. Outlook and Connections

Video REPA and its cross-frame, relational, and control-effect variants have reshaped the fine-tuning toolbox for generative video modeling—enabling efficient specialization, physics knowledge transfer, and robust unsupervised control. These advances position representation alignment as a foundation for scalable, physically plausible, and controllable video generation across varied architectures and domains (Hwang et al., 10 Jun 2025, Zhang et al., 29 May 2025, Jiang et al., 10 Feb 2026). The methodological tradeoffs among direct feature alignment, relational distillation, and sequence-level effect matching suggest a rich research space for future alignment protocols, especially for cross-modal and embodied video generation systems.