
Semantic-Guided RIFE: Real-Time VFI via Semantics

Updated 27 December 2025
  • The paper introduces SG-RIFE, which augments a frozen RIFE backbone with DINOv3 semantics to achieve diffusion-competitive perceptual quality at near real-time speeds.
  • The methodology employs Split-FAPM and DSF modules to compress, align, and inject semantic features, significantly improving robustness against large motions and occlusions.
  • Parameter-efficient fine-tuning updates only 16% of the model parameters, achieving substantial FID improvements over baseline methods on benchmark datasets.

Semantic-Guided RIFE (SG-RIFE) is a video frame interpolation (VFI) methodology that augments a frozen, flow-based RIFE backbone with dense, frozen semantic priors extracted from a DINOv3 Vision Transformer. SG-RIFE is designed to bridge the gap between the high throughput of flow-based methods and the superior perceptual quality achieved by recent diffusion models, enabling diffusion-competitive perceptual fidelity at near real-time speeds. It achieves this via parameter-efficient fine-tuning: semantic features are projected, warped, fused, and injected at key stages in the RIFE pipeline, significantly improving robustness to large motion and occlusions while maintaining computational efficiency (Wong et al., 20 Dec 2025).

1. Architecture and Module Overview

SG-RIFE operates as a two-stage pipeline rooted in the RIFE framework, introducing semantic feature injection at the FusionNet stage. The major components are:

  • IFNet (frozen): Computes coarse bidirectional flows $F_{t\rightarrow 0}, F_{t\rightarrow 1}$ from input frames $I_0, I_1$, and generates a coarse interpolated image $\tilde{I}_{\text{coarse}}$ via warping.
  • ContextNet (frozen): Extracts multi-scale contextual features $F_{\text{ctx}}$ for both input frames.
  • DINOv3 (frozen, Small variant): Provides intermediate semantic features $D_0, D_1$ from input images at layers 8 and 11.
  • Split-Fidelity Aware Projection Module (Split-FAPM): Compresses 384-dimensional DINO features to task-adapted 256-dimensional features, followed by a flow-based warp and post-warp refinement.
  • Deformable Semantic Fusion (DSF): Performs flow-guided alignment of semantic features, correcting misalignments caused by occlusions and large displacements.
  • FusionNet: Receives $\tilde{I}_{\text{coarse}}$, $F_{\text{ctx}}$, and DSF-aligned semantics to predict the high-frequency residual $\Delta$, forming the final output $I_t = \tilde{I}_{\text{coarse}} + \Delta$.

The high-level data flow can be summarized as:

| Component | Input | Output / Role |
|---|---|---|
| IFNet | $I_0, I_1$ | $F_{t\rightarrow 0}, F_{t\rightarrow 1}$, $\tilde{I}_{\text{coarse}}$ |
| ContextNet | $I_0, I_1$ | $F_{\text{ctx}}$ |
| DINOv3 | $I_0, I_1$ | $D_0, D_1$ (semantic embeddings) |
| Split-FAPM | $D_0, D_1$ | Compressed and refined semantic features $\hat{D}$ |
| DSF | $F_{\text{ctx}}, \hat{D}$ | Aligned semantic features $F_{\text{aligned}}$ |
| FusionNet | Multiple inputs | Residual $\Delta$ and final output $I_t$ |

This modular structure facilitates efficient semantic injection, with both IFNet and ContextNet remaining frozen to constrain memory and computational cost (Wong et al., 20 Dec 2025).
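
The following minimal PyTorch-style sketch illustrates this data flow. All module handles (ifnet, split_fapm, etc.) and their call signatures are placeholders assumed for illustration, not the authors' released interfaces:

```python
import torch

def sg_rife_forward(i0, i1, ifnet, contextnet, dino, split_fapm, dsf, fusionnet):
    """Sketch of the SG-RIFE two-stage pipeline (module interfaces are assumed)."""
    with torch.no_grad():                       # IFNet, ContextNet, DINOv3 stay frozen
        f_t0, f_t1, i_coarse = ifnet(i0, i1)    # bidirectional flows + coarse frame
        f_ctx = contextnet(i0, i1)              # multi-scale context features
        d0, d1 = dino(i0), dino(i1)             # 384-D semantics from layers 8 and 11
    d_hat = split_fapm(d0, d1, f_t0, f_t1)      # compress to 256-D, warp, refine
    f_aligned = dsf(f_ctx, d_hat)               # flow-guided deformable alignment
    delta = fusionnet(i_coarse, f_ctx, f_aligned)  # high-frequency residual
    return i_coarse + delta                     # I_t = I_coarse + delta
```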

2. Semantic Feature Processing: Split-FAPM and DSF

Split-Fidelity Aware Projection Module (Split-FAPM)

Split-FAPM consists of two subcomponents:

  • Pre-Warp Compression: DINO features $D_{\text{raw}} \in \mathbb{R}^{H \times W \times 384}$ are reduced via $1\times 1$ convolutions:
    • Feature pathway: $F_{\text{feat}} = \phi_f(D_{\text{raw}})$ (to $256$ channels).
    • Affine pathway: $[\gamma, \beta] = \phi_m(D_{\text{raw}})$ (both $256$ channels).
    • Combination: $F_{\text{sem}'} = \gamma \odot F_{\text{feat}} + \beta$.
  • Post-Warp Refinement: After warping $F_{\text{sem}'}$ using the IFNet flows, a residual SE–GELU block refines the output:
    • $F_{\text{sem}''} = \Phi_{\text{ref}}(\mathcal{W}(F_{\text{sem}'}, F_{t\rightarrow 0}))$.

This sequence adapts semantic channels for the video interpolation task and corrects spatial disruptions from warping.
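
A compact sketch of Split-FAPM under these definitions is given below. The $1\times 1$ pathway widths follow the paper (384 → 256); the exact layout of the residual SE–GELU refinement block and the warp function are assumptions:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention used inside the (assumed) refinement block."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.GELU(),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> channel gates
        return x * w[:, :, None, None]

class SplitFAPM(nn.Module):
    """Sketch of Split-FAPM: pre-warp compression + post-warp refinement."""
    def __init__(self, in_ch=384, out_ch=256):
        super().__init__()
        self.phi_f = nn.Conv2d(in_ch, out_ch, 1)      # feature pathway (384 -> 256)
        self.phi_m = nn.Conv2d(in_ch, 2 * out_ch, 1)  # affine pathway -> [gamma, beta]
        self.refine = nn.Sequential(                  # SE-GELU block (layout assumed)
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU(),
            SqueezeExcite(out_ch))

    def forward(self, d_raw, flow, warp_fn):
        f_feat = self.phi_f(d_raw)
        gamma, beta = self.phi_m(d_raw).chunk(2, dim=1)
        f_sem = gamma * f_feat + beta                 # F_sem' = gamma * F_feat + beta
        warped = warp_fn(f_sem, flow)                 # backward warp with IFNet flow
        return warped + self.refine(warped)           # residual post-warp refinement
```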

Deformable Semantic Fusion (DSF)

DSF aligns semantic features to pixel-level motion through learned, soft, flow-guided offsets:

  • Compute warping-and-refinement: $\hat{D}_{t \gets i} = \Phi_{\text{ref}}(\mathcal{W}(D_i, F_{t\rightarrow i}))$, $i \in \{0, 1\}$.
  • Project semantic/context features to Q/K/V: $Q = \phi_q(F_{\text{ctx}})$, $K = \phi_k(\hat{D})$, $V = \phi_v(\hat{D})$.
  • Predict offsets $\Delta p$ and modulations $\Delta m$ with a $1\times 1$ convolution over $[Q, K]$.
  • Aggregate via grouped deformable convolution:

$$F_{\text{aligned}}(p) = \gamma \cdot \sum_{k=1}^{9} W_k \cdot V(p + p_k + \Delta p_k) \cdot \Delta m_k,$$

where $G = 4$ or $8$ groups are used at two FusionNet scales.

DSF corrects rigid misalignments, particularly enhancing interpolation under occlusion and non-linear deformations (Wong et al., 20 Dec 2025).
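
This aggregation maps naturally onto modulated deformable convolution. The sketch below uses torchvision.ops.deform_conv2d; the channel widths, zero-initialized output scale $\gamma$, and head layout are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DSF(nn.Module):
    """Sketch of Deformable Semantic Fusion with G offset groups (interfaces assumed)."""
    def __init__(self, ctx_ch, sem_ch, groups=4, k=3):
        super().__init__()
        self.k, self.groups = k, groups
        self.phi_q = nn.Conv2d(ctx_ch, sem_ch, 1)   # query from context features
        self.phi_k = nn.Conv2d(sem_ch, sem_ch, 1)   # key from warped semantics
        self.phi_v = nn.Conv2d(sem_ch, sem_ch, 1)   # value from warped semantics
        # 1x1 conv over [Q, K] predicts 2*k*k*G offsets plus k*k*G modulation masks
        self.offset_head = nn.Conv2d(2 * sem_ch, 3 * k * k * groups, 1)
        self.weight = nn.Parameter(torch.randn(sem_ch, sem_ch // groups, k, k) * 0.01)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable scale (zero-init assumed)

    def forward(self, f_ctx, d_hat):
        q, key, v = self.phi_q(f_ctx), self.phi_k(d_hat), self.phi_v(d_hat)
        off_mask = self.offset_head(torch.cat([q, key], dim=1))
        n_off = 2 * self.k * self.k * self.groups
        offset = off_mask[:, :n_off]                # Delta p, per group and tap
        mask = off_mask[:, n_off:].sigmoid()        # Delta m, in (0, 1)
        out = deform_conv2d(v, offset, self.weight, padding=self.k // 2, mask=mask)
        return self.gamma * out                     # F_aligned
```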

3. Training Protocol, Fine-Tuning, and Losses

SG-RIFE employs a parameter-efficient fine-tuning paradigm, freezing all heavy compute backbones and updating only $16\%$ ($\sim$5.6M of 34.4M) of the total parameters. The trainable modules are Split-FAPM, DSF, the FusionNet injection layers, and the FusionNet base (the latter unfrozen only after an initial warm-up stage).

Training Details:

  • Dataset: Vimeo90K train split (51,312 triplets, $448\times 256$ resolution), with random crop, flipping, and temporal reversal augmentations.
  • Schedule: Two stages on a single NVIDIA L4 GPU:
    • Stage 1 (5 epochs): Only Split-FAPM and DSF are trained.
    • Stage 2 (25 subsequent epochs): FusionNet base is unfrozen; all adapters are trained jointly.
  • Optimization: AdamW optimizer, initial learning rate $2\times 10^{-4}$ with cosine annealing, batch size 64, and gradient clipping (max norm 1.0); see the configuration sketch below.
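
A minimal sketch of the staged freezing and optimizer setup, assuming module attribute names (split_fapm, dsf, fusionnet) that are illustrative rather than taken from released code:

```python
import torch

def configure_stage(model, stage, epochs):
    """Stage 1 (5 epochs): train Split-FAPM + DSF only.
    Stage 2 (25 epochs): additionally unfreeze the FusionNet base."""
    for p in model.parameters():
        p.requires_grad = False                 # freeze everything by default
    modules = [model.split_fapm, model.dsf]     # always-trainable adapters
    if stage == 2:
        modules.append(model.fusionnet)         # unfreeze FusionNet in stage 2
    params = []
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
            params.append(p)
    opt = torch.optim.AdamW(params, lr=2e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return params, opt, sched

# Usage: params, opt, sched = configure_stage(model, stage=1, epochs=5)
# Inside the training loop, clip gradients to unit norm:
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
```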

Loss Function:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{dis}} \mathcal{L}_{\text{dis}} + \lambda_{\text{tea}} \mathcal{L}_{\text{tea}} + \lambda_{\text{sem}} \mathcal{L}_{\text{sem}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}}.$$

  • $\mathcal{L}_{\text{rec}}$: Laplacian-pyramid reconstruction loss (inherited from RIFE)
  • $\mathcal{L}_{\text{tea}}$: Student-teacher distillation loss
  • $\mathcal{L}_{\text{sem}} = \|D(I_{\text{pred}}) - D(I_{\text{gt}})\|_1$: Semantic consistency loss, penalizing divergence between predicted and ground-truth semantic features
  • $\mathcal{L}_{\text{reg}}$: $L_1$ regularization on DSF offsets across scales $S_2$, $S_3$
  • Weights: $\lambda_{\text{rec}} = 0.1$, $\lambda_{\text{dis}} = 0.01$, $\lambda_{\text{tea}} = 0.1$, $\lambda_{\text{sem}} = 0.5$, $\lambda_{\text{reg}} = 1\times 10^{-4}$

This combination encourages perceptually faithful output while maintaining semantic and spatial alignment.
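
A sketch of the weighted loss combination follows. The reconstruction, distillation, and teacher terms are passed in as assumed callables (their internals are not reproduced here); the semantic and offset-regularization terms are computed inline:

```python
import torch
import torch.nn.functional as F

# Loss weights from the paper; loss-term internals are assumed.
LAMBDAS = dict(rec=0.1, dis=0.01, tea=0.1, sem=0.5, reg=1e-4)

def total_loss(pred, gt, dino, dsf_offsets, l_rec, l_dis, l_tea):
    """Weighted sum of the five SG-RIFE training losses (sketch)."""
    loss = (LAMBDAS["rec"] * l_rec(pred, gt)        # Laplacian-pyramid reconstruction
            + LAMBDAS["dis"] * l_dis(pred, gt)      # distillation term
            + LAMBDAS["tea"] * l_tea(pred, gt))     # student-teacher term
    # Semantic consistency: L1 between frozen-DINO features of pred and gt
    loss = loss + LAMBDAS["sem"] * F.l1_loss(dino(pred), dino(gt))
    # L1 regularization on DSF offsets at scales S2 and S3
    loss = loss + LAMBDAS["reg"] * sum(o.abs().mean() for o in dsf_offsets)
    return loss
```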

4. Quantitative and Qualitative Performance

Extensive evaluation on SNU-FILM (Easy, Medium, Hard, Extreme splits) reveals that SG-RIFE delivers state-of-the-art FID, LPIPS, and runtime performance among near-real-time methods. Principal results:

| Method | Easy FID | Medium FID | Hard FID | Extreme FID | Time (s) |
|---|---|---|---|---|---|
| RIFE | 5.456 | 10.833 | 23.320 | 47.458 | 0.01 |
| LDMVFI | 5.752 | 12.485 | 26.520 | 47.042 | 22.32 |
| Consec. BB | 4.791 | 9.039 | 18.589 | 36.631 | 2.60 |
| SG-RIFE | 4.557 | 8.678 | 17.896 | 36.272 | 0.05 |

Statistical analysis (paired $t$-tests) confirms that FID improvements for SG-RIFE over RIFE and LDMVFI are significant ($p < 0.01$ on the Medium/Hard splits). Qualitatively, SG-RIFE preserves object identity and fine structures in highly occluded or non-linearly moving regions, where standard RIFE produces ghosting or unnatural artifacts. In such cases, DSF realigns semantic patches and Split-FAPM recovers texture fidelity (Wong et al., 20 Dec 2025).
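
As an illustration of such a significance test (not the authors' evaluation code), a paired $t$-test over matched per-clip scores can be run with SciPy. The arrays below are synthetic stand-ins; note that FID itself is a set-level statistic, so a paired test presumably operates on per-clip perceptual scores:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic matched per-clip scores (illustration only, lower is better).
rife_scores = rng.normal(0.20, 0.05, size=200)
sg_rife_scores = rife_scores - rng.normal(0.02, 0.01, size=200)

t_stat, p_value = ttest_rel(sg_rife_scores, rife_scores)  # paired test
print(f"paired t = {t_stat:.3f}, p = {p_value:.4g}")      # significant if p < 0.01
```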

5. Ablation Analysis and Component Contributions

Ablation experiments on the SNU-Hard split demonstrate the effect of each architectural addition in terms of FID:

  • Base RIFE: FID = 23.32
  • + Split-FAPM only: FID = 19.8 (reduction of 3.5)
  • + DSF only: FID = 20.1 (reduction of 3.2)
  • + both adapters without $\mathcal{L}_{\text{sem}}$: FID = 18.4
  • Full SG-RIFE: FID = 17.896

These results indicate that (1) Split-FAPM restores fine-grained texture, (2) DSF corrects flow-induced semantic misalignment, and (3) the explicit semantic consistency loss $\mathcal{L}_{\text{sem}}$ guides FusionNet toward diffusion-level perceptual performance. Each component therefore contributes materially to the improvement over baseline RIFE.

6. Limitations and Prospects for Future Research

SG-RIFE’s validation is limited to SNU-FILM; further benchmarking on datasets such as UCF101 and Xiph is needed to support broader generalization claims. The approach also relies on a frozen IFNet, so catastrophic failures in optical flow estimation (e.g., under highly non-linear deformations) cannot be fully remediated through semantics alone. Potential directions include:

  • Dynamic fine-tuning or adaptation of IFNet during training or inference.
  • Integration of stronger or multi-scale ViT semantic priors.
  • Extension of semantic injection to multi-frame or variable rate interpolation schedules.

A plausible implication is that, while SG-RIFE closes the perceptual gap to diffusion models under typical flow regimes, emerging scenarios (e.g., extreme deformable motion) may necessitate more flow-adaptive or multi-scale semantic architectures.

7. Context and Significance within the Literature

SG-RIFE establishes a new paradigm for perceptual-quality video frame interpolation by combining computationally lightweight flow-based inference with dense, globally-aware vision transformer semantics. Unlike high-latency diffusion models such as Consec. BB and LDMVFI, SG-RIFE achieves comparable or superior FID and LPIPS at orders-of-magnitude faster inference speed. This parameter-efficient, plug-in semantic injection opens a viable path for advancing real-time and streaming VFI in settings previously limited by pure optical flow accuracy or computational resources (Wong et al., 20 Dec 2025).

References

Wong et al., "Semantic-Guided RIFE: Real-Time VFI via Semantics," 20 Dec 2025.
