
Semantic-Guided RIFE: Real-Time VFI via Semantics

Updated 27 December 2025
  • The paper introduces SG-RIFE, which augments a frozen RIFE backbone with DINOv3 semantics to achieve diffusion-competitive perceptual quality at near real-time speeds.
  • The methodology employs Split-FAPM and DSF modules to compress, align, and inject semantic features, significantly improving robustness against large motions and occlusions.
  • Parameter-efficient fine-tuning updates only 16% of the model parameters, achieving substantial FID improvements over baseline methods on benchmark datasets.

Semantic-Guided RIFE (SG-RIFE) is a video frame interpolation (VFI) methodology that augments a frozen, flow-based RIFE backbone with dense, frozen semantic priors extracted from a DINOv3 Vision Transformer. SG-RIFE is designed to bridge the gap between the high throughput of flow-based methods and the superior perceptual quality achieved by recent diffusion models, enabling diffusion-competitive perceptual fidelity at near real-time speeds. It achieves this via parameter-efficient fine-tuning: semantic features are projected, warped, fused, and injected at key stages in the RIFE pipeline, significantly improving robustness to large motion and occlusions while maintaining computational efficiency (Wong et al., 20 Dec 2025).

1. Architecture and Module Overview

SG-RIFE operates as a two-stage pipeline rooted in the RIFE framework, introducing semantic feature injection at the FusionNet stage. The major components are:

  • IFNet (frozen): Computes coarse bidirectional flows $F_{t\rightarrow 0}, F_{t\rightarrow 1}$ from input frames $I_0, I_1$, and generates a coarse interpolated image $\tilde{I}_{\text{coarse}}$ via warping.
  • ContextNet (frozen): Extracts multi-scale contextual features $F_{\text{ctx}}$ for both input frames.
  • DINOv3 (frozen, Small variant): Provides intermediate semantic features $D_0, D_1$ from input images at layers 8 and 11.
  • Split-Fidelity Aware Projection Module (Split-FAPM): Compresses 384-dimensional DINO features to task-adapted 256-dimensional features, followed by a flow-based warp and post-warp refinement.
  • Deformable Semantic Fusion (DSF): Performs flow-guided alignment of semantic features, correcting misalignments caused by occlusions and large displacements.
  • FusionNet: Receives $\tilde{I}_{\text{coarse}}$, $F_{\text{ctx}}$, and DSF-aligned semantics to predict the high-frequency residual $\Delta$, forming the final output $I_t = \tilde{I}_{\text{coarse}} + \Delta$.

The high-level data flow can be summarized as:

| Component | Input | Output / Role |
|---|---|---|
| IFNet | $I_0, I_1$ | $F_{t\rightarrow 0}, F_{t\rightarrow 1}$, $\tilde{I}_{\text{coarse}}$ |
| ContextNet | $I_0, I_1$ | $F_{\text{ctx}}$ |
| DINOv3 | $I_0, I_1$ | $D_0, D_1$ (semantic embeddings) |
| Split-FAPM | $D_0, D_1$ | Compressed and refined semantic features $\hat{D}$ |
| DSF | $F_{\text{ctx}}, \hat{D}$ | Aligned semantic features $F_{\text{aligned}}$ |
| FusionNet | Multiple inputs | Residual $\Delta$ and final output $I_t$ |

This modular structure facilitates efficient semantic injection, with both IFNet and ContextNet remaining frozen to constrain memory and computational cost (Wong et al., 20 Dec 2025).
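
The following minimal PyTorch-style sketch illustrates this data flow. All module handles (ifnet, split_fapm, etc.) and their call signatures are placeholders assumed for illustration, not the authors' released interfaces:

```python
import torch

def sg_rife_forward(i0, i1, ifnet, contextnet, dino, split_fapm, dsf, fusionnet):
    """Sketch of the SG-RIFE two-stage pipeline (module interfaces are assumed)."""
    with torch.no_grad():                       # IFNet, ContextNet, DINOv3 stay frozen
        f_t0, f_t1, i_coarse = ifnet(i0, i1)    # bidirectional flows + coarse frame
        f_ctx = contextnet(i0, i1)              # multi-scale context features
        d0, d1 = dino(i0), dino(i1)             # 384-D semantics from layers 8 and 11
    d_hat = split_fapm(d0, d1, f_t0, f_t1)      # compress to 256-D, warp, refine
    f_aligned = dsf(f_ctx, d_hat)               # flow-guided deformable alignment
    delta = fusionnet(i_coarse, f_ctx, f_aligned)  # high-frequency residual
    return i_coarse + delta                     # I_t = I_coarse + delta
```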

2. Semantic Feature Processing: Split-FAPM and DSF

Split-Fidelity Aware Projection Module (Split-FAPM)

Split-FAPM consists of two subcomponents:

  • Pre-Warp Compression: DINO features $D_{\text{raw}} \in \mathbb{R}^{H \times W \times 384}$ are reduced via $1\times 1$ convolutions:
    • Feature pathway: $F_{\text{feat}} = \phi_f(D_{\text{raw}})$ (to $256$ channels).
    • Affine pathway: $[\gamma, \beta] = \phi_m(D_{\text{raw}})$ (both $256$ channels).
    • Combination: $F_{\text{sem}'} = \gamma \odot F_{\text{feat}} + \beta$.
  • Post-Warp Refinement: After warping $F_{\text{sem}'}$ using the IFNet flows, a residual SE–GELU block refines the output:
    • $F_{\text{sem}''} = \Phi_{\text{ref}}(\mathcal{W}(F_{\text{sem}'}, F_{t\rightarrow 0}))$.

This sequence adapts semantic channels for the video interpolation task and corrects spatial disruptions from warping.
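
A compact sketch of Split-FAPM under these definitions is given below. The $1\times 1$ pathway widths follow the paper (384 → 256); the exact layout of the residual SE–GELU refinement block and the warp function are assumptions:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention used inside the (assumed) refinement block."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.GELU(),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> channel gates
        return x * w[:, :, None, None]

class SplitFAPM(nn.Module):
    """Sketch of Split-FAPM: pre-warp compression + post-warp refinement."""
    def __init__(self, in_ch=384, out_ch=256):
        super().__init__()
        self.phi_f = nn.Conv2d(in_ch, out_ch, 1)      # feature pathway (384 -> 256)
        self.phi_m = nn.Conv2d(in_ch, 2 * out_ch, 1)  # affine pathway -> [gamma, beta]
        self.refine = nn.Sequential(                  # SE-GELU block (layout assumed)
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU(),
            SqueezeExcite(out_ch))

    def forward(self, d_raw, flow, warp_fn):
        f_feat = self.phi_f(d_raw)
        gamma, beta = self.phi_m(d_raw).chunk(2, dim=1)
        f_sem = gamma * f_feat + beta                 # F_sem' = gamma * F_feat + beta
        warped = warp_fn(f_sem, flow)                 # backward warp with IFNet flow
        return warped + self.refine(warped)           # residual post-warp refinement
```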

Deformable Semantic Fusion (DSF)

DSF aligns semantic features to pixel-level motion through learned, soft, flow-guided offsets:

  • Compute warping-and-refinement: $\hat{D}_{t \gets i} = \Phi_{\text{ref}}(\mathcal{W}(D_i, F_{t\rightarrow i}))$, $i \in \{0, 1\}$.
  • Project semantic/context features to Q/K/V: $Q = \phi_q(F_{\text{ctx}})$, $K = \phi_k(\hat{D})$, $V = \phi_v(\hat{D})$.
  • Predict offsets $\Delta p$ and modulations $\Delta m$ with a $1\times 1$ convolution over $[Q, K]$.
  • Aggregate via grouped deformable convolution:

$$F_{\text{aligned}}(p) = \gamma \cdot \sum_{k=1}^{9} W_k \cdot V(p + p_k + \Delta p_k) \cdot \Delta m_k,$$

where $G = 4$ or $8$ groups are used at two FusionNet scales.

DSF corrects rigid misalignments, particularly enhancing interpolation under occlusion and non-linear deformations (Wong et al., 20 Dec 2025).
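
This aggregation maps naturally onto modulated deformable convolution. The sketch below uses torchvision.ops.deform_conv2d; the channel widths, zero-initialized output scale $\gamma$, and head layout are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DSF(nn.Module):
    """Sketch of Deformable Semantic Fusion with G offset groups (interfaces assumed)."""
    def __init__(self, ctx_ch, sem_ch, groups=4, k=3):
        super().__init__()
        self.k, self.groups = k, groups
        self.phi_q = nn.Conv2d(ctx_ch, sem_ch, 1)   # query from context features
        self.phi_k = nn.Conv2d(sem_ch, sem_ch, 1)   # key from warped semantics
        self.phi_v = nn.Conv2d(sem_ch, sem_ch, 1)   # value from warped semantics
        # 1x1 conv over [Q, K] predicts 2*k*k*G offsets plus k*k*G modulation masks
        self.offset_head = nn.Conv2d(2 * sem_ch, 3 * k * k * groups, 1)
        self.weight = nn.Parameter(torch.randn(sem_ch, sem_ch // groups, k, k) * 0.01)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable scale (zero-init assumed)

    def forward(self, f_ctx, d_hat):
        q, key, v = self.phi_q(f_ctx), self.phi_k(d_hat), self.phi_v(d_hat)
        off_mask = self.offset_head(torch.cat([q, key], dim=1))
        n_off = 2 * self.k * self.k * self.groups
        offset = off_mask[:, :n_off]                # Delta p, per group and tap
        mask = off_mask[:, n_off:].sigmoid()        # Delta m, in (0, 1)
        out = deform_conv2d(v, offset, self.weight, padding=self.k // 2, mask=mask)
        return self.gamma * out                     # F_aligned
```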

3. Training Protocol, Fine-Tuning, and Losses

SG-RIFE employs a parameter-efficient fine-tuning paradigm, freezing all heavy compute backbones and updating only $16\%$ ($\sim$5.6M of 34.4M) of the total parameters. The trainable modules are Split-FAPM, DSF, the FusionNet injection layers, and the FusionNet base (the latter unfrozen only after an initial warm-up stage).

Training Details:

  • Dataset: Vimeo90K train split (51,312 triplets, $448\times 256$ resolution), with random crop, flipping, and temporal reversal augmentations.
  • Schedule: Two stages on a single NVIDIA L4 GPU:
    • Stage 1 (5 epochs): Only Split-FAPM and DSF are trained.
    • Stage 2 (25 subsequent epochs): FusionNet base is unfrozen; all adapters are trained jointly.
  • Optimization: AdamW optimizer, initial learning rate $2\times 10^{-4}$ with cosine annealing, batch size 64, and gradient clipping (max norm 1.0); see the configuration sketch below.
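
A minimal sketch of the staged freezing and optimizer setup, assuming module attribute names (split_fapm, dsf, fusionnet) that are illustrative rather than taken from released code:

```python
import torch

def configure_stage(model, stage, epochs):
    """Stage 1 (5 epochs): train Split-FAPM + DSF only.
    Stage 2 (25 epochs): additionally unfreeze the FusionNet base."""
    for p in model.parameters():
        p.requires_grad = False                 # freeze everything by default
    modules = [model.split_fapm, model.dsf]     # always-trainable adapters
    if stage == 2:
        modules.append(model.fusionnet)         # unfreeze FusionNet in stage 2
    params = []
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
            params.append(p)
    opt = torch.optim.AdamW(params, lr=2e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return params, opt, sched

# Usage: params, opt, sched = configure_stage(model, stage=1, epochs=5)
# Inside the training loop, clip gradients to unit norm:
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
```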

Loss Function:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{dis}} \mathcal{L}_{\text{dis}} + \lambda_{\text{tea}} \mathcal{L}_{\text{tea}} + \lambda_{\text{sem}} \mathcal{L}_{\text{sem}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}}.$$

  • $\mathcal{L}_{\text{rec}}$: Laplacian-pyramid reconstruction loss (inherited from RIFE)
  • $\mathcal{L}_{\text{tea}}$: Student-teacher distillation loss
  • $\mathcal{L}_{\text{sem}} = \|D(I_{\text{pred}}) - D(I_{\text{gt}})\|_1$: Semantic consistency loss, penalizing divergence between predicted and ground-truth semantic features
  • $\mathcal{L}_{\text{reg}}$: $L_1$ regularization on DSF offsets across scales $S_2$, $S_3$
  • Weights: $\lambda_{\text{rec}} = 0.1$, $\lambda_{\text{dis}} = 0.01$, $\lambda_{\text{tea}} = 0.1$, $\lambda_{\text{sem}} = 0.5$, $\lambda_{\text{reg}} = 1\times 10^{-4}$

This combination encourages perceptually faithful output while maintaining semantic and spatial alignment.
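
A sketch of the weighted loss combination follows. The reconstruction, distillation, and teacher terms are passed in as assumed callables (their internals are not reproduced here); the semantic and offset-regularization terms are computed inline:

```python
import torch
import torch.nn.functional as F

# Loss weights from the paper; loss-term internals are assumed.
LAMBDAS = dict(rec=0.1, dis=0.01, tea=0.1, sem=0.5, reg=1e-4)

def total_loss(pred, gt, dino, dsf_offsets, l_rec, l_dis, l_tea):
    """Weighted sum of the five SG-RIFE training losses (sketch)."""
    loss = (LAMBDAS["rec"] * l_rec(pred, gt)        # Laplacian-pyramid reconstruction
            + LAMBDAS["dis"] * l_dis(pred, gt)      # distillation term
            + LAMBDAS["tea"] * l_tea(pred, gt))     # student-teacher term
    # Semantic consistency: L1 between frozen-DINO features of pred and gt
    loss = loss + LAMBDAS["sem"] * F.l1_loss(dino(pred), dino(gt))
    # L1 regularization on DSF offsets at scales S2 and S3
    loss = loss + LAMBDAS["reg"] * sum(o.abs().mean() for o in dsf_offsets)
    return loss
```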

4. Quantitative and Qualitative Performance

Extensive evaluation on SNU-FILM (Easy, Medium, Hard, Extreme splits) reveals that SG-RIFE delivers state-of-the-art FID, LPIPS, and runtime performance among near-real-time methods. Principal results:

| Method | Easy FID | Medium FID | Hard FID | Extreme FID | Time (s) |
|---|---|---|---|---|---|
| RIFE | 5.456 | 10.833 | 23.320 | 47.458 | 0.01 |
| LDMVFI | 5.752 | 12.485 | 26.520 | 47.042 | 22.32 |
| Consec. BB | 4.791 | 9.039 | 18.589 | 36.631 | 2.60 |
| SG-RIFE | 4.557 | 8.678 | 17.896 | 36.272 | 0.05 |

Statistical analysis (paired $t$-tests) confirms that FID improvements for SG-RIFE over RIFE and LDMVFI are significant ($p < 0.01$ on the Medium/Hard splits). Qualitatively, SG-RIFE preserves object identity and fine structures in highly occluded or non-linearly moving regions, where standard RIFE produces ghosting or unnatural artifacts. In such cases, DSF realigns semantic patches and Split-FAPM recovers texture fidelity (Wong et al., 20 Dec 2025).
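
As an illustration of such a significance test (not the authors' evaluation code), a paired $t$-test over matched per-clip scores can be run with SciPy. The arrays below are synthetic stand-ins; note that FID itself is a set-level statistic, so a paired test presumably operates on per-clip perceptual scores:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic matched per-clip scores (illustration only, lower is better).
rife_scores = rng.normal(0.20, 0.05, size=200)
sg_rife_scores = rife_scores - rng.normal(0.02, 0.01, size=200)

t_stat, p_value = ttest_rel(sg_rife_scores, rife_scores)  # paired test
print(f"paired t = {t_stat:.3f}, p = {p_value:.4g}")      # significant if p < 0.01
```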

5. Ablation Analysis and Component Contributions

Ablation experiments on the SNU-Hard split demonstrate the effect of each architectural addition in terms of FID:

  • Base RIFE: FID = 23.32
  • + Split-FAPM only: FID = 19.8 (reduction of 3.5)
  • + DSF only: FID = 20.1 (reduction of 3.2)
  • + both adapters without $\mathcal{L}_{\text{sem}}$: FID = 18.4
  • Full SG-RIFE: FID = 17.896

These results indicate that (1) Split-FAPM restores fine-grained texture, (2) DSF corrects flow-induced semantic misalignment, and (3) the explicit semantic consistency loss $\mathcal{L}_{\text{sem}}$ guides FusionNet toward diffusion-level perceptual performance. Each component therefore contributes materially to the improvement over baseline RIFE.

6. Limitations and Prospects for Future Research

SG-RIFE’s validation is limited to SNU-FILM; further benchmarking on datasets such as UCF101 and Xiph is needed to support broader generalization claims. The approach also relies on a frozen IFNet, so catastrophic failures in optical flow estimation (e.g., under highly non-linear deformations) cannot be fully remediated through semantics alone. Potential directions include:

  • Dynamic fine-tuning or adaptation of IFNet during training or inference.
  • Integration of stronger or multi-scale ViT semantic priors.
  • Extension of semantic injection to multi-frame or variable rate interpolation schedules.

A plausible implication is that, while SG-RIFE closes the perceptual gap to diffusion models under typical flow regimes, emerging scenarios (e.g., extreme deformable motion) may necessitate more flow-adaptive or multi-scale semantic architectures.

7. Context and Significance within the Literature

SG-RIFE establishes a new paradigm for perceptual-quality video frame interpolation by combining computationally lightweight flow-based inference with dense, globally-aware vision transformer semantics. Unlike high-latency diffusion models such as Consec. BB and LDMVFI, SG-RIFE achieves comparable or superior FID and LPIPS at orders-of-magnitude faster inference speed. This parameter-efficient, plug-in semantic injection opens a viable path for advancing real-time and streaming VFI in settings previously limited by pure optical flow accuracy or computational resources (Wong et al., 20 Dec 2025).

References

Wong et al., "Semantic-Guided RIFE: Real-Time VFI via Semantics," 20 Dec 2025.
