Frame Interpolation Techniques
- Frame interpolation is the process of generating plausible intermediate frames between given video frames while addressing challenges such as occlusion, nonlinear motion, and illumination changes.
- Key methodologies include optical flow-based warping, kernel-based and deformable sampling, and generative diffusion models, each balancing quality, computation, and adaptability.
- Recent advances incorporate transformer-based attention, event-based sensor fusion, and adaptive sparsity to enhance temporal consistency and computational efficiency in diverse scenarios.
Frame interpolation is the process of synthesizing one or more intermediate video frames given a sequence of input frames sampled in time, typically with the purpose of increasing frame rate, generating slow-motion effects, or producing temporally coherent visual content in animation, editing, or computer vision pipelines. The core challenge of frame interpolation is the synthesis of visually and semantically plausible images at intermediate time steps, especially under large or complex motions, occlusions, illumination changes, and high-resolution constraints.
1. Fundamental Principles and Problem Formulation
Frame interpolation aims to reconstruct plausible content at non-existent temporal locations between observed frames. Formally, given two or more input frames, such as I_0 and I_1, the task is to synthesize an intermediate frame I_t at a time t ∈ (0, 1). The solution space is inherently ambiguous due to potential occlusions, non-linear motion trajectories, and information loss (e.g., motion blur, disocclusion). Modern approaches distinguish themselves by their model of motion (pixelwise dense flow, occlusion maps, or trajectory priors), their means of synthesizing or reconstructing the intermediate frame (pixel warping, kernel-based local sampling, CNN regression, generative or diffusion-based approaches), and their adaptation to arbitrary times or multi-frame upsampling.
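To make the formulation concrete, the simplest possible interpolator is a motion-unaware cross-fade. The following minimal NumPy sketch (illustrative, not any published method) shows the baseline that every technique below improves upon:

```python
import numpy as np

def crossfade(frame0: np.ndarray, frame1: np.ndarray, t: float) -> np.ndarray:
    """Naive motion-unaware baseline: I_t = (1 - t) * I_0 + t * I_1.

    Produces ghosting wherever objects move, which is exactly the
    failure mode that flow-, kernel-, and generative-based methods
    are designed to avoid.
    """
    assert 0.0 <= t <= 1.0
    return (1.0 - t) * frame0.astype(np.float64) + t * frame1.astype(np.float64)

# Midway between an all-black and an all-white frame is uniform gray.
f0 = np.zeros((4, 4, 3))
f1 = np.ones((4, 4, 3))
mid = crossfade(f0, f1, 0.5)
```

The ghosting this baseline produces on moving content is precisely the artifact that motion-compensated synthesis is designed to remove.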
Continuous-time interpolation has emerged as a key paradigm, moving beyond rigid N-way or midpoint-only synthesis, and allowing for user-specified or dynamically calculated intermediate timestamps (Zhang et al., 1 Oct 2025).
2. Major Algorithmic Paradigms
Frame interpolation techniques can be grouped into several overlapping classes, each with unique strengths and limitations:
a) Optical Flow-Based Warping
A large fraction of methods estimate bi- or multi-directional optical flow between frames, then warp source frames toward the target time t, potentially blending them using occlusion masks or learned fusion networks. Notable flow-based approaches include FILM (Reda et al., 2022), which unifies scale-agnostic flow estimation and pyramidal feature warping, and DQBC (Zhou et al., 2023), which addresses the receptive-field dependency in conventional cost-volume methods by densely querying correlations at high resolution, supporting small/fast object motion.
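The warp-and-blend core shared by flow-based methods can be sketched as follows, assuming a single grayscale channel, linear motion, and no occlusion handling; the function names and the flow-scaling convention are illustrative, not taken from any particular paper:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(frame, flow):
    """Sample `frame` at displaced locations:
    out[y, x] = frame[y + flow[y, x, 0], x + flow[y, x, 1]],
    with bilinear interpolation at sub-pixel coordinates."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = [ys + flow[..., 0], xs + flow[..., 1]]
    return map_coordinates(frame, coords, order=1, mode='nearest')

def interpolate_flow_based(f0, f1, flow01, t):
    """Linear-motion approximation: under constant velocity, the frame
    at time t samples f0 at p - t*F and f1 at p + (1-t)*F, where F is
    the 0->1 flow. Occlusion reasoning is deliberately omitted."""
    warped0 = backward_warp(f0, -t * flow01)
    warped1 = backward_warp(f1, (1.0 - t) * flow01)
    return (1.0 - t) * warped0 + t * warped1
```

Color frames would be warped per channel, and real methods replace the uniform blend with learned occlusion masks or fusion networks.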
b) Kernel-Based and Deformable Sampling
Kernel-based approaches predict spatially adaptive sampling filters (offsets and weights) per pixel and perform local aggregation over the input frames (Danier et al., 2021). Methods such as AdaCoF and spatially-varying deformable convolution generalize simple warping by learning motion-dependent local receptive fields, providing robustness to motion or texture uncertainty but risking excessive smoothing.
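The local aggregation at the heart of kernel-based synthesis can be sketched as a per-pixel weighted neighborhood sum. In learned methods (AdaCoF-style) the weights, and optionally spatial offsets, are predicted by a CNN; this illustrative NumPy version takes them as explicit inputs:

```python
import numpy as np

def kernel_sample(frame, weights, k=3):
    """Per-pixel adaptive sampling: each output pixel is a weighted sum
    over its own k x k neighborhood with its own weight vector.

    frame:   (H, W) grayscale image.
    weights: (H, W, k*k) per-pixel kernels, each summing to 1.
    """
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode='edge')
    out = np.zeros((h, w), dtype=np.float64)
    for idx in range(k * k):
        dy, dx = divmod(idx, k)
        # Shifted view of the padded frame aligned with neighbor (dy, dx).
        out += weights[..., idx] * padded[dy:dy + h, dx:dx + w]
    return out
```

A one-hot weight at the kernel center reproduces the input exactly; broad, diffuse weights are what produce the excessive smoothing noted above.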
c) Generative and Diffusion Models
Recent generative models, including diffusion transformers and latent variable models, have achieved visually plausible, multimodal, or interactive frame synthesis by sampling from a conditional distribution over sequences or leveraging powerful temporal embeddings. ArbInterp (Zhang et al., 1 Oct 2025) enables arbitrary timestamp interpolation and sequence duration by introducing timestamp-aware rotary position embedding; Framer (Wang et al., 2024) allows user-guided or automated trajectory control, incorporating both photorealism and explicit correspondence.
d) Plug-and-Play and Unsupervised Latent Interpolation
Autoencoding techniques such as DeepLLE (Nguyen et al., 2018) fit a low-dimensional latent representation to a short clip, enforce a linearity constraint, and synthesize frames by interpolating latent codes. This paradigm allows unsupervised, scene-adapted interpolation but struggles in low-texture or extremely large motion settings.
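A linear PCA "encoder" already demonstrates the latent-interpolation recipe (fit codes to a short clip, interpolate codes, decode); the sketch below is a hedged stand-in and not the DeepLLE architecture, which uses a learned nonlinear autoencoder:

```python
import numpy as np

def pca_codes(frames, dim):
    """Fit a linear 'autoencoder' (PCA via SVD) to flattened frames.
    Returns (codes, basis, mean); a learned encoder plays this role
    in latent-interpolation methods."""
    X = frames.reshape(len(frames), -1).astype(np.float64)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:dim]                      # (dim, H*W) principal axes
    codes = (X - mean) @ basis.T          # (N, dim) latent codes
    return codes, basis, mean

def decode_interp(codes, basis, mean, i, j, t, shape):
    """Linearly interpolate between latent codes i and j, then decode."""
    z = (1.0 - t) * codes[i] + t * codes[j]
    return (mean + z @ basis).reshape(shape)
```

When the clip's appearance changes roughly linearly in latent space, midpoints decode to plausible in-between frames; this is the linearity constraint such methods enforce during fitting.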
e) Event-Based and Sensor Fusion Methods
Frame-based interpolation is limited in scenarios with severe motion blur or ambiguity. Event-based approaches, such as TimeLens (Tulyakov et al., 2021), leverage the high temporal resolution of event cameras to estimate motion or supplement low-frame-rate video, utilizing hybrid warping and synthesis with per-pixel attention fusion.
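The raw signal such methods consume can be illustrated by accumulating polarity events into a log-brightness increment image over a sub-frame interval; the event tuple layout and contrast threshold below are assumptions for illustration, not the TimeLens interface:

```python
import numpy as np

def accumulate_events(events, shape, t0, t1, contrast=0.2):
    """Integrate polarity events (t, y, x, p), p in {-1, +1}, that fall
    in [t0, t1) into a log-brightness increment image: the high-rate
    signal that event-based interpolators fuse with RGB frames.
    `contrast` is the per-event log-intensity threshold (sensor-dependent).
    """
    img = np.zeros(shape, dtype=np.float64)
    for t, y, x, p in events:
        if t0 <= t < t1:
            img[y, x] += contrast * p
    return img
```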
f) Wavelet/Sparsity and Efficient Synthesis
To address the high computational burden of full-resolution frame synthesis, WaveletVFI (Kong et al., 2023) reconstructs only the significant coefficients in a wavelet domain using instance-specific sparsity masks learned via a classifier, thus tailoring computational cost adaptively to scene content.
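The underlying idea can be sketched with a one-level Haar transform and a fixed keep-ratio threshold standing in for the learned, instance-specific mask (an illustrative NumPy sketch, not the WaveletVFI classifier):

```python
import numpy as np

def haar2d(x):
    """One-level 2-D Haar transform; returns (LL, LH, HL, HH) subbands."""
    a = (x[0::2] + x[1::2]) / 2                      # row averages
    d = (x[0::2] - x[1::2]) / 2                      # row details
    ll, lh = (a[:, 0::2] + a[:, 1::2]) / 2, (a[:, 0::2] - a[:, 1::2]) / 2
    hl, hh = (d[:, 0::2] + d[:, 1::2]) / 2, (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    h2, w2 = ll.shape
    a = np.empty((h2, 2 * w2)); d = np.empty((h2, 2 * w2))
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((2 * h2, 2 * w2))
    x[0::2], x[1::2] = a + d, a - d
    return x

def sparse_reconstruct(x, keep=0.1):
    """Zero all but the largest-magnitude detail coefficients before
    inverting: a static stand-in for a learned sparsity mask."""
    ll, lh, hl, hh = haar2d(x)
    details = np.concatenate([s.ravel() for s in (lh, hl, hh)])
    thresh = np.quantile(np.abs(details), 1.0 - keep)
    lh, hl, hh = (np.where(np.abs(s) >= thresh, s, 0.0) for s in (lh, hl, hh))
    return ihaar2d(ll, lh, hl, hh)
```

With `keep=1.0` the reconstruction is exact; smaller keep ratios skip detail coefficients, which is where the adaptive compute savings come from.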
3. Modeling Motion: Challenges and Solutions
Accurate motion modeling is central to high-fidelity interpolation and must address challenges such as:
- Large or Nonlinear Motion: Models like FILM use shared, scale-agnostic convolution across all levels of the pyramid and Gram-matrix-based style losses for large disocclusions (Reda et al., 2022). VFIformer (Lu et al., 2022) augments flow estimation and warped feature fusion with transformer-based, cross-scale attention, explicitly expanding receptive fields for long-range motion.
- Occlusion and Disocclusion Handling: Fusing bi-directional warps and computing occlusion masks are standard, but higher-fidelity approaches predict source contribution masks (Li et al., 2021) or use per-pixel attention (Tulyakov et al., 2021; Li et al., 2021).
- Texture and Content Awareness: Texture-aware frameworks train separate models per texture class, demonstrating empirically that static, dynamic continuous, and dynamic discrete scenes require different motion representations (Danier et al., 2021).
- Uncertain or Unavailable Temporal Priors: Some algorithms, e.g., UTI-VFI (Zhang et al., 2021), learn kinematic trajectory formulas that do not require known exposure or interval time, restoring “key-states” from blurry frames via learned residuals and fitting quadratic motion models even when temporal metadata is unavailable.
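The quadratic-motion idea in the last point can be sketched directly as a least-squares fit to recovered key states; the function names and inputs are illustrative, not the UTI-VFI interface:

```python
import numpy as np

def fit_quadratic_trajectory(times, positions):
    """Fit x(t) = a*t^2 + b*t + c to observed key-state positions.
    The times need not be uniformly spaced, mirroring the setting
    where exposure and interval metadata are unavailable."""
    return np.polyfit(times, positions, deg=2)

def eval_trajectory(coeffs, t):
    """Evaluate the fitted trajectory at an arbitrary time t."""
    return np.polyval(coeffs, t)
```

Three key states determine the quadratic exactly; with more observations the fit becomes a denoising regression over the trajectory.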
4. Architectural and Computational Innovations
Several architectures have been proposed to balance quality, efficiency, and scalability:
- Splatting-Based Synthesis: Methods such as softmax splatting (Niklaus et al., 2022) and many-to-many splatting (Hu et al., 2023) move away from per-frame CNN inference, instead employing differentiable splat/warp/merge primitives whose computational cost is nearly independent of upsampling factor or frame count. This enables high-resolution and real-time multi-frame interpolation with minimal overhead.
- Iterative Fusion: The iterative spatial-temporal refinement as in SMIF (Li et al., 2021) fuses structure-based and motion-based candidates, then refines over several cycles, achieving sharper geometry and more faithful object edges, particularly on challenging foreground regions.
- Attention and Transformer Mechanisms: Cross-scale window-based self-attention in VFIformer (Lu et al., 2022) improves on the locality of pure CNNs, and dense querying in DQBC (Zhou et al., 2023) eliminates the error propagation associated with coarse-to-fine cost volumes.
- Adaptive Sparsity: Dynamic threshold prediction for wavelet coefficient computation in WaveletVFI (Kong et al., 2023) yields a marked computational reduction (up to 40% FLOPs savings on animation and 4K data), without significant accuracy loss.
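The splatting primitive from the first point above can be sketched as a weighted scatter-add; this nearest-neighbor NumPy version is illustrative only (softmax splatting proper uses bilinear splat weights and GPU kernels):

```python
import numpy as np

def softmax_splat(frame, flow, importance):
    """Forward-warp `frame` (H, W) along `flow` (H, W, 2), resolving
    collisions by softmax-normalizing a per-pixel `importance` map
    (e.g. inverse depth), so nearer content wins at overlaps."""
    h, w = frame.shape
    num = np.zeros((h, w)); den = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.rint(ys + flow[..., 0]).astype(int)   # target rows
    tx = np.rint(xs + flow[..., 1]).astype(int)   # target cols
    valid = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    wgt = np.exp(importance)                      # softmax numerators
    np.add.at(num, (ty[valid], tx[valid]), (wgt * frame)[valid])
    np.add.at(den, (ty[valid], tx[valid]), wgt[valid])
    return np.where(den > 0, num / np.maximum(den, 1e-8), 0.0)
```

Because the scatter cost depends only on the source resolution, synthesizing many intermediate frames reuses the same primitive with rescaled flows, which is the efficiency argument made for splatting-based pipelines.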
5. Evaluation Protocols and Results
Performance is typically assessed on standard video datasets (Vimeo-90K, UCF101, Middlebury, DAVIS, Xiph-4K, SNU-FILM, GoPro, Adobe240) and metrics including PSNR, SSIM, LPIPS, FID, and FVD; additional studies report perceptual/user studies and specific error maps (interpolation error, IE) to assess temporal or spatial artifacts.
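Of these metrics, PSNR is the simplest to state exactly; a minimal reference implementation for 8-bit frames (higher is better, infinite for a perfect match):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a ground-truth frame
    and an interpolated one: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)
```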
The following table illustrates representative results for selected methods:
| Method | Vimeo-90K PSNR | UCF101 PSNR | SNU-FILM Easy PSNR | Middlebury IE↓ | Key Efficiency/Specialization |
|---|---|---|---|---|---|
| FILM (Reda et al., 2022) | 35.87 | – | – | – | Scale-agnostic net, large motion, fast |
| VFIformer (Lu et al., 2022) | 36.50 | 35.43 | 40.13 | 1.82 | Transformer, cross-scale attention |
| DQBC (Zhou et al., 2023) | 36.57 | 35.44 | 40.31 | 1.78 | Densely Queried Correlation |
| MA-VFI (Han et al., 2024) | 35.96 | 35.31 | – | 1.91 | Hierarchical flow, real-time |
| SMIF (Li et al., 2021) | 35.58 | 35.24 | – | 1.92 | Structure-motion, iterative fusion |
| TAFI (Danier et al., 2021) | ≈28.5 | – | – | – | Texture-aware specialist models |
| WaveletVFI (Kong et al., 2023) | 35.58 | – | – | – | Sparse wavelet, dynamic sparsity |
High-quality interpolation is characterized by the ability to reconstruct sharp foregrounds, maintain temporal consistency, reduce ghosting/tearing, and balance perceptual quality (LPIPS, FID) with distortion-based metrics (PSNR/SSIM).
6. Specialized Paradigms and User-Guided Control
Recent lines of research broaden the scope of frame interpolation:
- Time-Arbitrary, Length-Arbitrary Interpolation: ArbInterp supports truly arbitrary timestamps by reformulating positional encoding in the diffusion transformer, also introducing decoupled segment-wise appearance/motion tokens to ensure continuity (Zhang et al., 1 Oct 2025).
- Interactive and Keypoint-Guided Synthesis: Framer allows explicit user intervention to guide keypoint trajectories, enabling fine semantic or geometric alignment between nonlinearly related frames (Wang et al., 2024).
- Event-Based Cues for High-Dynamic Scenarios: TimeLens employs hybrid event/RGB fusion, exploiting the fact that event cameras provide microsecond-level temporal fidelity, crucial for resolving rapid or ambiguous motions (Tulyakov et al., 2021).
- Integration with Upsampling and Deblurring: BIN combines deblurring and interpolation, cycling through pyramid modules with inter-pyramid recurrence for temporal smoothness in jointly blurry/low-frame-rate videos (Shen et al., 2020).
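The continuous-timestamp positional-encoding idea behind time-arbitrary interpolation can be sketched by rotating feature pairs through angles proportional to a real-valued t; the frequency schedule and pairing convention below are illustrative, not the published ArbInterp embedding:

```python
import numpy as np

def rotary_embed(x, t, base=10000.0):
    """Rotate consecutive feature pairs of x (..., 2k) by angles
    t * freq_i, so a *continuous* timestamp t is encoded directly.
    Rotations preserve norms, and relative attention scores between
    two embedded vectors depend only on their timestamp difference."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) / (d // 2))
    ang = t * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because t enters only through the rotation angles, any real-valued timestamp is admissible, removing the midpoint-only restriction of discrete positional indices.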
7. Limitations, Open Challenges, and Research Directions
Despite substantial advances, the field contends with unresolved questions:
- Motion Extrapolation and Disocclusion: Even advanced flow or sampling methods can struggle to hallucinate plausible content for newly revealed regions.
- Occlusion and Depth Ambiguity: Accurate handling of transparency, layered motion, and occluded foregrounds remains challenging without explicit 3D or scene reasoning.
- Computational Cost vs. Quality: Transformer and generative models deliver perceptual gains but introduce nontrivial inference overhead. Adaptive sparsity and explicit splatting address this for specific regimes.
- Generalization to Real-World Data: Changes in exposure, frame rates, sensor noise, and varying priors make robust, plug-and-play interpolation difficult (Zhang et al., 2021).
- Texture- and Context-Awareness: Explicit handling of scene semantics, textures, or temporal irregularities (animation, cartoons) is still nascent.
- Frame Consistency in Extended Sequences: Models trained for N=2 often do not scale gracefully to long multi-frame insertions; recursive interpolation can amplify artifacts.
Future research directions include: continuous-time generative models with global coherence; unified architectures for deblurring, upsampling, and interpolation; domain-adaptive specialist networks; principled uncertainty estimation; and efficient, real-time solutions for high-resolution or streaming applications.