Occluded Video Prediction Techniques

Updated 28 November 2025
  • Occluded video prediction is defined as generating plausible future frames when parts of a scene are hidden, requiring both propagation of visible data and novel inpainting.
  • Techniques employ explicit occlusion masks, learned gating, motion cues, depth maps, and object-centric dynamics to ensure continuity and physical consistency.
  • Recent models fuse warped features with generated content via analytic, learned, or attention-based strategies while optimizing adversarial, reconstruction, and latent losses.

Occluded video prediction refers to the generation of plausible future video frames in scenarios where some scene content becomes partially or completely invisible due to occlusion. This task encompasses both the reliable propagation of visible information and the physically and semantically consistent hallucination or inpainting of dis-occluded (newly revealed) regions. Occlusion handling fundamentally differentiates video prediction from standard frame synthesis, as it requires models to reason about object permanence, motion continuity, and the generation of novel content at occlusion boundaries. Techniques for occluded video prediction exploit explicit motion cues (optical/point flow, depth), analytic or learned occlusion masks, object-centric dynamics, and both deterministic and stochastic generative frameworks.

1. Explicit and Implicit Occlusion Handling Paradigms

Approaches to occluded video prediction fall into two primary paradigms: explicit occlusion modeling and implicit gating.

Explicit occlusion modeling includes methods such as "Disentangling Propagation and Generation" (DPG) (Gao et al., 2018), which computes analytic, per-frame occlusion masks by propagating pixel energies under predicted backward flow. These binary masks isolate regions where optical warping from previous frames is reliable (non-occluded) or unreliable (occluded). The model fuses a flow-based warping branch for visible pixels with a context inpainting branch dedicated to filling occluded regions, using the mask to mediate the combination:

\hat{x}_t(i,j) = \hat{m}_t(i,j)\,\tilde{x}_t(i,j) + (1-\hat{m}_t(i,j))\,\hat{y}_t(i,j).

This strict separation enables distinct architectures and losses for each branch, controlled by the analytically derived mask.
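The fusion rule above is simple to express in code. The following is a minimal PyTorch sketch of mask-mediated fusion under assumed tensor shapes; the branch networks that produce the warped frame, the inpainted frame, and the analytic mask are outside its scope, and the function name is illustrative rather than DPG's implementation.

```python
import torch

def fuse_warped_and_inpainted(mask: torch.Tensor,
                              warped: torch.Tensor,
                              inpainted: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: combine a flow-warped frame with an inpainted frame.

    mask:      (B, 1, H, W), 1 where warping from the previous frame is reliable
    warped:    (B, 3, H, W), frame warped by the predicted backward flow
    inpainted: (B, 3, H, W), output of the context inpainting branch
    """
    # Visible pixels come from the warp; occluded pixels come from the inpainter.
    return mask * warped + (1.0 - mask) * inpainted
```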

Implicit occlusion handling is exemplified by architectures such as the Transformation-based Spatial Recurrent Unit (TSRU) (Luc et al., 2020). Here, warping, refinement, and gating are fused into each recurrent cell without an explicit occlusion mask. The learned gating unit u interpolates between the motion-warped hidden state and freshly generated content, down-weighting unreliable warped features, particularly at occlusion and dis-occlusion boundaries:

h_t = u \odot \tilde{h}_{t-1} + (1-u) \odot c.

The gate is trained end-to-end under adversarial and reconstruction losses, automatically learning to re-generate new pixels where needed. Visualization of (1-u) reveals that it functions as an implicit soft occlusion mask.
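As an illustration of implicit gating, the sketch below folds warping and generation into a single recurrent update. The gate network, channel sizes, and class name are expository assumptions, not the published TSRU architecture; the point is that (1 - u) emerges as a soft occlusion mask without supervision.

```python
import torch
import torch.nn as nn

class GatedWarpCell(nn.Module):
    """Toy recurrent cell gating between a warped hidden state and new content."""

    def __init__(self, channels: int):
        super().__init__()
        # The gate u is predicted jointly from the warped state and fresh content.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, warped_prev: torch.Tensor, content: torch.Tensor) -> torch.Tensor:
        u = self.gate(torch.cat([warped_prev, content], dim=1))
        # h_t = u * warped previous state + (1 - u) * freshly generated content;
        # (1 - u) acts as an implicit soft occlusion mask.
        return u * warped_prev + (1.0 - u) * content
```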

More recent models such as SVPHW (Kotoyori et al., 4 Dec 2024) employ multiple parallel pathways (forward/backward warping and appearance-based inpainting) together with a mask decoder that is trained without explicit mask supervision and learns to identify the spatial regions where warping fails and inpainting is required, effectively inferring soft occlusion masks from data.
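A rough sketch of this kind of learned fusion is shown below: a small convolutional head (standing in for SVPHW's MobileNet-based decoder, which is not reproduced here) predicts softmax-normalized per-pixel weights over the three candidate frames and mixes them as a convex combination.

```python
import torch
import torch.nn as nn

class FusionMaskDecoder(nn.Module):
    """Sketch: predict per-pixel softmax weights over three sources and fuse them."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),  # one logit per source
        )

    def forward(self, features, fwd_warp, bwd_warp, inpainted):
        masks = torch.softmax(self.head(features), dim=1)          # (B, 3, H, W)
        m_fw, m_bw, m_p = masks[:, 0:1], masks[:, 1:2], masks[:, 2:3]
        # Convex combination: the learned masks route each pixel to the most
        # reliable source (warping where valid, inpainting where it fails).
        return m_fw * fwd_warp + m_bw * bwd_warp + m_p * inpainted
```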

2. Motion, Depth, and Object-Centric Structure for Occlusion Reasoning

Incorporating explicit motion and structural information is essential for robust occluded prediction. The integration of pointwise flow and depth-maps, as in "Flow and Depth Assisted Video Prediction with Latent Transformer" (Suleyman et al., 20 Nov 2025), offers complementary modalities for resolving both the motion of visible objects and the reappearance trajectory of occluded instances. Object-centric latent transformers (e.g., SCAT) encode each instance separately, with point-flow registering object trajectories through occlusion, and depth facilitating correct depth ordering:

  • Point-flow tensors encode tracked keypoint displacements and visibilities, supporting state continuity across occlusions (see the roll-out sketch after this list).
  • Depth-maps provide cues for occlusion ordering and background reconstruction, particularly beneficial when multiple objects interact or are fully obscured.
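To make the point-flow idea concrete, the sketch below rolls tracked keypoints forward through an occlusion using a constant-velocity fallback; the array layout and the extrapolation rule are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def roll_out_keypoints(init_xy, flows, visible):
    """Illustrative roll-out of tracked keypoints through occlusion.

    init_xy: (K, 2) starting positions; flows: (T, K, 2) tracked displacements,
    meaningful only where visible; visible: (T, K) boolean visibility flags.
    Occluded keypoints are extrapolated with their last observed displacement so
    object state stays continuous until the point reappears."""
    pos = init_xy.astype(np.float64)
    last_step = np.zeros_like(pos)
    trajectory = [pos.copy()]
    for t in range(flows.shape[0]):
        vis = visible[t][:, None]                       # (K, 1)
        step = np.where(vis, flows[t], last_step)       # observed flow or extrapolation
        last_step = np.where(vis, flows[t], last_step)  # remember last reliable motion
        pos = pos + step
        trajectory.append(pos.copy())
    return np.stack(trajectory)                         # (T + 1, K, 2)
```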

Similarly, "Occlusion resistant learning of intuitive physics from videos" (Riochet et al., 2020) leverages a probabilistic object-centric state space, where each object's latent position, velocity, and attributes are updated via a recurrent interaction network (dynamics prior) and projected to the pixel space via an occlusion-aware differentiable renderer. The renderer computes soft-min depth compositing to correctly assign foreground and background pixels during occlusion events, enforcing object permanence via latent state roll-out. This explicit factorization between scene latent variables and image rendering enables realistic prediction of object masks and trajectories through severe occlusions.

In time-conditioned slot-attention models (Gao et al., 2023), slot-based object centricity is combined with Transformer-based inference over view latents, allowing reconstructive prediction of the entire object shape—occluded or not—by leveraging multi-view temporal information and Gaussian process priors for time continuity.

3. Generator Architectures and Fusion Strategies for Predicted Content

A central challenge in occluded video prediction is the fusion of warped (propagated) information with hallucinated (newly generated) content. Methods span a range from analytic gating, to learned soft masks, to attention-based slot mixtures.

  • Analytic gating: As in DPG (Gao et al., 2018), warping validity is binary and determined by conservation of pixel energy or correspondence; the mask is not learned, enforcing strict confidence separation.
  • Learned gating: Models such as TSRU (Luc et al., 2020) and SVPHW (Kotoyori et al., 4 Dec 2024) employ a learned per-pixel or per-feature gating mechanism. For SVPHW, a MobileNet-based decoder computes softmax-normalized weights over forward warp, backward warp, and appearance inpainting, directly fusing outputs in a convex combination at every pixel. This soft attention enables the network to route information adaptively:

R_t = m_p \odot x_p + m_{fw} \odot x_{fw} + m_{bw} \odot x_{bw}.

These gates are interpreted as attention maps for each content source and are trained indirectly via reconstruction loss.

  • Object-slot compositing: In slot-attention frameworks (Gao et al., 2023), predicted slot masks, shape codes, and per-view depth orders are fused by composite mixing functions that enforce correct stacking and occlusion ordering, enabling the full shape of an object to be predicted even if it is unobservable in the current frame.
  • Fourier-based inpainting: FFINet (Li et al., 2023) employs Fast Fourier Convolutions to enlarge the receptive field for spatial inpainting. A dedicated "occlusion inpainter" refills masked regions as indicated by binary input masks, while the global spatiotemporal translator leverages frequency-domain context to ensure dynamic consistency in occlusion recovery.
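To illustrate how Fast Fourier Convolutions obtain an image-wide receptive field, the block below applies a pointwise convolution to the real and imaginary parts of the 2-D FFT and transforms back; this is a stripped-down spectral branch under assumed shapes, not FFINet's full inpainter.

```python
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    """Minimal spectral transform: a pointwise conv applied in the frequency domain."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel dimension.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")             # complex, (B, C, H, W//2+1)
        f = torch.cat([freq.real, freq.imag], dim=1)
        f = torch.relu(self.conv(f))                         # global mixing in frequency space
        real, imag = f.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=x.shape[-2:], norm="ortho")
```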

4. Loss Functions, Training Protocols, and Quantitative Evaluation

Objective functions for occluded video prediction often mix:

  • Adversarial losses: Push the generator to produce sharp, realistic dis-occluded content (e.g., the TSRU/BigGAN-style loss (Luc et al., 2020)).
  • Masked reconstruction losses: Weighted MSE/SSIM or perceptual losses inside/outside occlusion masks (e.g., DPG (Gao et al., 2018), FFINet (Li et al., 2023)), sometimes with explicitly elevated weights (e.g., β = 10 for pixels inside the occluded region).
  • Latent and KL-divergence terms: For stochastic approaches (e.g., SVPHW (Kotoyori et al., 4 Dec 2024)), ELBO losses on variational pathways, sometimes with branch-specific latent spaces.
  • Smoothness or edge-aware regularization: Especially for flow/mask branches to encourage plausible motion and mask continuity (e.g., DPG's smoothness term \mathcal{L}_{smt}).
  • Physics prior/dynamics loss: For object-centric models, negative log-likelihood on predicted latent states under learned dynamics priors (e.g., (Riochet et al., 2020)).
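A combined objective along these lines might look like the sketch below; the specific weights, the choice of L1 reconstruction, and the function signature are assumptions for illustration, with adversarial and smoothness terms added separately.

```python
import torch
import torch.nn.functional as F

def occlusion_weighted_objective(pred, target, occ_mask, kl_term,
                                 beta_occ: float = 10.0, lambda_kl: float = 1e-3):
    """Illustrative composite loss: masked reconstruction plus a variational KL term.

    pred/target: (B, 3, H, W) frames; occ_mask: (B, 1, H, W), 1 where warping is
    unreliable; kl_term: scalar KL divergence from a variational pathway.
    Occluded pixels receive an elevated reconstruction weight, mirroring the
    masked-loss weighting described above."""
    per_pixel = F.l1_loss(pred, target, reduction="none")
    weights = 1.0 + (beta_occ - 1.0) * occ_mask           # up-weight occluded regions
    recon = (weights * per_pixel).mean()
    return recon + lambda_kl * kl_term
```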

Benchmarking occluded video prediction uses both standard per-frame metrics and occlusion-specific scores where available. Common metrics include PSNR, SSIM, LPIPS, and in some cases mask-oriented Earth Mover's Distance (EMD) or Intersection over Union (IoU) between predicted and ground-truth dis-occlusion masks. Studies report superior performance for architectures that explicitly disentangle warping and inpainting, employ object-centric inference, or integrate motion/depth cues—especially in occluded-region PSNR/IoU and in graceful performance degradation as occlusion severity increases (Gao et al., 2018, Suleyman et al., 20 Nov 2025).
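Occlusion-specific variants of the standard metrics are straightforward to compute; the sketch below shows masked PSNR and mask IoU under assumed conventions (pixel values in [0, 1], binary masks).

```python
import torch

def masked_psnr(pred, target, mask, eps: float = 1e-8):
    """PSNR restricted to pixels inside a (dis-)occlusion mask.

    pred/target: (B, C, H, W) with values in [0, 1]; mask: (B, 1, H, W) binary."""
    sq_err = ((pred - target) ** 2 * mask).sum()
    n_vals = mask.sum() * pred.shape[1] + eps          # masked pixels x channels
    return -10.0 * torch.log10(sq_err / n_vals + eps)

def mask_iou(pred_mask, gt_mask, eps: float = 1e-8):
    """IoU between predicted and ground-truth dis-occlusion masks (binary)."""
    inter = (pred_mask * gt_mask).sum()
    union = pred_mask.sum() + gt_mask.sum() - inter
    return inter / (union + eps)
```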

5. Stochasticity, Uncertainty, and Score-Based Inference

Temporal prediction under occlusion is fundamentally uncertain, especially in ambiguous object interactions or when multiple plausible dis-occlusion resolutions exist. Stochastic generative models and score-based prediction frameworks address this indeterminacy.

Score-based models (Fiquet et al., 30 Oct 2024) learn the gradient of the conditional log-probability (“score”) for the next-frame distribution given context. A U-Net denoiser is trained to perform MMSE denoising under variable Gaussian noise. At inference, annealed score-ascent sampling generates diverse, sharp samples, avoiding the averaging artifacts of direct regression, especially at occlusion boundaries with bifurcating futures. Adaptive evidence weighting ensures the model relies more on past context or current noisy observation based on local reliability, automatically diminishing erroneous context influence at occluded pixels.
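A rough sketch of such a sampling loop is shown below. The denoiser interface, step size, and noise schedule are expository assumptions; only the core idea, that the denoiser residual supplies the score of the noise-smoothed distribution, follows the cited approach.

```python
import torch

@torch.no_grad()
def annealed_score_sampling(denoiser, context, shape, sigmas, h: float = 0.5):
    """Sketch of coarse-to-fine score-ascent sampling with an assumed denoiser API.

    denoiser(x, context) is assumed to return an MMSE estimate of the clean next
    frame given a noisy frame x and past context. By Miyasawa's relation the
    residual (denoised - x) is sigma^2 times the score, so repeated partial
    denoising with a shrinking noise level climbs toward a sample of the
    next-frame distribution rather than its blurry mean."""
    x = torch.randn(shape) * sigmas[0]
    for sigma in sigmas:                                  # anneal from coarse to fine
        denoised = denoiser(x, context)
        x = x + h * (denoised - x)                        # partial step along the score
        x = x + torch.randn_like(x) * sigma * (1.0 - h)   # re-inject controlled noise
    return denoiser(x, context)                           # final clean sample estimate
```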

Stochastic variational models (e.g., SVPHW (Kotoyori et al., 4 Dec 2024)) introduce per-branch latent codes and model diverse futures through variational inference, balancing diversity-inducing KL penalties with reconstruction fidelity. In heavily occluded scenarios or long-range predictions, such stochastic sampling is critical for realism and temporal coherence.

6. Limitations, Ablations, and Prospective Extensions

Despite steady progress, limitations persist:

  • Capacity and fusion trade-offs: When combining multiple modalities (e.g., RGB, depth, flow), autoencoders may hit representational bottlenecks, degrading appearance or motion tracking if capacity is not carefully allocated or modularized (Suleyman et al., 20 Nov 2025).
  • Implicit mask learning: Many models handle occlusion via end-to-end learned soft masks without explicit ground-truth supervision, leading to potential unreliability in complex occlusions (Kotoyori et al., 4 Dec 2024).
  • Long-horizon drift: Object-centric dynamics or latent sequence models may accrue error and drift in object state estimates if occlusion persists for many frames without re-anchoring (Shih et al., 2019). In such cases, slot covariances often inflate to encode uncertainty.

Ablation studies reveal the necessity of explicit mask gating, flow-based branch separation, and perceptually weighted losses for occluded-region fidelity (Gao et al., 2018, Li et al., 2023). Direct mask supervision and new metrics targeting occluded pixels or object reappearance trajectories suggest pathways for refined evaluation.

Promising future research includes end-to-end occlusion-aware flow/depth estimation, joint latent-structure learning, model-based uncertainty quantification, and the adaptation of score-based diffusion frameworks to large-scale, long-range video prediction under occlusion.

7. Benchmarks, Datasets, and Practical Impact

Occlusion-oriented video prediction is evaluated on both synthetic and natural datasets constructed to include explicit occlusion events:

Table: Example quantitative results for occluded prediction

| Model / Dataset | Masked PSNR↑ / SSIM↑ | Occluded-region IoU / EMD↓ | Best for |
|---|---|---|---|
| DPG (Gao et al., 2018) | 22.3 / 0.696 | IoU = 24.91% | Analytic mask, inpainting |
| TSRU (Luc et al., 2020) | FVD = 44.2 | N/A | Adversarial, unsupervised gating |
| SVPHW (Kotoyori et al., 4 Dec 2024) | 21.85 / 0.654 | N/A | Stochastic, hybrid warping |
| SCAT-PD (Suleyman et al., 20 Nov 2025) | 25.69 / 0.649 | EMD = 0.0066 | Flow + depth latent |
| FFINet (Li et al., 2023) | 32.2 / 0.921 | N/A | FFT-based inpainting |

Accurate occluded video prediction has practical impact in robotics (object permanence tracking), autonomous driving (vehicle/pedestrian occlusion), telemedicine, and critical systems where prediction through dis-occlusion governs operational safety and model reliability.


Overall, occluded video prediction stands as a benchmark domain for the interplay of explicit physical modeling, sophisticated fusion strategies, uncertainty quantification, and large-scale end-to-end learning, with continuing advances targeting the rigorous reproduction of object permanence, realistic dis-occlusion, and temporally coherent synthesis even in the most challenging scenes (Suleyman et al., 20 Nov 2025, Luc et al., 2020, Gao et al., 2018, Gao et al., 2023, Fiquet et al., 30 Oct 2024, Kotoyori et al., 4 Dec 2024, Li et al., 2023, Shih et al., 2019, Riochet et al., 2020).
