Bidirectional Video Diffusion Models
- Bidirectional video diffusion models are generative frameworks that use both past and future frame contexts to produce temporally consistent videos.
- They employ techniques like bidirectional attention, state-space models, and bridge priors to address temporal coherence and computational challenges.
- These models enable advanced applications such as interpolation, prediction, and infilling while improving metrics like FVD, SSIM, and PSNR.
Bidirectional video diffusion models are a class of generative models that leverage both forward and backward temporal dependencies to achieve high-fidelity, temporally coherent video synthesis, interpolation, prediction, and infilling. In contrast to autoregressive or strictly unidirectional conditioning, bidirectional models utilize context from both past and future frames (or boundary conditions, including semantic keyframes), and often couple this approach with specialized architectural elements—such as bidirectional attention, state-space models, or bridge priors—to address challenges of temporal coherence, distributional alignment, and computational efficiency.
1. Core Principles and Mathematical Foundations
Bidirectional video diffusion models extend denoising diffusion probabilistic models (DDPMs) to the spatiotemporal domain by explicitly modeling dependencies in both temporal directions. For a video sequence $x_0^{1:N}$, the forward process stochastically corrupts clean data into noise via a Markov chain, $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\big)$ applied to all frames, with time-varying noise parameters $\beta_t$ (and the derived quantities $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$). The reverse (generation) process parameterizes $p_\theta(x_{t-1} \mid x_t)$ via a neural denoiser $\epsilon_\theta$, trained with denoising score-matching objectives.
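The forward corruption above admits a closed-form marginal, $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I\big)$, which is what training actually samples from. A minimal NumPy sketch, using a cosine cumulative schedule as an illustrative (not paper-specific) choice; the function names are ours:

```python
import numpy as np

def cosine_alphas(T=1000):
    # Cumulative schedule \bar{alpha}_t, decreasing from ~1 to 0 (cosine-style).
    t = np.arange(T + 1) / T
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f[1:] / f[0]

def q_sample(x0, t, alpha_bar, rng):
    # Closed-form forward marginal: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 8, 8))   # toy "clean" video: (frames, H, W)
abar = cosine_alphas()
x_noisy = q_sample(video, 500, abar, rng)  # video half-way corrupted to noise
```

The same corruption is applied per frame; bidirectionality enters through the denoiser's conditioning, not the forward process.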
Bidirectionality refers to the use of both early and late frames as conditioning—in interpolation, infilling, or general sequence modeling—such that at each sampling step, predictions utilize information from both directions. Architecturally, this is implemented with bidirectional attention mechanisms (unmasked transformers), bidirectional state-space models (SSMs as in S4D or Mamba), or algorithms that alternate forward and backward denoising passes while merging intermediate states (Yin et al., 2024, Yang et al., 2024).
In specialized settings, bidirectional diffusion considers bridge samplers, where the forward process bridges between two known endpoints (e.g., start and end frames), and the reverse process reconstructs the intermediate sequence by solving a stochastic bridge using analytical marginals or stochastic differential equation (SDE)-driven samplers with temporal correlation priors (Vasilev et al., 14 Oct 2025).
2. Model Architectures and Implementation Approaches
Bidirectional Attention Architectures
Diffusion Transformers ("DiT") and related models flatten video spatial-temporal patches and use standard self-attention over the entire token sequence, removing any causal (AR) mask. Thus, each frame can condition on both previous and future temporal context (Yin et al., 2024). This mechanism is core to state-of-the-art long-context video generators, as it allows for fully global context aggregation during both training and inference.
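The difference between bidirectional and causal attention is just the presence or absence of the autoregressive mask. A toy single-head NumPy sketch (illustrative, not a DiT implementation) makes the contrast concrete:

```python
import numpy as np

def attention(q, k, v, causal=False):
    # Scaled dot-product attention over a flattened token sequence.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Autoregressive mask: token i cannot attend to tokens j > i.
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))   # 6 spatio-temporal patch tokens
_, w_bi = attention(tokens, tokens, tokens, causal=False)
_, w_ar = attention(tokens, tokens, tokens, causal=True)
# Bidirectional: the first token attends to all later tokens;
# causal: its attention over future tokens is exactly zero.
```

Removing the mask is what lets every frame's tokens aggregate context from the entire clip during both training and sampling.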
Bidirectional State-Space Models (SSMs)
Recent advances replace or augment attention modules with bidirectional SSMs—linear dynamical systems implemented as convolutional kernels or recurrent modules that operate along both forward and reverse time axes. S4D/Mamba SSMs are incorporated as temporal feature extractors inside a U-Net diffusion backbone. At every stage, sequences are processed (i) forward in time, and (ii) backward in time via sequence reversal, followed by aggregation (e.g., summation, concatenation, or gating). This efficiently captures long-range temporal dependencies with memory and compute scaling linearly in the sequence length $L$, i.e., $O(L)$, in contrast to the $O(L^2)$ cost of attention layers (Oshima et al., 2024, Mo et al., 2024).
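The forward/backward-pass-plus-aggregation pattern can be sketched with a toy diagonal linear recurrence standing in for a real S4D/Mamba layer (the function names and the summation aggregation are illustrative assumptions):

```python
import numpy as np

def ssm_scan(x, a=0.9, b=1.0):
    # Toy diagonal SSM: h_t = a*h_{t-1} + b*x_t — a single O(L) scan.
    h = np.zeros_like(x[0])
    out = []
    for xt in x:
        h = a * h + b * xt
        out.append(h.copy())
    return np.stack(out)

def bidirectional_ssm(x):
    fwd = ssm_scan(x)              # forward-in-time pass
    bwd = ssm_scan(x[::-1])[::-1]  # reversed pass, flipped back to time order
    return fwd + bwd               # aggregation by summation

rng = np.random.default_rng(0)
seq = rng.standard_normal((12, 4))  # (time, channels)
y = bidirectional_ssm(seq)
```

Each output step mixes information from both the past (forward scan) and the future (reversed scan), while total work stays linear in sequence length.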
Bidirectional Bridge Matching
Time-Correlated Video Bridge Matching (TCVBM) takes a distinct approach: it specifies a tridiagonal prior coupling (e.g., a discretized Laplacian) in the video SDE, induces analytic transition kernels, and formulates the sampling problem as bridging between arbitrary data endpoints. The model samples intermediate states from the closed-form bridge distribution, and a neural score or denoiser learns to reconstruct the clean sequence from these intermediate samples. The joint temporal dependencies are enforced at the SDE level, rather than exclusively via the neural architecture (Vasilev et al., 14 Oct 2025).
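To illustrate the endpoint-pinned sampling idea in its simplest form, the sketch below draws from a plain Brownian bridge marginal between two keyframes. This is a deliberate simplification: TCVBM's actual prior is time-correlated (tridiagonal coupling), whereas a Brownian bridge treats each pixel independently; the function name is ours:

```python
import numpy as np

def brownian_bridge_sample(x_start, x_end, t, sigma, rng):
    # Marginal of a Brownian bridge pinned at both endpoints (t in [0, 1]):
    # the mean interpolates linearly; the variance vanishes at t=0 and t=1.
    mean = (1.0 - t) * x_start + t * x_end
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x_start.shape)

rng = np.random.default_rng(0)
f0 = np.zeros((8, 8))  # toy start keyframe
f1 = np.ones((8, 8))   # toy end keyframe
mid = brownian_bridge_sample(f0, f1, 0.5, sigma=0.1, rng=rng)
```

A learned score/denoiser then reconstructs the clean intermediate frames from such bridge samples; TCVBM's contribution is replacing the independent-pixel bridge with analytically tractable, temporally correlated transition kernels.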
Cross-conditioned and Recursive Denoising
Frameworks such as ViBiDSampler (Yang et al., 2024) perform explicit sequential sampling, alternating forward denoising from start keyframe and backward denoising from end keyframe, with a single re-noising step to bridge their trajectories. This prevents "off-manifold" artifacts associated with naïve linear latent-space fusions. Bidirectional Temporal Diffusion Models (BTDM) further instantiate recursive alternating steps across frames, using cross-conditioning and hierarchical U-Net blocks with bidirectional attention (Adiya et al., 2023).
3. Applications: Interpolation, Prediction, Infilling, and Video Understanding
Keyframe Interpolation and Inbetweening
Bidirectional samplers are well-suited for interpolation tasks: given a start and end frame, the model fills in temporally coherent, realistic intermediate frames by conditioning simultaneously on both boundaries. The pipeline in ViBiDSampler achieves state-of-the-art perceptual fidelity and consistency for high-resolution interpolation, outperforming both traditional flow-based and autoregressive approaches (Yang et al., 2024, Vasilev et al., 14 Oct 2025).
Long-Range Video Generation
By leveraging bidirectional SSMs or attention without causal masking, models can generate extended video sequences (hundreds of frames) with reduced memory requirements. This enables scalable training and inference for tasks such as open-ended video synthesis and multi-step prediction (Mo et al., 2024, Oshima et al., 2024).
Infilling and Data-Driven Bridging
Random-mask diffusion frameworks (e.g., RaMViD) introduce stochastic variable-length conditioning, learning to infill arbitrary frame subsets given any set of context (including both early and late frames) (Höppe et al., 2022). Bridge-matching approaches allow translation between various distributional endpoints, such as low-resolution to high-resolution sequences (video super-resolution), or partial-to-complete video reconstruction (Vasilev et al., 14 Oct 2025).
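The random-mask conditioning idea reduces to sampling, per training example, a boolean mask over frames: observed frames keep their clean values while the rest are diffused. A minimal sketch (function names and the independent-Bernoulli masking are illustrative assumptions, not RaMViD's exact schedule):

```python
import numpy as np

def sample_condition_mask(num_frames, rng, p_cond=0.5):
    # Each frame is independently marked as observed context (True) or
    # to-be-generated (False); one model thus covers unconditional,
    # causal, and bidirectional tasks depending on the drawn mask.
    return rng.random(num_frames) < p_cond

def apply_mask(noisy, clean, mask):
    # Observed frames keep clean values; masked frames remain noisy.
    return np.where(mask[:, None, None], clean, noisy)

rng = np.random.default_rng(0)
mask = sample_condition_mask(8, rng)
clean = np.ones((8, 4, 4))   # toy clean video
noisy = np.zeros((8, 4, 4))  # toy diffused video
mixed = apply_mask(noisy, clean, mask)
```

Setting the mask to all-False recovers unconditional generation, a prefix-only mask recovers causal prediction, and masks with both early and late frames set give bidirectional infilling.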
Temporally Consistent Animation
Bidirectional models address the "motion-appearance ambiguity" in human animation by enforcing global coherence through dual temporal reversals and feature-level cross-conditioning. This suppresses texture drift and stabilizes identity appearance, critical to photorealistic animation and motion transfer (Adiya et al., 2023).
Video Language Understanding
VidLaDA extends bidirectional diffusion to LLMs over video token streams, employing bidirectional attention for question-answering, event localization, and reasoning, achieving strong spatiotemporal grounding. It addresses the limitations of AR models (causal masking bias) and accelerates inference via MARS-Cache (modality- and layer-wise lazy cache updating, frame-chunked attention, global anchor tokens) (He et al., 25 Jan 2026).
4. Bidirectional Sampling Algorithms and Guidance Strategies
Sampling in bidirectional video diffusion models is characterized by algorithms that interchange or merge information from forward and backward passes. For keyframe interpolation (ViBiDSampler), the strategy merges conditioning by executing a forward branch conditioned on the start frame, performing a re-noising step, and then a backward branch (on time-reversed sequence) conditioned on the end frame. A single re-noising step between branches enforces on-manifold trajectories.
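The alternating structure of this sampler can be sketched with a toy stand-in for the learned denoiser (the `toy_denoise` update is purely illustrative; a real sampler would run a trained video diffusion model at a proper noise schedule):

```python
import numpy as np

def toy_denoise(x, cond, strength=0.5):
    # Stand-in for a learned conditional denoiser: pulls the sequence
    # toward its conditioning keyframe (illustrative, not a trained model).
    return x + strength * (cond - x)

def renoise(x, sigma, rng):
    # Single re-noising step bridging the two branches, keeping the
    # trajectory on-manifold rather than linearly fusing latents.
    return x + sigma * rng.standard_normal(x.shape)

def bidirectional_sample(start, end, num_frames, steps, rng):
    x = rng.standard_normal((num_frames,) + start.shape)  # pure-noise init
    for _ in range(steps):
        x = toy_denoise(x, start)                 # forward branch: start frame
        x = renoise(x, sigma=0.1, rng=rng)        # re-noise between branches
        x = toy_denoise(x[::-1], end)[::-1]       # backward branch on reversed
    return x

rng = np.random.default_rng(0)
start, end = np.zeros(4), np.ones(4)              # toy keyframe latents
x = bidirectional_sample(start, end, num_frames=8, steps=20, rng=rng)
```

The time reversal before the backward branch lets the same (forward-conditioned) denoiser impose the end-frame constraint, which is the core trick behind this family of samplers.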
Advanced guidance mechanisms further stabilize and direct the sampling trajectory:
- Classifier-Free Guidance++ (CFG++) replaces the denominator in Euler-based updaters with the unconditional score, preserving manifold adherence under strong conditioning (Yang et al., 2024).
- Decomposed Diffusion Solver (DDS) applies small-scale Krylov subspace optimization to exactly match boundary frames during denoising, correcting conditioning deviation at each step.
- Random-masked training (RaMViD) trains a single model to handle arbitrary context patterns, so that one model covers unconditional, causal, and bidirectional tasks (Höppe et al., 2022).
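For reference, the baseline classifier-free guidance combination that CFG++ builds on is a one-line extrapolation from the unconditional score toward the conditional one (the function name is ours; CFG++'s specific change to the Euler update is described in the bullet above and not reproduced here):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    # Standard classifier-free guidance: extrapolate from the unconditional
    # score prediction toward the conditional one with guidance weight w.
    # w=0 is unconditional, w=1 is purely conditional, w>1 over-emphasizes
    # the condition (where manifold drift becomes a concern).
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.zeros(3)   # toy unconditional noise prediction
eps_cond = np.ones(3)      # toy conditional noise prediction
guided = cfg_combine(eps_uncond, eps_cond, w=7.5)
```

At large w this extrapolation is exactly what pushes samples off-manifold, which motivates the CFG++ modification discussed above.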
A summary of dominant bidirectional sampling pipelines:
| Method | Sampling Core | Manifold Handling | Guidance |
|---|---|---|---|
| ViBiDSampler (Yang et al., 2024) | Forward + re-noise + backward | Sequential, on-manifold | CFG++, DDS |
| SSM-Based (Oshima et al., 2024) | Bidirectional S4D (f/b passes) | Implicit in SSM prior | L2 denoising |
| TCVBM (Vasilev et al., 14 Oct 2025) | Analytic bridge SDE | Closed-form marginals | Score matching |
| BTDM (Adiya et al., 2023) | Recursive (fwd/back alt.) | Cross-frame, U-Net | Bidirectional loss |
5. Computational Complexity, Scaling, and Empirical Results
A central challenge historically was the quadratic memory and compute scaling of bidirectional attention. State-space models (Mamba, S4D) bring linear $O(L)$ scaling, enabling models to handle 400-frame sequences within practical GPU memory (on the order of 40 GB), whereas attention-based models exceed such budgets at comparable sequence lengths (Oshima et al., 2024, Mo et al., 2024). Empirically, SSM-based and DiM models outperform attention-based architectures in FVD, sFID, and other diversity/quality metrics across datasets such as UCF-101 and MineRL.
Inference speed is significantly improved by MARS-Cache (VidLaDA), which asynchronously updates visual representations, leverages chunked attention with global anchor tokens, and prunes redundant computation. This delivers over 12× higher throughput compared to vanilla bidirectional diffusion decoding, with negligible (<0.5%) accuracy degradation (He et al., 25 Jan 2026).
Bidirectional video generators trained with bridge prior (TCVBM) show improved FVD, LPIPS, PSNR, and SSIM versus classical diffusion and Brownian bridge approaches for interpolation and upsampling tasks (Vasilev et al., 14 Oct 2025). Ablations consistently demonstrate that removing bidirectionality results in substantial degradation of temporal coherence and perceptual quality (Oshima et al., 2024, Adiya et al., 2023).
6. Limitations, Open Problems, and Future Directions
Despite their empirical successes, bidirectional video diffusion models present several limitations:
- Bidirectional models are generally unsuitable for low-latency or streaming applications since each frame's generation requires access to the full sequence (including "future" frames). Recent work mitigates this latency by distilling causal, autoregressive student models from bidirectional teachers using Distribution Matching Distillation (DMD), enabling streaming generation at orders-of-magnitude lower inference times (Yin et al., 2024).
- High memory requirements persist for large-capacity, bidirectionally attending transformers, motivating continued investigation into SSMs and mixed attention/state-space hybrid designs (Mo et al., 2024).
- Current interpolation strategies typically assume linear or uniform temporal correspondences; adaptive temporal priors and spatiotemporal editing remain open for further development (Yang et al., 2024).
- Existing approaches rely predominantly on Gaussian or linear temporal priors; incorporating richer, data-driven temporal models or learned feature-adapters is an active area (Vasilev et al., 14 Oct 2025).
- Bidirectional training may not be optimal for online settings or scenarios where only partial context is accessible.
Promising research directions include predictor-corrector samplers, adaptive step schedules, zero-shot dynamic prompting, video-to-video translation in online settings, and integration with higher-level semantic priors (prompting, CLIP guidance) (Yin et al., 2024, Yang et al., 2024).
7. Comparative Performance and Benchmarks
Empirical benchmarks across datasets and tasks demonstrate that bidirectional video diffusion models match or surpass state-of-the-art methods in perceptual and distributional scores. Selected results, as reported in the respective papers:
| Method/Metric | LPIPS ↓ | FID ↓ | FVD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| ViBiDSampler (Yang et al., 2024) (DAVIS/Pexels) | 0.2355 | 35.66 | 399.2 | — | — |
| SSM (UCF101 16@32²) (Oshima et al., 2024) | — | — | 226.45 | — | — |
| TCVBM Interpolation (Vasilev et al., 14 Oct 2025) | 0.077 | — | 30.54 | 17.28 | 0.813 |
| BTDM (person anim.) (Adiya et al., 2023) | 0.036 | 11.14 | — | — | 0.958 |
FVD, LPIPS, SSIM, and PSNR are standard metrics for perceptual quality, distributional alignment, and pixel-level fidelity. Bidirectional models generally outperform unidirectional, autoregressive, and flow-based approaches across these metrics and are especially robust to long-sequence and complex motion scenarios.
In summary, bidirectional video diffusion models encompass a family of architectures and algorithms that utilize both forward and backward temporal context via bidirectional attention, state-space models, or bridge priors, yielding high-quality, coherent, and efficient solutions for video generation, interpolation, and understanding. The maturation of efficient bidirectional components and scalable sampling strategies positions these models at the frontier of generative video modeling (Yin et al., 2024, Yang et al., 2024, Oshima et al., 2024, Mo et al., 2024, Vasilev et al., 14 Oct 2025, Höppe et al., 2022, He et al., 25 Jan 2026, Adiya et al., 2023).