
Joint Video-Image Diffusion (JVID)

Updated 2 April 2026
  • JVID is a generative modeling approach that combines image and video diffusion models to achieve high per-frame visual fidelity and temporal consistency.
  • It employs shared latent spaces and various joint denoising strategies, such as mixture of denoisers and staged pipelines, to balance detail and smooth motion.
  • Empirical results on benchmarks show improved metrics like FID, FVD, and IS, highlighting its potential in advancing high-quality video synthesis.

Joint Video-Image Diffusion (JVID) refers to a family of generative modeling approaches that integrate image diffusion models (IDMs) and video diffusion models (VDMs) with the objective of achieving both high per-frame visual fidelity and strict temporal consistency in synthesized video. This methodological paradigm addresses the intrinsic trade-off in video synthesis: purely image-based diffusion models yield sharp, photorealistic results but with pronounced frame-to-frame flicker, while video-based diffusion models enforce temporal smoothness at the expense of spatial detail and texture complexity. By architecturally unifying or systematically combining both model types—either during training, inference, or both—JVID enables joint exploitation of their complementary strengths, thus advancing the quality, coherence, and versatility of video synthesis workflows.

1. Foundations and Rationale

At their core, diffusion models generate samples by simulating the reversal of a multistep Gaussian perturbation process in latent or pixel space. Image diffusion models (e.g., latent diffusion models operating on the representations produced by a Variational Autoencoder, such as Stable Diffusion’s encoder) excel at generating sharp and detailed still images via iterative denoising. However, these models offer no mechanism for inter-frame dependency modeling when applied independently to each video frame, resulting in incoherence and flicker artifacts. In contrast, video diffusion models extend the diffusion process to the spatio-temporal domain, representing a video as a 4D tensor and employing models trained specifically to denoise and synthesize across both spatial and temporal axes (Reynaud et al., 2024).

JVID’s central insight is that, provided both models share the same latent space, noise schedule, and noise-prediction objective, their outputs can be mixed in the reverse diffusion process—calling upon the video model to enforce temporal continuity and the image model to inject per-frame detail. This unification can be realized in several algorithmic forms: joint end-to-end training with masking-based specialization (Ho et al., 2022), staged inference pipelines with inversion and blending (Shi et al., 2023), or plug-and-play alternating denoiser schemes (Shao et al., 2024).

2. Model Architectures and Diffusion Operators

A canonical JVID instantiation employs two parallel models:

  • Latent Image Diffusion Model (LIDM): A standard latent diffusion model trained on still images. It operates in the latent space produced by a pre-trained VAE, with a forward diffusion q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\,z_{t-1}, \beta_t I), and uses a UNet-based noise predictor \epsilon_\theta(z_t, t). The loss minimized is L_\text{LIDM} = \mathbb{E}[\|\epsilon - \epsilon_\theta(z_t, t)\|^2].
  • Latent Video Diffusion Model (LVDM): Processes video tensors of shape F \times 4 \times H/8 \times W/8 (with F denoting the number of frames), sharing its architecture with the LIDM but replacing all 2D convolutions with space-only 3D kernels (1 \times 3 \times 3) and adding temporal 3D kernels (3 \times 1 \times 1). Each UNet block further integrates temporal self-attention. The loss structure and denoising process mirror the LIDM.

Both models utilize the same VAE for encoding and decoding, ensuring strict latent alignment (Reynaud et al., 2024). Advanced variants substitute the convolutional backbone for transformers without altering the core mixture-of-denoisers scheme (Chen et al., 2023).
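As a toy illustration of the forward process defined above, the following pure-Python sketch applies q(z_t|z_{t-1}) element-wise to a flat latent vector. The linear beta schedule values are illustrative assumptions, not taken from any of the cited papers.

```python
import math
import random

def forward_diffusion_step(z_prev, beta_t, rng=random):
    """One step of the forward process
    q(z_t | z_{t-1}) = N(z_t; sqrt(1 - beta_t) * z_{t-1}, beta_t * I),
    applied element-wise to a flat latent vector."""
    scale = math.sqrt(1.0 - beta_t)
    return [scale * z + math.sqrt(beta_t) * rng.gauss(0.0, 1.0) for z in z_prev]

# Toy latent; in practice this would be a VAE-encoded frame of shape 4 x H/8 x W/8.
random.seed(0)
z = [1.0, -0.5, 0.25, 2.0]
# Illustrative linear schedule from 1e-4 to 2e-2 over 1000 steps (an assumption).
betas = [0.0001 + (0.02 - 0.0001) * t / 999 for t in range(1000)]
for beta_t in betas[:10]:
    z = forward_diffusion_step(z, beta_t)
```

Because both denoisers consume and produce latents from the same VAE, this single forward process serves the LIDM per frame and the LVDM per video tensor alike.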

3. Joint Denoising and Sampling Strategies

JVID frameworks operationalize joint diffusion in a variety of ways, each reflecting a specific scheduling and combination of denoiser application:

  • Mixture of Denoisers (JVID original): At each timestep t, a probability schedule P_V(t) determines whether the LVDM or the LIDM denoiser is applied. P_V(t) can be piecewise linear, with early steps weighted towards the video model (enforcing temporal structure) and late steps towards the image model (for restoration of local detail). This approach preserves the computational and implementation boundaries of each model by restricting interaction to the reverse diffusion loop (Reynaud et al., 2024).
  • Sequential Staged Pipelines (BIVDiff): Video synthesis proceeds in ordered stages: (i) frame-wise denoising via an IDM for per-frame fidelity, (ii) mixed inversion where each clean frame latent is mapped to noisy latents via both the IDM and the VDM and then linearly blended, followed by (iii) full-sequence temporal denoising by the VDM to smooth transitions. Mixing ratios and guidance scales control the fidelity–coherence trade-off (Shi et al., 2023).
  • Alternating Denoiser Sampling (IV-Mixed Sampler): Each DDIM step comprises an “IDM go-and-back” (a DDIM inversion + DDIM denoise via the IDM) immediately followed by a “VDM go-and-back” (analogous operations via the VDM), with independent classifier-free guidance scales for each pass. This process injects spatial detail and temporal structure recursively, yielding state-of-the-art results on both human and automated FVD/FID-based benchmarks (Shao et al., 2024).
  • Joint Training with Masked Attention: A single UNet is trained simultaneously on videos (with active temporal attention) and independent images (with temporal attention masked out), as in "Video Diffusion Models" (Ho et al., 2022). The model learns a shared representation and can generate both images and videos without explicit model mixing at inference.
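A minimal sketch of the mixture-of-denoisers scheduling described above, assuming a piecewise-linear P_V(t); the breakpoints (0.7 and 0.3) and the string labels standing in for the two denoisers are illustrative, not the published schedule.

```python
import random

def p_video(t, T, t_hi=0.7, t_lo=0.3):
    """Piecewise-linear probability of applying the video denoiser at step t.
    High for early (large-t, high-noise) steps to lay down temporal structure,
    low for late (small-t) steps so the image denoiser restores local detail.
    Breakpoints t_hi / t_lo are illustrative assumptions."""
    s = t / T
    if s >= t_hi:
        return 1.0
    if s <= t_lo:
        return 0.0
    return (s - t_lo) / (t_hi - t_lo)

def select_denoiser(t, T, rng=random):
    """Sample-time switching: draw which denoiser runs at this timestep."""
    return "LVDM" if rng.random() < p_video(t, T) else "LIDM"

random.seed(0)
T = 1000
# Reverse diffusion runs from t = T down to 1.
choices = [select_denoiser(t, T) for t in range(T, 0, -1)]
```

Because interaction is confined to this outer loop, either model can be swapped out without retraining the other.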

The table below summarizes JVID denoising schedules and sampling paradigms:

Approach                   | Denoising Schedule                    | Implementation
Mixture of Denoisers       | P_V(t) weights LVDM early, LIDM late  | Sample-time switching
BIVDiff Pipeline           | Frame → Mix → Sequence                | Multi-stage inference
IV-Mixed Sampler           | IDM+VDM alternating per step          | Nested DDIM operations
Joint Training (Ho et al.) | Temporal mask for images              | Single UNet training
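The blending step of BIVDiff's mixed inversion (stage ii in the staged pipeline above) reduces to a per-element linear interpolation between the two models' noisy latents; this sketch assumes a single scalar mixing ratio, with function and parameter names chosen for illustration.

```python
def mixed_inversion(z_idm_noisy, z_vdm_noisy, mix_ratio):
    """BIVDiff-style mixed inversion (sketch): a clean frame latent is mapped
    to noisy latents by both the IDM and the VDM, then linearly blended.
    mix_ratio=1.0 keeps only the IDM inversion (per-frame fidelity),
    0.0 keeps only the VDM inversion (temporal coherence)."""
    return [mix_ratio * a + (1.0 - mix_ratio) * b
            for a, b in zip(z_idm_noisy, z_vdm_noisy)]

z_mix = mixed_inversion([1.0, 2.0], [3.0, 4.0], mix_ratio=0.5)  # -> [2.0, 3.0]
```

The blended latents are then handed to the VDM for full-sequence temporal denoising, so mix_ratio directly controls the fidelity–coherence trade-off noted in the text.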

4. Implementation Techniques and Hyperparameterization

Architectural and procedural choices in JVID directly impact temporal consistency and image fidelity:

  • Latent Compatibility: Both denoisers must operate in the identical latent space, typically ensured by using a pre-trained VAE for both training data encoding and output decoding.
  • Attention Mechanisms: Temporal self-attention modules in each block of the video model capture inter-frame dependencies; masking enables efficient mixed (image+video) training without network collisions (Ho et al., 2022, Chen et al., 2023).
  • Training Regimes: LIDM and LVDM are often trained progressively at increasing spatial resolutions, adopting large batch sizes and extensive compute resources (e.g., 8–64 GPUs, AdamW optimizers, PNDM schedulers). Conditional dropout can regularize the model to improve generalization across frames and scenes (Reynaud et al., 2024).
  • Sampling and Entropy Reduction: DDPM or DDIM samplers are standard; guidance scales (e.g., 2.0 for DDPM; 7.5 for classifier-free guidance) are tuned to trade off adherence to conditioning signals against sample diversity. Entropy reduction (down-scaling the stochastic noise term in the DDPM update) and custom channel-wise latent smoothing further suppress flicker (Reynaud et al., 2024).
  • Inference Schedules: Piecewise linear or dynamic weighting of the two denoiser contributions is typically tuned per resolution/benchmark; plug-and-play wrappers permit rapid experimentation without retraining (Shao et al., 2024).
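Combining two of the knobs above, here is a toy sketch of one reverse DDPM update with classifier-free guidance and a noise-scaling form of entropy reduction. Exactly which term JVID scales is an assumption here, and the schedule values in the test are illustrative.

```python
import math
import random

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: push the conditional noise prediction away
    from the unconditional one by guidance_scale (e.g., 7.5)."""
    return [eu + guidance_scale * (ec - eu)
            for ec, eu in zip(eps_cond, eps_uncond)]

def ddpm_step(z_t, eps_hat, alpha_t, alpha_bar_t, beta_t,
              noise_scale=1.0, rng=random):
    """One reverse DDPM update:
    mean = (z_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t).
    noise_scale < 1 shrinks the injected Gaussian noise ('entropy reduction'
    in the text); treating the noise term as the scaled quantity is an
    assumption of this sketch."""
    mean = [(z - (beta_t / math.sqrt(1.0 - alpha_bar_t)) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps_hat)]
    sigma = math.sqrt(beta_t)
    return [m + noise_scale * sigma * rng.gauss(0.0, 1.0) for m in mean]
```

With noise_scale=0 the update becomes deterministic, which is the limiting case of flicker suppression at the cost of sample diversity.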

5. Quantitative Evaluation and Empirical Findings

JVID frameworks have been benchmarked extensively on established datasets, using metrics designed to capture both frame quality and temporal consistency:

On UCF-101, JVID achieved FID=31.44, FVD=1037.81, IS=11.25. This improves FVD over the pure video model (FVD=1057.34), while delivering spatial fidelity close to the image-only baseline (FID=17.09) (Reynaud et al., 2024). In BIVDiff, frame consistency on DAVIS-text (CLIP-based) is FC=92.67% vs. 91.69% for the best baseline, with substantial gains in user-rated quality and temporal smoothness (Shi et al., 2023).

IV-Mixed Sampler reduced FVD on benchmarks such as Chronomagic-Bench-150 (Animatediff: 219.29→192.72) and achieved measurable improvements in both automated and human evaluation scenarios (Shao et al., 2024). The effect of the joint diffusion paradigm is most visible in tasks requiring both intricate per-frame detail (e.g., water ripples, foliage) and motion continuity, with a marked reduction in flicker and artifacts.

6. Comparative Perspectives and Alternative Paradigms

Alternative approaches for video synthesis and joint modeling include:

  • Transformer-based Diffusion (GenTron): Utilizes large-scale pre-norm Transformers with patch-based tokenization and integrates temporal self-attention layers per block. Training incorporates joint image-video examples, and "motion-free guidance" is introduced to control the influence of spatial vs. temporal information during both training and inference (Chen et al., 2023).
  • End-to-End Joint Training: As in Ho et al. (Ho et al., 2022), a single UNet admits both video block and independent image supervision within one objective. Temporal attention is activated for videos and masked for images, providing gradient variance reduction and facilitating rapid convergence and high-quality joint modeling.
  • Plug-and-Play Schedulers: IV-Mixed Sampler introduces multi-step, two-pass per-sample denoising, combining inversion and denoising for both IDM and VDM at each inference step, governed by distinct guidance schedules (Shao et al., 2024).
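The "go-and-back" primitive used by the IV-Mixed Sampler can be sketched with deterministic DDIM jumps between noise levels; here eps_model stands in for either denoiser, and the function names are illustrative. With a consistent noise prediction, inversion followed by denoising is an exact identity, which is what makes alternating the two models well-defined: each pass perturbs the trajectory only where the two models' predictions differ.

```python
import math

def ddim_jump(z, eps, ab_from, ab_to):
    """Deterministic DDIM transport of a (scalar) latent between noise levels,
    where ab_* are cumulative alpha-bar values (smaller = noisier). Moving to
    a smaller ab_to is inversion; to a larger ab_to, denoising."""
    z0 = (z - math.sqrt(1.0 - ab_from) * eps) / math.sqrt(ab_from)
    return math.sqrt(ab_to) * z0 + math.sqrt(1.0 - ab_to) * eps

def go_and_back(z, eps_model, ab, ab_noisy):
    """One 'go-and-back': invert one sub-step, then denoise back, querying the
    model at each level. In IV-Mixed sampling, an IDM pass and a VDM pass of
    this form are applied in alternation at every DDIM step."""
    z_noisy = ddim_jump(z, eps_model(z), ab, ab_noisy)           # inversion
    return ddim_jump(z_noisy, eps_model(z_noisy), ab_noisy, ab)  # denoise
```

A constant eps_model round-trips exactly; a real IDM/VDM pair produces slightly different predictions at the two levels, and that difference is the injected detail or temporal structure.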

A plausible implication is that learned mixture weights and further model modularity (e.g., inclusion of depth, audio, or control signals as additional denoisers) will yield increasingly controllable and extendable video synthesis systems.

7. Limitations and Future Directions

Despite significant advances, current JVID implementations confront limitations:

  • Sample Efficiency and SOTA Gaps: At typical academic compute scale, FVD scores remain above leading proprietary models (e.g., Make-A-Video FVD ~81); pre-training on larger, more diverse corpora is likely required to close this gap (Reynaud et al., 2024).
  • Inference Cost: Cascaded and multi-stage sampling strategies (e.g., IV-Mixed Sampler at 250 DDIM calls per sequence) increase per-clip inference latency and resource consumption.
  • Extensibility: Most reported frameworks combine exactly two model types (IDM + VDM). Extension to a more general modular mixture—including additional modalities (audio, depth) or task-oriented denoisers—is a stated direction (Reynaud et al., 2024).
  • Mixing Schedules and Weights: Hand-tuned or piecewise linear schedules predominate; learning mixture weights end-to-end or in an adaptive, data-driven manner is an area for future research.
  • Robustness and Generalization: JVID variants are typically evaluated on fixed-length, center-cropped, and frame-rate-normalized videos; handling variable-length, arbitrary-resolution, and unaligned real-world sequences remains a challenge (Ho et al., 2022).
  • Bias and Societal Impact: There is an identified need to audit and address the propagation of social bias, especially when leveraging large-scale still-image repositories for joint training (Ho et al., 2022).

A plausible implication is that modular, end-to-end trainable joint diffusion backbones combined with advances in learned mixture scheduling and multi-modal conditioning will further accelerate convergence toward high-fidelity, temporally-consistent, and broadly controllable video generation systems.
