4D Diffusion Models: Dynamic Spatial-Temporal Synthesis

Updated 4 July 2026

4D diffusion models are generative systems that meld 3D spatial structures with temporal dynamics to create dynamic scenes from images, text, or video inputs.
They leverage latent diffusion on compressed video representations or direct diffusion on geometrical objects to ensure both visual realism and structural coherence.
Hybrid approaches combine diffusion-driven video generation with explicit 4D reconstruction, addressing challenges in resolution, trajectory consistency, and data supervision.

A 4D diffusion model is a diffusion-based generative system whose target variable has an explicit temporal dimension in addition to 3D structure, camera viewpoint, or both. In current literature, the term covers several related formulations: dynamic 3D scene generation from images, text, or video; view–time video synthesis with camera control; direct modeling of full $3D+t$ medical volumes; diffusion over mesh deformations or landmark trajectories; and hybrid pipelines that use diffusion to produce multi-view or panoramic videos before lifting them into explicit 4D representations such as dynamic Gaussian splats or point clouds (Fan et al., 11 Dec 2025, Seyfarth et al., 26 Mar 2026, Shao et al., 2024, Zhou et al., 30 Apr 2025).

1. Conceptual scope and definitions

The literature does not use “4D diffusion model” to denote a single mathematical object. In scene-generation work, 4D commonly means a dynamic 3D scene that can be rendered from varying viewpoints over time. OmniView makes this explicit by defining many tasks as instances of a single “4D consistency” problem: given some combination of spatially observed content, timestamps, and camera parameters, generate a temporally and geometrically consistent video along a target camera trajectory (Fan et al., 11 Dec 2025). Closely related formulations appear in 4DiM, which writes 4D novel view synthesis as generation conditioned on images, relative camera poses, and relative timestamps, and in Human4DiT, which treats the output as a latent tensor indexed by viewpoint $V$, time $T$, and spatial dimensions $H\times W$ (Watson et al., 2024, Shao et al., 2024).

In other domains, the fourth dimension is time attached to a native 3D structure rather than camera variation. CardioDiT models cine cardiac MRI as a full $3D+t$ distribution over depth, height, width, and cardiac phase, and explicitly contrasts this with factorized slice-wise or channel-merged baselines (Seyfarth et al., 26 Mar 2026). AnimateMe and the earlier 4D Facial Expression Diffusion Model define 4D as 3D facial geometry evolving over time: in one case as per-vertex mesh deformations relative to a neutral mesh, in the other as temporal sequences of 3D landmarks followed by landmark-guided mesh deformation (Gerogiannis et al., 2024, Zou et al., 2023).
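
As a minimal sketch of the per-vertex formulation, the snippet below stores a facial sequence as displacements from a neutral mesh and recovers the sequence again. The array shapes, the toy data, and the helper names `to_deformations` and `from_deformations` are illustrative assumptions, not code from AnimateMe.

```python
import numpy as np

def to_deformations(frames, neutral):
    """frames: (T, N, 3) vertex positions; neutral: (N, 3) rest pose.
    Returns per-vertex displacements d_t = x_t - x_0 with shape (T, N, 3)."""
    return frames - neutral[None, :, :]

def from_deformations(deformations, neutral):
    """Invert the encoding: add each displacement field back onto the neutral mesh."""
    return neutral[None, :, :] + deformations

rng = np.random.default_rng(0)
neutral = rng.normal(size=(5, 3))                           # toy 5-vertex "mesh"
frames = neutral[None] + 0.01 * rng.normal(size=(4, 5, 3))  # 4 slightly deformed frames
deformations = to_deformations(frames, neutral)
```

Because the mesh topology is fixed, a generative model in this setting only has to produce the `(T, N, 3)` displacement tensor; connectivity is inherited from the neutral mesh.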

A further distinction concerns whether 4D is represented explicitly or only through generated observations. Zero4D defines 4D video as a spatio-temporal grid $x[i,j]$ indexed by camera viewpoint and time, without reconstructing explicit geometry (Park et al., 28 Mar 2025). By contrast, papers such as Diff4Splat, TriDiff-4D, and R2LDM treat 4D as a time-varying structured representation—a deformable 3D Gaussian field, pose-conditioned triplanes, or latent voxel features decoded into dense point clouds—rather than a video tensor alone (Pan et al., 1 Nov 2025, Sheung et al., 20 Nov 2025, Zheng et al., 21 Mar 2025).
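
The view-time grid can be pictured as a plain array indexed first by camera and then by time. The following sketch uses hypothetical toy dimensions and fill values purely to show how a fixed-view video and a frozen-time camera sweep are two slices of the same grid; it is not taken from Zero4D.

```python
import numpy as np

num_views, num_frames, height, width = 3, 4, 2, 2
# grid[i, j] is the RGB frame seen from camera i at time j (toy values).
grid = np.zeros((num_views, num_frames, height, width, 3), dtype=np.float32)
for i in range(num_views):
    for j in range(num_frames):
        grid[i, j] = 10 * i + j  # encode (view, time) so slices are easy to inspect

fixed_view_video = grid[1]      # (T, H, W, 3): ordinary video from camera 1
frozen_time_sweep = grid[:, 2]  # (V, H, W, 3): all cameras at time step 2
```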

2. Representations and diffusion variables

A central design choice is what variable is actually diffused. Many systems perform diffusion in a compressed latent video space. OmniView uses a Diffusion Transformer operating in the latent space of a 3D VAE, patchifying the video latent tensor into tokens $\mathbf{z}_{xyt}\in\mathbb{R}^d$ and training the model as a rectified-flow/video-diffusion system in latent space (Fan et al., 11 Dec 2025). Diff4Splat similarly builds on CogVideoX latents and conditions them on Plücker-encoded camera trajectories before a latent dynamic reconstruction module predicts a canonical 3D Gaussian field plus per-time deformations (Pan et al., 1 Nov 2025). Diffusion4D and 4DVD also formulate 4D generation as latent video diffusion, but use the generated multi-view or orbital videos as supervision for downstream 4D Gaussian reconstruction rather than as the final representation (Liang et al., 2024, Yang et al., 6 Aug 2025).

Other work diffuses directly over non-image geometric objects. CardioDiT encodes cine CMR into a compact 4D latent grid with a spatiotemporal VQ-GAN and then applies a 4D DiT to patches of size $(1,4,4,2)$, preserving the depth–height–width–time lattice through 4D sine-cosine positional encodings (Seyfarth et al., 26 Mar 2026). Sora3R adapts a pointmap VAE from a pretrained video VAE and diffuses pointmap latents conditioned on RGB video latents, producing dense 4D pointmaps from which depth and camera pose are recovered (Mai et al., 27 Mar 2025). R2LDM avoids range-image and BEV parameterizations and instead performs conditional diffusion on latent voxel features, later decoded by a Latent Point Cloud Reconstruction module into dense LiDAR-like point clouds (Zheng et al., 21 Mar 2025). AnimateMe applies a DDPM directly to mesh-space deformation fields $\mathbf{d}_i=\mathbf{x}_i-\mathbf{x}_0$, while 4DFM, the 4D Facial Expression Diffusion Model, generates sequences of 3D landmarks rather than dense meshes and only reconstructs mesh motion in a second stage (Gerogiannis et al., 2024, Zou et al., 2023).
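
To make the patch-and-position step concrete, here is a minimal sketch of 4D patchification with a $(1,4,4,2)$ patch and a 4D sine-cosine positional encoding. The latent shape, the channel count, the equal split of the embedding across the four axes, and the function names are assumptions chosen for illustration; the CardioDiT implementation may differ.

```python
import numpy as np

def patchify_4d(latent, patch=(1, 4, 4, 2)):
    """latent: (C, D, H, W, T), each axis divisible by its patch size.
    Returns tokens of shape (num_patches, C * prod(patch)) and the patch grid."""
    c, d, h, w, t = latent.shape
    pd, ph, pw, pt = patch
    grid = (d // pd, h // ph, w // pw, t // pt)
    x = latent.reshape(c, grid[0], pd, grid[1], ph, grid[2], pw, grid[3], pt)
    # Bring the four grid axes to the front, then flatten each patch to a vector.
    x = x.transpose(1, 3, 5, 7, 0, 2, 4, 6, 8)
    return x.reshape(int(np.prod(grid)), c * pd * ph * pw * pt), grid

def sincos_1d(positions, dim):
    """Standard sine-cosine embedding of integer positions; dim must be even."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_4d(grid, dim):
    """One 1D embedding per axis (depth, height, width, time), concatenated.
    dim must be divisible by 8 so that each axis gets an even share."""
    axes = np.meshgrid(*[np.arange(g) for g in grid], indexing="ij")
    coords = np.stack(axes, axis=-1).reshape(-1, 4)  # same order as the tokens
    per_axis = dim // 4
    return np.concatenate([sincos_1d(coords[:, a], per_axis) for a in range(4)], axis=1)

latent = np.arange(8 * 4 * 16 * 16 * 8, dtype=np.float32).reshape(8, 4, 16, 16, 8)
tokens, grid = patchify_4d(latent)
pos = sincos_4d(grid, dim=64)
```

With these toy sizes the latent becomes a $4\times4\times4\times4$ grid of 256 tokens, and each token carries a position code that identifies its depth, height, width, and time index separately.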

Hybrid pipelines retain video diffusion as a front end but treat video as an intermediate representation. HoloTime is explicit that it is not a native 4D diffusion model: it first converts a panoramic image into a panoramic video with a two-stage DynamiCrafter-based generator, then estimates space-aligned and space-time depth, converts the result into a 4D point cloud, and optimizes a holistic 4D Gaussian Splatting representation (Zhou et al., 30 Apr 2025). CAT4D, Flex4DHuman, and 4DVD follow the same broad pattern: diffusion first generates dense multi-view videos across time, and a separate optimization stage fits deformable 3D Gaussians or dynamic Gaussian splats (Wu et al., 2024, Cheng et al., 11 Jun 2026, Yang et al., 6 Aug 2025).

3. Architectural patterns for coupling space, time, and view

A major line of work concerns how to couple, or deliberately decouple, space, time, and camera geometry inside attention. OmniView’s main architectural claim is that geometry and time should not be collapsed into a single encoding. It represents camera input as Plücker ray maps $\mathbf{P}\in\mathbb{R}^{6\times H\times W}$, encodes them into camera tokens aligned with video tokens, fixes the temporal coordinate of camera tokens to $0$, and uses channel-wise concatenation rather than additive fusion so that attention scores decompose into separate video-token and camera-token terms (Fan et al., 11 Dec 2025). The same paper argues that naive addition of camera embeddings before 3D RoPE entangles viewpoint with time and harms extrapolation to unseen trajectories.
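
A Plücker ray map assigns every pixel the six coordinates of its viewing ray. The sketch below builds such a map for a pinhole camera. The (direction, moment) channel order, the half-pixel offset, the camera-to-world pose convention, and the function name `plucker_ray_map` are assumptions made for illustration rather than details confirmed by the OmniView paper.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return a (6, H, W) map of per-pixel Plücker coordinates (direction, moment)."""
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs = pixels @ np.linalg.inv(K).T @ R.T               # ray directions in world frame
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(origin, dirs)                        # o x d, broadcast over pixels
    return np.concatenate([dirs, moment], axis=-1).transpose(2, 0, 1)

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [1.0, 0.0, 0.0]  # camera shifted one unit along x
P = plucker_ray_map(K, cam_to_world, height=4, width=4)
```

The first three channels are unit directions and the last three are moments, which are always perpendicular to the direction. The map depends on where the camera is as well as where it looks, which is what makes it usable as a dense pose signal.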

A different but related strategy appears in Flex4DHuman. Built on Wan 2.1’s 1.3B text-to-video DiT backbone, it preserves the backbone architecture but replaces spatio-temporal RoPE with a five-axis positional encoding over time, view index, continuous camera geometry, height, and width. Its PRoPE-style conditioning transforms reserved query and key sub-vectors by relative camera poses, allowing synchronized dense multi-view video generation from monocular or sparse multi-view input without depth, normals, skeletons, or rendered target-view geometry (Cheng et al., 11 Jun 2026). Human4DiT also treats viewpoint and time as first-class axes, but does so through factorized attention: a 2D image transformer models within-frame spatial structure, a view transformer models cross-view correlations, and a temporal transformer handles frame-to-frame dynamics (Shao et al., 2024).
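
The factorized pattern can be sketched as three attention passes over one tensor, each mixing tokens along a single axis. The code below is a parameter-free toy (no learned projections, heads, or normalization), and the names `self_attention` and `factorized_block` are hypothetical; it illustrates only the reshaping logic, not Human4DiT's actual layers.

```python
import numpy as np

def self_attention(x):
    """Parameter-free softmax attention over the second-to-last axis of x (..., L, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_block(z):
    """z: (V, T, N, D) with V views, T frames, N = H*W spatial tokens, D channels."""
    # Spatial pass: tokens within one frame of one view attend to each other.
    z = z + self_attention(z)
    # View pass: move V next to D so the same (t, n) location attends across views.
    z = z + np.moveaxis(self_attention(np.moveaxis(z, 0, 2)), 2, 0)
    # Temporal pass: move T next to D so the same (v, n) location attends across time.
    z = z + np.moveaxis(self_attention(np.moveaxis(z, 1, 2)), 2, 1)
    return z

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 3, 4, 8))
out = factorized_block(z)
```

Joint attention over all $V\cdot T\cdot N$ tokens scales with $(VTN)^2$ score entries, whereas the three passes above need $VTN^2 + TNV^2 + VNT^2$, which is the tractability argument behind this design.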

Several systems instead separate temporal and spatial synthesis into distinct diffusion processes. DiST-4D formalizes this explicitly with DiST-T for temporal RGB-D generation and DiST-S for spatial RGB-D novel view synthesis, using dense metric depth as the bridge between forecasting and view synthesis (Guo et al., 19 Mar 2025). 4DiM adopts yet another decomposition: its cascaded pixel-space diffusion model supports mixed 3D, 4D, and video supervision through masked FiLM layers for pose and timestamp conditioning, and uses ray origins and directions as the pose representation (Watson et al., 2024). 4Diffusion extends a frozen 3D-aware latent diffusion backbone, ImageDream, by inserting a zero-initialized temporal motion module into each UViT block, thereby retaining multi-view spatial consistency while learning temporal structure (Zhang et al., 2024).
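
The effect of zero initialization is easy to verify in a toy setting: if the last projection of an inserted residual module starts at zero, the combined network reproduces the frozen backbone exactly before training begins. The class `TemporalMotionModule`, the stand-in `frozen_spatial_block`, and the mean-over-time mixing rule are all illustrative assumptions, not the 4Diffusion architecture.

```python
import numpy as np

class TemporalMotionModule:
    """Toy residual temporal mixer whose output projection starts at zero."""

    def __init__(self, dim, rng):
        self.w_in = rng.normal(scale=0.02, size=(dim, dim))
        self.w_out = np.zeros((dim, dim))  # zero init: the module is a no-op at step 0

    def __call__(self, x):
        # x: (T, N, D). Summarize all frames, then add the result back residually.
        context = x.mean(axis=0, keepdims=True) @ self.w_in
        return x + np.tanh(context) @ self.w_out

def frozen_spatial_block(x):
    """Stand-in for a pretrained, frozen multi-view block."""
    return np.tanh(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 8))
module = TemporalMotionModule(8, rng)
base = frozen_spatial_block(x)
out_at_init = module(base)  # identical to the frozen backbone output
```

Training then moves `w_out` away from zero, so temporal behavior is learned as a gradual departure from the pretrained multi-view prior rather than as a disruption of it.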

The design debate is therefore not simply “joint versus factorized.” CardioDiT argues that space–time factorization introduces structural bias and can yield inter-slice discontinuities or physiologically implausible motion, motivating direct $3D+t$ modeling (Seyfarth et al., 26 Mar 2026). OmniView, by contrast, argues not against all factorization but against incorrect factorization, specifically the temporal rotation of camera features and additive fusion of view and video embeddings (Fan et al., 11 Dec 2025). Human4DiT, 4DiM, and DiST-4D adopt structured factorization primarily for tractability and controllability (Shao et al., 2024, Watson et al., 2024, Guo et al., 19 Mar 2025).
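
The argument against additive fusion can be checked numerically under a simplifying assumption: that each query and key is built from a video sub-vector and a camera sub-vector, with no further mixing by projections or positional rotations. Under that assumption (which is ours, not a statement of OmniView's exact layer), concatenation yields a score that is a clean sum of two terms, while addition introduces cross terms.

```python
import numpy as np

rng = np.random.default_rng(0)
q_vid, k_vid = rng.normal(size=8), rng.normal(size=8)  # video sub-vectors
q_cam, k_cam = rng.normal(size=8), rng.normal(size=8)  # camera sub-vectors

# Concatenation: the score is a sum of a video-only and a camera-only term.
score_concat = np.concatenate([q_vid, q_cam]) @ np.concatenate([k_vid, k_cam])
video_term, camera_term = q_vid @ k_vid, q_cam @ k_cam

# Addition: two extra cross terms couple video content with camera geometry.
score_add = (q_vid + q_cam) @ (k_vid + k_cam)
cross_terms = q_vid @ k_cam + q_cam @ k_vid
```

In symbols, $[q_v; q_c]\cdot[k_v; k_c] = q_v\cdot k_v + q_c\cdot k_c$, whereas $(q_v+q_c)\cdot(k_v+k_c)$ additionally contains $q_v\cdot k_c + q_c\cdot k_v$.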

4. Training regimes, data composition, and supervision

The scarcity of fully synchronized 4D data is a defining constraint of the field, and much of the recent literature is devoted to constructing mixtures of partial supervision. OmniView trains one model across static multi-view image NVS, monocular video NVS, camera-controlled text-to-video, image-to-video, and video-to-video tasks, sampling from RE10K, DL3DV, ReCamMaster, SynCamMaster, and Stereo4D, and warming up on multi-view static data to adapt Plücker-ray camera conditioning (Fan et al., 11 Dec 2025). 4DiM similarly mixes 3D datasets with pose, 4D datasets with pose and time, and large-scale video data with time but no pose; its cRE10K calibration pipeline is designed specifically to recover metric scale in SfM-posed data so that camera control becomes physically meaningful (Watson et al., 2024).
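
One generic way to train on such a mixture is to mask each conditioning signal per sample, so that a missing pose or timestamp leaves the features untouched. The sketch below shows a masked FiLM-style modulation in that spirit. The function `masked_film`, the weight shapes, and the three-sample batch are hypothetical, and the exact layer used by 4DiM is not reproduced here.

```python
import numpy as np

def masked_film(features, cond, available, w_scale, w_shift):
    """features: (B, D); cond: (B, C); available: (B,) bool flags per sample.
    Samples without the signal get scale = shift = 0, i.e. an identity map."""
    mask = available.astype(features.dtype)[:, None]
    scale = (cond @ w_scale) * mask
    shift = (cond @ w_shift) * mask
    return features * (1.0 + scale) + shift

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
pose, time = rng.normal(size=(3, 6)), rng.normal(size=(3, 1))
shapes = {"pose_scale": (6, 4), "pose_shift": (6, 4), "time_scale": (1, 4), "time_shift": (1, 4)}
w = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

# Batch rows: static 3D sample (pose only), 4D sample (pose and time), raw video (time only).
has_pose = np.array([True, True, False])
has_time = np.array([False, True, True])
h_pose = masked_film(features, pose, has_pose, w["pose_scale"], w["pose_shift"])
h = masked_film(h_pose, time, has_time, w["time_scale"], w["time_shift"])
```

Each sample therefore contributes gradients only to the conditioning pathways it actually supervises, which lets posed, timed, and unposed data share one backbone.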

Several papers construct new datasets because existing corpora are mismatched to 4D requirements. HoloTime introduces 360World, described as the first comprehensive collection of fixed-camera panoramic videos suitable for downstream 4D reconstruction, with 7,497 clips and 5,380,909 frames (Zhou et al., 30 Apr 2025). 4DVD builds D-Objaverse by filtering dynamic Objaverse assets down to 17k samples rendered as 16-view videos, after initially collecting 41k assets (Yang et al., 6 Aug 2025). Diffusion4D curates 54K animated assets from Objaverse-1.0 and Objaverse-XL, filtering out subtle or excessively dramatic motion before rendering 24-frame orbital videos (Liang et al., 2024). Flex4DHuman relies on calibrated multi-view human and animal captures from DNA-Rendering, ActorsHQ, and DFA, and complements these with multi-view captions for test-time text control (Cheng et al., 11 Jun 2026).
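
For a sense of scale, the figures quoted above can be turned into two derived quantities: the average clip length in 360World and the share of D-Objaverse assets that survive filtering. The snippet only does arithmetic on the numbers in the text; the "17k" and "41k" counts are treated as rounded values, which is an assumption.

```python
clips, frames = 7_497, 5_380_909  # 360World, as quoted above
kept, collected = 17_000, 41_000  # D-Objaverse: "17k" of "41k", rounded

frames_per_clip = frames / clips  # average clip length in frames
retention = kept / collected      # fraction of assets surviving the filter

print(f"360World: {frames_per_clip:.1f} frames per clip on average")
print(f"D-Objaverse: {retention:.0%} of collected assets retained")
```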

In domains with stronger supervision, training becomes more geometry-centric. CardioDiT uses public ACDC and MM2 cine CMR datasets plus a larger private cohort, standardizes sequence length to 32 frames by cyclic repetition, and reports that full 4D modeling is feasible on 24 GB VRAM through latent compression and FlashAttention (Seyfarth et al., 26 Mar 2026). R2LDM is trained with paired radar/LiDAR supervision in two stages: first VFE and LPCR on LiDAR reconstruction, then a radar-conditioned latent voxel diffusion model that learns LiDAR-like latent features from radar input (Zheng et al., 21 Mar 2025). Phys4D pushes supervised data generation furthest: after pseudo-supervised pretraining on internet and model-generated videos, it fine-tunes on simulation data produced from approximately 250,000 environments and 1,250,000 videos, with multimodal annotations spanning depth, flow, and physical state evolution (Lu et al., 3 Mar 2026).
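
Standardizing sequence length by cyclic repetition amounts to indexing frames modulo the original length. A minimal sketch, assuming a `(T, ...)` array layout and a hypothetical helper name `cyclic_pad`:

```python
import numpy as np

def cyclic_pad(sequence, target_len=32):
    """Repeat a (T, ...) cine sequence periodically until it has target_len frames.
    Sequences longer than target_len are simply truncated."""
    idx = np.arange(target_len) % sequence.shape[0]
    return sequence[idx]

# Toy cine loop: 25 frames of 2x2 "images", each filled with its frame index.
cine = np.arange(25, dtype=np.float32).reshape(25, 1, 1) * np.ones((1, 2, 2), dtype=np.float32)
padded = cyclic_pad(cine)
```

Frame 25 of the padded sequence is a copy of frame 0, so the model sees a seamless wrap-around only if the acquisition actually covers one full cardiac cycle.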

A recurrent theme is curriculum or staged optimization. Flex4DHuman uses a three-stage curriculum for pose-following adaptation, flexible multi-view generation, and temporal rollout (Cheng et al., 11 Jun 2026). HoloTime uses a two-stage motion-guided panoramic animator followed by 4D lifting (Zhou et al., 30 Apr 2025). Dream-in-4D first learns a static 3D asset under 2D and 3D diffusion guidance, then freezes it and learns motion through a deformable field with video diffusion guidance (Zheng et al., 2023). Such staging reflects a common judgment in the literature: appearance, geometry, and motion are easier to learn when they are not all optimized simultaneously.

5. System families and application domains

One family of 4D diffusion models is native joint modeling, where the diffusion process directly targets a 4D latent or structured 4D variable. CardioDiT is the clearest example, performing unified 4D latent diffusion for short-axis cine CMR and reporting on the private cohort a FID of 21.2, Precision 0.82, Recall 0.54, and an SSIM-based score of 0.69, among other metrics (Seyfarth et al., 26 Mar 2026). OmniView approaches native 4D modeling on the video side: it unifies static NVS, dynamic NVS, image-to-video, text-to-video, and video-to-video redirection in a single latent-space DiT, and reports improvements of up to 33% in LLFF SSIM, 60% on Neural 3D Video, 20% on static camera control on RE10K, and a reduction in camera error for text-conditioned video generation (Fan et al., 11 Dec 2025).

A second family comprises hybrid diffusion-plus-reconstruction systems. HoloTime generates panoramic video and then reconstructs a 4D Gaussian scene for VR/AR (Zhou et al., 30 Apr 2025). Diffusion4D generates orbital videos and reconstructs 4D Gaussian splats in a coarse-to-fine manner, reporting an overall generation time of about 8 minutes versus 23 hours for 4D-fy and 9 hours for Animate124 (Liang et al., 2024). CAT4D uses a multi-view video diffusion model to create a large space–time “data cube” from monocular video before fitting deformable 3D Gaussians (Wu et al., 2024). 4DVD turns monocular video into dense 16-view video grids and reports LPIPS 0.133, CLIP-S 0.927, FVD-F 507.12, FVD-V 314.44, and FVD-Diag 456.01 on multi-view videos, with downstream 4D assets reaching LPIPS 0.136, CLIP-S 0.919, and FVD 438.41 (Yang et al., 6 Aug 2025).

A third family is feed-forward explicit 4D reconstruction from diffusion latents. Diff4Splat predicts a deformable 3D Gaussian field from a single image, camera trajectory, and optional text prompt, with a reported reconstruction time of 30 seconds and an interactive exploration latency of 6.7 ms (Pan et al., 1 Nov 2025). TriDiff-4D performs text-to-4D avatar generation through diffusion-based triplane re-posing, reporting 0.6 minutes for a 14-frame sequence versus 6.5 hours for MAV3D and 23 hours for 4D-fy (Sheung et al., 20 Nov 2025). Sora3R uses video-diffusion priors not for generation from prompts but for feed-forward 4D geometry reconstruction from casual video via pointmaps (Mai et al., 27 Mar 2025).
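
The runtimes quoted in this section for Diffusion4D and TriDiff-4D translate into speedup factors by simple division. The snippet below works this out from the numbers in the text; it treats "about 8 minutes" as exactly 8, and the helper `speedup` is introduced only for illustration.

```python
def speedup(baseline_minutes, method_minutes):
    """How many times faster the method is than the baseline."""
    return baseline_minutes / method_minutes

HOUR = 60.0
# Diffusion4D: about 8 minutes versus 23 hours (4D-fy) and 9 hours (Animate124).
diffusion4d = {"4D-fy": speedup(23 * HOUR, 8.0), "Animate124": speedup(9 * HOUR, 8.0)}
# TriDiff-4D: 0.6 minutes versus 6.5 hours (MAV3D) and 23 hours (4D-fy).
tridiff4d = {"MAV3D": speedup(6.5 * HOUR, 0.6), "4D-fy": speedup(23 * HOUR, 0.6)}

for name, table in [("Diffusion4D", diffusion4d), ("TriDiff-4D", tridiff4d)]:
    for baseline, factor in table.items():
        print(f"{name} vs {baseline}: about {factor:.0f}x faster")
```

These ratios are indicative only, since the papers differ in hardware, output length, and task definition.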

The application range is correspondingly broad. Human-centric systems include Human4DiT for free-view human videos, Flex4DHuman for dense target-view video generation and subsequent dynamic Gaussian reconstruction, AnimateMe and 4DFM for facial animation, and TriDiff-4D for text-driven avatars (Shao et al., 2024, Cheng et al., 11 Jun 2026, Gerogiannis et al., 2024, Zou et al., 2023, Sheung et al., 20 Nov 2025). Panoramic and immersive scene generation is addressed by HoloTime (Zhou et al., 30 Apr 2025). Driving scenes are handled by DiST-4D, which generates future RGB-D and novel views feed-forward from multi-camera observations and control signals (Guo et al., 19 Mar 2025). Radar super-resolution extends the paradigm beyond RGB, with R2LDM reporting 6- to 10-fold densification of radar point clouds and downstream gains of up to 31.7% in registration recall and 24.9% in object detection accuracy (Zheng et al., 21 Mar 2025). Phys4D reframes video diffusion as a route to 4D world modeling with explicit physics consistency rather than only visual plausibility (Lu et al., 3 Mar 2026).

6. Misconceptions, limitations, and open technical questions

A common misconception is that all 4D diffusion models directly denoise explicit 4D scene representations. Several papers reject this interpretation explicitly. HoloTime states that it is not a true 4D diffusion model because diffusion generates panoramic video rather than 4D Gaussians or point clouds directly (Zhou et al., 30 Apr 2025). Flex4DHuman emphasizes that its diffusion model generates synchronized multi-view videos, while the actual 4D Gaussian splats are reconstructed by FreeTimeGS (Cheng et al., 11 Jun 2026). CAT4D, Diffusion4D, 4DVD, and Zero4D make the same separation between diffusion-based view synthesis and downstream or implicit 4D reconstruction (Wu et al., 2024, Liang et al., 2024, Yang et al., 6 Aug 2025, Park et al., 28 Mar 2025). The field therefore contains both native 4D diffusion and diffusion-enabled 4D construction.

Another recurrent misconception is that visual realism implies world consistency. Phys4D is explicit that appearance-driven video diffusion models can remain physically implausible, motivating depth heads, motion heads, warp losses, simulator-grounded fine-tuning, and RL with a 4D Chamfer reward (Lu et al., 3 Mar 2026). Sora3R likewise shows that video-diffusion priors can be repurposed for 4D geometry reconstruction, but its results remain sensitive to pointmap quality and downstream PnP recovery (Mai et al., 27 Mar 2025). These papers mark an important shift from “4D as coherent video” toward “4D as coherent evolving geometry.”
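
A Chamfer-style reward compares predicted and reference geometry frame by frame. The sketch below is one plausible reading, assuming point clouds per frame, a symmetric squared-distance Chamfer term, and a reward equal to the negative temporal mean; the definition actually used by Phys4D is not given in the text, and the function names are hypothetical.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def chamfer_4d_reward(pred_frames, ref_frames):
    """Negative mean per-frame Chamfer distance over two equal-length sequences."""
    per_frame = [chamfer(p, r) for p, r in zip(pred_frames, ref_frames)]
    return -float(np.mean(per_frame))

# Toy sequence: the eight corners of a unit cube drifting upward over three frames.
cube = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
ref_frames = [cube + np.array([0.0, 0.0, 0.05 * t]) for t in range(3)]
pred_frames = [frame + np.array([0.1, 0.0, 0.0]) for frame in ref_frames]  # constant x offset
reward = chamfer_4d_reward(pred_frames, ref_frames)
```

A perfect prediction scores zero and any geometric error makes the reward negative, so the signal penalizes shape drift that a purely appearance-based loss could miss.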

The main limitations are strikingly consistent across papers. Resolution and sequence length remain restricted in many systems: OmniView is trained at latent-video resolution and on limited frame/view counts; Diffusion4D reports that its current video size is limited; CardioDiT notes that cyclic repetition to 32 frames may impose artificial periodicity; and Zero4D depends on a structured camera–time grid and on the quality of monocular depth used for warping (Fan et al., 11 Dec 2025, Liang et al., 2024, Seyfarth et al., 26 Mar 2026, Park et al., 28 Mar 2025). Generalization is often bounded by trajectory coverage or supervision quality: Flex4DHuman notes limited robustness to dynamic camera motion and possible long-rollout drift; HoloTime avoids moving-camera panoramic datasets by design; DiST-4D attributes some failures to imperfect pseudo metric depth, especially for distant structures (Cheng et al., 11 Jun 2026, Zhou et al., 30 Apr 2025, Guo et al., 19 Mar 2025).

Several open directions are directly suggested by the surveyed work. CardioDiT points to variable-length 4D modeling beyond cyclic repetition (Seyfarth et al., 26 Mar 2026). 4DVD proposes adapting its dense-view, structure-aware design to diffusion transformers such as CogVideoX (Yang et al., 6 Aug 2025). Flex4DHuman points to self-forcing or diffusion forcing for better long-horizon consistency (Cheng et al., 11 Jun 2026). TriDiff-4D mentions flow matching and improved architectures as future work (Sheung et al., 20 Nov 2025). Dream-in-4D identifies stronger 2D and 3D priors as the critical bottleneck when the canonical static stage fails (Zheng et al., 2023). Taken together, these directions indicate that the next phase of 4D diffusion research is likely to focus less on proving that diffusion can address 4D problems, and more on unifying explicit geometry, controllable viewpoint–time generation, long-horizon coherence, and physically grounded dynamics within scalable training and inference regimes.