MoRe4D: Motion & Geometry for 4D Synthesis

Updated 9 December 2025
  • MoRe4D is a joint framework that integrates 3D geometry and motion using diffusion models to generate dynamic 4D scenes from sparse inputs.
  • It leverages joint latent encoding, depth-guided normalization, and motion-aware conditioning to ensure spatiotemporal coherence and physical plausibility.
  • MoRe4D achieves superior performance with multi-view consistency and reduced training time compared to decoupled pipelines, boosting real-world applicability.

Motion Generation and Geometric Reconstruction for 4D Synthesis (MoRe4D) encompasses a class of methods that jointly model the evolution of 3D geometry and scene motion to synthesize dynamic, coherent 4D worlds, typically from sparse data such as a single image or monocular video. Unlike pipelines that decouple geometry and motion, which often leads to spatiotemporal inconsistencies, MoRe4D approaches structure the problem so that geometric and dynamic information is tightly coupled throughout generation, reconstruction, and rendering. These frameworks rely on diffusion-based generative models, variational encoding, geometric priors, flow matching, and spatiotemporal regularization, yielding robust multi-view consistency and physically plausible scene dynamics from minimal supervision (Zhang et al., 4 Dec 2025).

1. Joint 4D Representation: Geometry and Motion

MoRe4D methods encode a dynamic scene as a time-indexed set of high-dimensional primitives, typically dense 4D point trajectories or deformable Gaussian splatting fields, where each element tracks both geometry (spatial position, appearance) and motion (temporal displacement, deformation) (Zhang et al., 4 Dec 2025, Pan et al., 1 Nov 2025).

In the dominant paradigm, each frame $t$ is associated with a set of primitives $\mathcal{P}_t^{uv} = (x_t, y_t, z_t, o_t)$ for every pixel $(u,v)$, where the fourth channel indexes occlusion for view synthesis. These primitives are linked through dense temporal tracks, enabling the recovery of 3D motion fields and geometric attributes directly from pixel-level correspondences. Gaussian Splatting parameterizes such a scene by a time-varying center $\mu_i(t)$, scale/rotation, view-dependent color, and opacity, with a deformation field $D$ modulating offsets as a function of canonical positions and time (Zhang et al., 4 Dec 2025, Pan et al., 1 Nov 2025, Liu et al., 10 Nov 2025).
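As a concrete illustration, the two representations above can be sketched as plain data containers. Field names, shapes, and the deformation-field interface below are illustrative assumptions, not the papers' exact layouts.

```python
# Minimal sketch of the time-indexed primitives described above (assumed layout).
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryPrimitives:
    """Dense per-pixel 4D point trajectories: one primitive per pixel per frame."""
    xyz: np.ndarray        # (T, H, W, 3) world-space positions (x_t, y_t, z_t)
    occlusion: np.ndarray  # (T, H, W)    o_t, occlusion indicator for view synthesis

@dataclass
class DeformableGaussians:
    """Time-varying Gaussian-splatting parameterization of the same scene."""
    mu_canonical: np.ndarray  # (N, 3) canonical centers
    scale: np.ndarray         # (N, 3) per-axis scales
    rotation: np.ndarray      # (N, 4) unit quaternions
    color: np.ndarray         # (N, 3) base color (view-dependent SH in practice)
    opacity: np.ndarray       # (N,)

    def centers_at(self, t, deform):
        """mu_i(t): canonical center plus the offset from a deformation field D(x, t)."""
        return self.mu_canonical + deform(self.mu_canonical, t)
```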

The tight coupling of geometry and motion is achieved via joint encoding strategies in video VAEs and DiT-style transformers. Latent concatenation of appearance (image), geometry (depth), and motion (point trajectory difference maps) forms a composite input to the backbone network, which denoises and reconstructs the full spatiotemporal sequence (Zhang et al., 4 Dec 2025). Depth-guided normalization further stabilizes latent encoding, promoting scale invariance for physical plausibility.
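Assuming the appearance, depth, and motion latents share a spatial grid, the composite input can be sketched as a channel-wise concatenation. The average-pooling encoder below is only a stand-in for the video VAE's motion branch, and all shapes and the stride are illustrative; depth-guided normalization of the motion channel is covered in Section 2.

```python
# Sketch of the composite latent: appearance + geometry + motion, concatenated
# along the channel axis before entering the DiT backbone (assumed layout).
import numpy as np

def encode_motion(motion, stride=8):
    """Stand-in for the motion branch of the video VAE: average-pool to latent resolution."""
    T, H, W, C = motion.shape
    m = motion[:, :H - H % stride, :W - W % stride]
    h, w = m.shape[1] // stride, m.shape[2] // stride
    m = m.reshape(T, h, stride, w, stride, C).mean(axis=(2, 4))
    return m.transpose(0, 3, 1, 2)                         # (T, C, h, w)

def build_joint_latent(rgb_latent, depth_latent, traj):
    """rgb_latent, depth_latent: (T, C, h, w); traj: (T, H, W, 3) dense point tracks."""
    # Motion channel: frame-to-frame displacement of the dense point trajectories.
    motion = np.diff(traj, axis=0, prepend=traj[:1])        # (T, H, W, 3)
    motion_latent = encode_motion(motion)                   # (T, 3, h, w)
    # Composite latent fed to the backbone for joint denoising.
    return np.concatenate([rgb_latent, depth_latent, motion_latent], axis=1)
```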

2. Diffusion-Based Scene Trajectory Generation

Core to MoRe4D is the use of diffusion models in latent video, image, or point-cloud space. The forward process introduces Gaussian noise along a schedule $\{\beta_t\}$: $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$, with the denoising process parameterized by a transformer network predicting clean latent reconstructions (Zhang et al., 4 Dec 2025, Mi et al., 24 Nov 2025).
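The forward step admits a short, generic DDPM-style sketch; here $\alpha_t$ denotes the cumulative product of $(1-\beta_t)$, and the schedule endpoints and step count are illustrative rather than the paper's values.

```python
# Generic sketch of the forward noising step stated above.
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, num_steps)    # schedule {beta_t}
    return np.cumprod(1.0 - betas)                          # cumulative alpha_t in the formula

def forward_diffuse(x0, t, alpha_bar, rng):
    """x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps   # the denoiser is trained to recover the clean latent from x_t
```

A training step would sample $t$, noise the joint latent from Section 1 with `forward_diffuse`, and regress the clean latent (or the noise) with the transformer backbone.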

A key element is depth-guided motion normalization: it rescales motion vectors according to per-pixel depth, ensuring that the same world-space displacement has an equivalent magnitude in latent space regardless of its distance from the camera, which mitigates geometric distortions (Zhang et al., 4 Dec 2025).
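A minimal sketch of this normalization, assuming motion is stored as per-pixel displacement maps with depth available at the same resolution; the clamping epsilon is an illustrative choice.

```python
import numpy as np

def depth_normalize_motion(motion, depth, eps=1e-3):
    """motion: (T, H, W, 3) world-space displacements; depth: (T, H, W) per-pixel depth."""
    return motion / np.clip(depth[..., None], eps, None)

def depth_denormalize_motion(motion_norm, depth, eps=1e-3):
    """Invert the normalization after decoding to recover world-space displacements."""
    return motion_norm * np.clip(depth[..., None], eps, None)
```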

3. 4D View Synthesis and Inpainting

To achieve multi-view-consistent video synthesis under arbitrary camera trajectories, rendered point clouds or Gaussian fields are projected and composited via splatting methods (Zhang et al., 4 Dec 2025).

The rendering equation at each pixel $(u,v)$ and time $t$ is:

$$
C(u,v) = \frac{\sum_n w_n(u,v)\,c_n}{\sum_n w_n(u,v)}, \qquad
w_n(u,v) = \exp\!\left(-\frac{\|\pi(P_n)-[u,v]\|^2}{2\sigma_n^2}\right)
$$

where $\sigma_n$ is the per-point Gaussian radius, $\pi$ projects 3D points into the target view, and $c_n$ is the point color.
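The equation can be transcribed directly, as in the unoptimized sketch below; a practical splatting renderer rasterizes per primitive rather than looping over all pixel-point pairs, and the pinhole projection used for $\pi$ is an assumption.

```python
# Direct NumPy transcription of the rendering equation above (for clarity only).
import numpy as np

def project(points_cam, K):
    """pi(P_n): camera-space points (N, 3) -> pixel coordinates (N, 2)."""
    uvw = (K @ points_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def render(points_cam, colors, sigmas, K, height, width):
    """points_cam: (N, 3), colors: (N, 3), sigmas: (N,) per-point Gaussian radii."""
    uv_n = project(points_cam, K)                                   # (N, 2)
    vs, us = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([us, vs], axis=-1).reshape(-1, 1, 2)             # (H*W, 1, 2)
    d2 = np.sum((uv_n[None, :, :] - pix) ** 2, axis=-1)             # ||pi(P_n) - [u, v]||^2
    w = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))                  # w_n(u, v)
    w_sum = w.sum(axis=1)                                           # splat coverage per pixel
    image = (w @ colors) / np.clip(w_sum[:, None], 1e-8, None)      # C(u, v)
    return image.reshape(height, width, 3), w_sum.reshape(height, width)
```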

Novel views may exhibit holes or missing pixels due to occlusion or lack of trajectory coverage. Here, inpainting diffusion modules—finetuned on rendered frames and binary masks—learn to fill missing regions with plausible and geometry-consistent content, leveraging both latent image features and binary occupancy (Zhang et al., 4 Dec 2025).
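One plausible way to assemble the inpainting module's conditioning is sketched below, assuming hole masks are derived from splat coverage (e.g., the weight sum returned by the renderer sketch above); the threshold and channel layout are illustrative.

```python
import numpy as np

def inpainting_condition(rendered, weight_sum, threshold=1e-4):
    """rendered: (H, W, 3) splatted frame; weight_sum: (H, W) accumulated splat weights."""
    mask = (weight_sum > threshold).astype(np.float32)       # 1 = covered, 0 = hole to fill
    masked_frame = rendered * mask[..., None]                 # zero out unobserved pixels
    return np.concatenate([masked_frame, mask[..., None]], axis=-1)  # (H, W, 4) condition
```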

Post-optimization steps for camera intrinsics, pose refinement, and per-pixel depth estimation minimize photometric and smoothness losses, bringing rendered outputs into alignment with real observations (Mi et al., 24 Nov 2025).
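A minimal sketch of such a post-optimization objective, combining an L1 photometric term with a first-order depth smoothness prior; the exact loss forms and weighting are assumptions, not the papers' definitions.

```python
import numpy as np

def photometric_loss(rendered, observed):
    """L1 difference between the rendered frame and the real observation."""
    return np.mean(np.abs(rendered - observed))

def depth_smoothness(depth):
    """First-order smoothness prior on the per-pixel depth map."""
    return np.mean(np.abs(np.diff(depth, axis=0))) + np.mean(np.abs(np.diff(depth, axis=1)))

def refinement_objective(rendered, observed, depth, lam=0.1):
    # Minimized jointly over camera intrinsics, poses, and per-pixel depth.
    return photometric_loss(rendered, observed) + lam * depth_smoothness(depth)
```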

4. Datasets, Training, and Evaluation Protocols

Large-scale, motion-rich datasets are essential for joint motion-geometry modeling. TrajScene-60K compiles 60,000 video samples with dense point trajectories derived from filtered WebVid-10M clips, focusing on self-initiated, countable motions with high geometric quality (Zhang et al., 4 Dec 2025). Synthetic datasets (OmniWorld-Game, BEDLAM) and real videos (SpatialVID) extend coverage with rendered and pseudo-geometry supervision (Mi et al., 24 Nov 2025).

Models are trained with AdamW optimizers and varying learning rates, and per-sample run times are short (e.g., about 6 minutes per sample at 49 frames and 512×368 resolution, significantly faster than prior art) (Zhang et al., 4 Dec 2025). Regularization includes a KL loss on VAE latents, weight decay for the transformers, and flow-matching losses.
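The flow-matching loss can be sketched in its rectified-flow form, where the network regresses the constant velocity along a linear path between a clean latent and Gaussian noise; MoRe4D's exact parameterization may differ.

```python
import numpy as np

def flow_matching_loss(model, x0, rng):
    """x0: (B, ...) clean latents; model(x_t, t) predicts a velocity field."""
    x1 = rng.standard_normal(x0.shape)                          # noise endpoint
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1)) # per-sample time
    xt = (1.0 - t) * x0 + t * x1                                # linear interpolation path
    target_v = x1 - x0                                          # constant velocity target
    return np.mean((model(xt, t) - target_v) ** 2)
```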

Evaluation is multi-faceted:

  • VBench: Subject/Background Consistency, Motion Smoothness, Dynamic Degree, Aesthetic Quality, Imaging Quality.
  • VLM-based 4D Consistency: 3D Geometric Consistency, Temporal Texture Stability, Motion-Geometry Coupling.
  • Scene metrics: Abs Rel, $\delta < 1.25$, PSNR, SSIM, LPIPS, FVD (the two depth metrics are sketched below).

MoRe4D frameworks consistently outperform baselines such as 4Real, GenXD, DimensionX, Gen3C, and Free4D by 10–15% absolute on aesthetic and imaging quality, and by +1.5 on the VLM consistency metrics (Zhang et al., 4 Dec 2025).
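For reference, the two depth metrics in the list above have standard definitions (invalid-pixel masking omitted for brevity):

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative depth error."""
    return np.mean(np.abs(pred - gt) / gt)

def delta_inlier_ratio(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold (delta < 1.25)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)
```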

5. Comparative Analysis and Performance

MoRe4D’s tightly coupled joint diffusion of geometry and motion prevents “geometry collapse” or loss of spatial detail common in decoupled or template-driven pipelines (Zhang et al., 4 Dec 2025). Depth-guided normalization promotes metric correctness, while motion-aware conditioning injects physically plausible dynamics.

Comparative results show superior subject consistency, background stability, motion-geometry coupling, and scaling to held-out real scenes. Ablations confirm the necessity of unified latent conditioning: omitting motion-aware features or depth normalization degrades spatiotemporal coherence and physical realism (Zhang et al., 4 Dec 2025).

In both full-video and sparse-frame settings, performance degrades gracefully as observation density is reduced, with robust geometry persisting even with less than 10% of frames as input (Mi et al., 24 Nov 2025). Motion and geometric fidelity, as measured by VBench, remain consistently high across settings.

6. Methodological Innovations and Future Directions

MoRe4D stands out by:

  • Integrating depth, motion, and appearance priors into joint latent encodings, avoiding the static-template limitation of earlier pipelines.
  • Using flow-matching and modality-specific branches for scalable training, reducing step counts by orders of magnitude versus concatenation-based approaches.
  • Employing learned inpainting for novel-view robustness.

Open directions include extending the point-based formulation to mesh or neural field representations, further improvements in cross-modal interaction modeling, and scaling up datasets for more diverse, articulated, or long-horizon scenes (Zhao et al., 22 Oct 2025).

7. Context and Outlook within 4D Synthesis

MoRe4D approaches typify a shift towards unified, data-driven pipelines capable of generalizing from sparse observations to realistic, dynamic 4D scenes with high geometric and motion fidelity. Their methods contrast with generate-then-reconstruct and reconstruct-then-generate pipelines, avoiding their typical pitfalls of static geometry or motion artifacts. By leveraging large-scale motion datasets, flow-matching diffusion backbones, depth-aware normalization, and joint motion-geometry conditioning, MoRe4D demonstrates state-of-the-art spatiotemporal coherence and generalization (Zhang et al., 4 Dec 2025, Mi et al., 24 Nov 2025).

MoRe4D frameworks are central to the next generation of real-time, interactive 4D scene synthesis, with direct impact on downstream perception, planning, and AR/VR applications. The paradigm points towards future architectures where geometry, motion, and semantic interaction are generated in mutual consistency from minimal input, establishing benchmarks for the domain.
