
Factored 4D Representation: Dynamic Scene Modeling

Updated 17 December 2025
  • A factored 4D representation disentangles dynamic 3D data over time into separate, interpretable modules—geometry, motion, and interaction—for enhanced learning and simulation.
  • It employs techniques such as deformation fields and latent factorization to achieve state-of-the-art results in dynamic scene reconstruction, articulated tracking, and physical simulations.
  • This structured approach supports efficient computation and robust generalization across fields like graphics, robotics, and medical imaging by integrating deep learning with analytical priors.

A factored 4D representation is a structured encoding of dynamic 3D data over time, where geometry, motion, and (in some formulations) interaction components are disentangled into separate, interpretable modules. This modularization provides improved generalization, efficient learning, actionable semantics, and computational acceleration in dynamic perception and simulation problems. Factored 4D representations are used across computational graphics, vision, robotics, medical imaging, and physics, supporting applications such as dynamic scene reconstruction, articulated tracking, multi-agent understanding, and efficient simulation. Modern approaches integrate deep learning with analytical priors, continuous-time modeling, and explicit domain decompositions.

1. Formal Definitions and High-Level Taxonomy

A 4D representation encodes a function

$$f(\mathbf{x},t): \mathbb{R}^3 \times \mathcal{T} \rightarrow \mathcal{A}$$

where $\mathbf{x}$ is a 3D location, $t$ is time (continuous or discrete), and $\mathcal{A}$ is an attribute space (occupancy, color, MRI intensity, etc.). A "factored" approach decomposes $f$ into modules reflecting geometry, motion, and (optionally) interaction:

$$f(\mathbf{x}, t) \approx G\big(D(\mathbf{x}, t; \theta_D); \theta_G\big) + \Delta_\text{inter}(t; \theta_I)$$

where $G$ is a canonical geometry, $D$ a deformation or motion field, and $\Delta_\text{inter}$ models inter-agent or contact effects. This decomposition may be additive, compositional, or hierarchical depending on the methodological context (Zhao et al., 22 Oct 2025).
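
A minimal sketch of this composition, assuming small MLPs for the deformation field $D$, the canonical geometry $G$, and an interaction offset $\Delta_\text{inter}$; the layer sizes, the occupancy-logit output, and the offset-style deformation are illustrative choices, not a specific method's architecture.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    # Small fully connected network used for each factor.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class Factored4DField(nn.Module):
    """f(x, t) ~ G(D(x, t; theta_D); theta_G) + Delta_inter(t; theta_I)."""
    def __init__(self):
        super().__init__()
        self.D = mlp(4, 3)   # deformation/motion: (x, t) -> offset to canonical coordinates
        self.G = mlp(3, 1)   # canonical geometry: canonical point -> occupancy logit
        self.I = mlp(1, 1)   # interaction/contact term: t -> scalar correction

    def forward(self, x, t):
        # x: (N, 3) query points, t: (N, 1) timestamps.
        x_canonical = x + self.D(torch.cat([x, t], dim=-1))   # deform query into canonical frame
        return self.G(x_canonical) + self.I(t)                # geometry plus interaction offset

field = Factored4DField()
occupancy = field(torch.rand(1024, 3), torch.rand(1024, 1))   # (1024, 1) occupancy logits
```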

Families of factored 4D representations include deformation-field models, latent motion codes, factorized flows, and block decompositions (detailed in Section 2).

This architecture allows each component to leverage bespoke priors, supervision signals, and computational routines.

2. Geometric and Motion Decomposition: Representative Methods

Canonical Geometry in Implicit Fields or Primitives

The geometry term $G$ encodes the canonical (time-independent) shape, either as an implicit field (e.g., an occupancy network) or as a set of explicit primitives.

Motion and Deformation

Motion is modeled as:

  • Deformation field $D(\mathbf{x}, t)$: maps a spatial point from the canonical space to its posed location at time $t$; often learned as an MLP.
  • Latent motion codes: summarizing global or local temporal evolution, as in H4D ($\mathbf{m}$) or Neural ODE latent flows (Jiang et al., 2021, Jiang et al., 2022).
  • Factorized flows: such as the combination of articulation-driven linear blend skinning (LBS) and query-wise residuals represented in a truncated Fourier basis (FourierHandFlow) (Lee et al., 2023); see the sketch after this list.
  • Block decompositions: physical partitioning of the domain for noise, parallelization, or locality, with auxiliary fields per block in gauge theories (Giusti et al., 2022).
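
To make the truncated Fourier basis in the factorized-flow item concrete, the following sketch evaluates a query-wise residual trajectory as a low-order Fourier series in time; the coefficient shapes, the harmonic count, and the period-1 time normalization are illustrative assumptions rather than the exact FourierHandFlow parameterization.

```python
import numpy as np

def fourier_residual(coeffs_cos, coeffs_sin, t, period=1.0):
    """Evaluate a per-query residual flow as a truncated Fourier series in time.

    coeffs_cos, coeffs_sin: (K, 3) arrays of per-harmonic 3D coefficients.
    t: scalar or (M,) array of timestamps.
    Returns the 3D residual displacement(s) at time t, shape (M, 3).
    """
    t = np.atleast_1d(t)                               # (M,)
    k = np.arange(1, coeffs_cos.shape[0] + 1)          # harmonics 1..K
    phase = 2.0 * np.pi * np.outer(t, k) / period      # (M, K)
    # Sum of cosine and sine harmonics, weighted by 3D coefficients.
    return np.cos(phase) @ coeffs_cos + np.sin(phase) @ coeffs_sin

# Example: a single query with K = 4 harmonics; coefficients stand in for network outputs.
rng = np.random.default_rng(0)
residual = fourier_residual(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), t=0.25)
# The full motion would then combine LBS(x, pose(t)) with this residual, per the factorization above.
```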

Interaction/Contact

When modeling articulated or interactive scenes:

  • Scene graphs or canonical maps: cluster temporally-evolving points to canonical part centroids via learned per-frame offsets (Gomes et al., 7 Nov 2025); a toy sketch follows this list.
  • Interaction modules: Force, contact, or relational fields, sometimes realized as auxiliary neural modules or graph networks (Zhao et al., 22 Oct 2025).
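
A toy sketch of the canonical-map idea above, assuming a hard nearest-centroid assignment after adding predicted per-point offsets; the offset and centroid inputs stand in for network predictions, and the association rule is an illustrative choice rather than CanonSeg4D's exact procedure.

```python
import numpy as np

def assign_to_canonical_parts(points_t, offsets_t, part_centroids):
    """Map time-t points to canonical space and cluster them to part centroids.

    points_t:       (N, 3) observed points at time t
    offsets_t:      (N, 3) predicted per-point offsets to the canonical frame
    part_centroids: (P, 3) canonical part centroids
    Returns (canonical_points, part_ids).
    """
    canonical = points_t + offsets_t                       # per-frame canonicalization
    d2 = ((canonical[:, None, :] - part_centroids[None, :, :]) ** 2).sum(-1)  # (N, P) squared distances
    return canonical, d2.argmin(axis=1)                    # nearest-centroid assignment

# Toy usage with random arrays in place of learned predictions.
rng = np.random.default_rng(1)
pts, offs = rng.normal(size=(500, 3)), rng.normal(scale=0.1, size=(500, 3))
centroids = rng.normal(size=(8, 3))
canon_pts, part_ids = assign_to_canonical_parts(pts, offs, centroids)
```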

3. Mathematical Formalism and Optimization Objectives

Several mathematical motifs recur:

Core Decomposition Equations

  • General factorization:

$$f(\mathbf{x},t) = G\big(\psi(\mathbf{x},t); \theta_G\big)$$

with $\psi(\mathbf{x},t) = D(\mathbf{x},t;\theta_D)$ encapsulating deformation (Zhao et al., 22 Oct 2025).

  • Additive form:

$$f(\mathbf{x},t) = G(\mathbf{x}; \theta_G) + \Delta_\text{motion}(\mathbf{x},t; \theta_D) + \Delta_\text{inter}(\cdot)$$

Canonical mappings:

$$f_t(\mathbf{p}_t) = \mathbf{p}_t + g_\theta\big(\mathbf{f}_t(\mathbf{p}_t)\big)$$

$$\Phi(\mathbf{p}, t) = \Phi^{\mathrm{pose}}(\mathbf{p}, t) + \Phi^{\mathrm{shape}}(\mathbf{p}, t)$$

with each term expanded in a truncated Fourier series (Lee et al., 2023).

Block Factorization for Physical Simulations:

$$\det D_w = \prod_{k=0}^{4} \prod_{B_k} \det W_k(B_k)$$

where $W_k(B_k)$ are Schur complements on hierarchical boundaries (Giusti et al., 2022).
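
The Schur-complement identity behind such block factorizations can be checked directly on a small dense matrix: for a two-block partition, $\det M = \det A \cdot \det(D - C A^{-1} B)$. The sketch below verifies this numerically; the lattice construction iterates the same idea over hierarchical boundaries.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(8, 8)) + 8 * np.eye(8)    # well-conditioned test matrix

# Two-block partition M = [[A, B], [C, D]] with 4x4 blocks.
A, B = M[:4, :4], M[:4, 4:]
C, D = M[4:, :4], M[4:, 4:]

schur = D - C @ np.linalg.solve(A, B)          # Schur complement of A in M
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(schur)  # factored determinant
assert np.isclose(lhs, rhs), (lhs, rhs)
```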

Loss Functions

  • Photometric, reconstruction, chamfer, occupancy losses: applied to both geometry and time-warped predictions.
  • Temporal consistency and disentanglement losses: e.g., $L_\text{dis} = \|\nabla_\mathbf{x} G\|_1 + \|\nabla_t D\|_1 + \lambda \langle \nabla_\mathbf{x} G, \nabla_t D\rangle$ (Zhao et al., 22 Oct 2025).
  • Canonical alignment metrics: L1 and cosine similarity of predicted and target canonical offsets (Gomes et al., 7 Nov 2025).
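
A hedged sketch of the disentanglement term above using automatic differentiation; the per-coordinate temporal gradient of $D$, the batch-mean reduction, and the linear stand-in networks in the usage lines are illustrative assumptions, not a reference implementation.

```python
import torch

def disentanglement_loss(G, D, x, t, lam=0.1):
    """L_dis = ||grad_x G||_1 + ||grad_t D||_1 + lam * <grad_x G, grad_t D>, averaged over the batch.

    G: canonical geometry network, (N, 3) -> (N, 1)
    D: deformation network, (N, 4) -> (N, 3)
    """
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)

    # Spatial gradient of the canonical geometry: (N, 3).
    g = G(x)
    grad_x_G = torch.autograd.grad(g.sum(), x, create_graph=True)[0]

    # Temporal gradient of the deformation, one output coordinate at a time: (N, 3).
    d = D(torch.cat([x, t], dim=-1))
    grad_t_D = torch.cat(
        [torch.autograd.grad(d[:, i].sum(), t, create_graph=True)[0] for i in range(3)],
        dim=-1,
    )

    l1 = grad_x_G.abs().sum(-1) + grad_t_D.abs().sum(-1)
    cross = (grad_x_G * grad_t_D).sum(-1)    # inner-product term discouraging entanglement
    return (l1 + lam * cross).mean()

# Toy usage with linear stand-ins for the geometry and deformation networks.
G = torch.nn.Linear(3, 1)
D = torch.nn.Linear(4, 3)
loss = disentanglement_loss(G, D, torch.rand(64, 3), torch.rand(64, 1))
loss.backward()
```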

4. Model Architectures and Training Paradigms

Factored 4D models exploit the decomposition in network design, dedicating separate modules, parameters, and supervision signals to the geometry, motion, and interaction factors.

Supervision may target only the relevant factor (e.g., flow, depth, canonical alignment), enabling semi-supervised and partial-label learning (Karhade et al., 11 Dec 2025).
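
One simple way to realize factor-specific, partial-label supervision is to mask each factor's loss by label availability, as in the generic sketch below; the factor names, weights, and L1 objective are illustrative assumptions, not Any4D's actual training recipe.

```python
import torch

FACTOR_WEIGHTS = {"flow": 1.0, "depth": 1.0, "canon": 0.5}   # illustrative per-factor weights

def masked_factor_loss(pred, target, valid_mask):
    """Mean L1 loss over entries where supervision exists (valid_mask == True)."""
    if valid_mask.any():
        return (pred - target)[valid_mask].abs().mean()
    return pred.new_zeros(())   # this factor is unsupervised in the current batch

def total_loss(preds, targets, masks):
    # Each factor (flow, depth, canonical alignment) contributes only where labelled.
    return sum(w * masked_factor_loss(preds[k], targets[k], masks[k])
               for k, w in FACTOR_WEIGHTS.items())

# Toy usage: depth labels exist for only half the batch, canonical labels for none.
preds = {k: torch.rand(32, 1) for k in FACTOR_WEIGHTS}
targets = {k: torch.rand(32, 1) for k in FACTOR_WEIGHTS}
masks = {"flow": torch.ones(32, 1, dtype=torch.bool),
         "depth": torch.arange(32).unsqueeze(1) < 16,
         "canon": torch.zeros(32, 1, dtype=torch.bool)}
loss = total_loss(preds, targets, masks)
```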

5. Applications and Empirical Impact

Factored 4D representations have demonstrated state-of-the-art results in diverse benchmarks and domains:

| Method/Domain | Key Application | Reported Gains |
| --- | --- | --- |
| Any4D (Karhade et al., 11 Dec 2025) | Metric-scale multi-view scene flow, geometry | 2–3× lower EPE, 15× speedup vs. prior SOTA |
| 3D-4DGS (Oh et al., 19 May 2025) | Hybrid static/dynamic video rendering | 3–10× faster, 4–8× less memory, matched PSNR/SSIM |
| CPT-4DMR (Wu et al., 22 Sep 2025) | 4D-MRI, real-time adaptive radiotherapy | 15 min training (vs. 5 hours), 2× error reduction |
| LoRD (Jiang et al., 2022), H4D (Jiang et al., 2022) | Non-rigid human modeling, sparse 3D/2.5D input | >0.9 F-Score, robust to point cloud sparsity |
| CanonSeg4D (Gomes et al., 7 Nov 2025) | 4D panoptic segmentation of articulated objects | +17 points LSTQ (vs. Mask4Former), temporally coherent |

In physics, block-local 4D representations support scalable lattice Monte Carlo, with the domain partitioned into blocks carrying auxiliary boundary fields (Giusti et al., 2022).

Advantages observed:

  • Efficient hybridization of static/dynamic factors (3D-4DGS).
  • Unified handling of partial, mixed-modality, or cross-domain supervision (Any4D, CPT-4DMR).
  • Modular editability, temporal consistency, and dense correspondence.
  • Computational acceleration and memory reduction for large-scale models.

6. Methodological Variants and Considerations

Alternative strategies for factorization include:

  • Separate latent spaces for assets (static geometry) and dynamics (motion/interaction), sometimes using SVD or auto-encoder splits (Zhao et al., 22 Oct 2025); see the SVD sketch after this list.
  • Scene graph overlays and hierarchical decompositions for compositional and relational reasoning.
  • Full 4D MLPs or grid fields for unstructured, high-fidelity rendering at the expense of editability and scalability.
  • Canonical space mappings for category-agnostic pose normalization, e.g., CanonSeg4D.
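
As a concrete instance of the latent-split idea in the first bullet, the sketch below factors a matrix of point trajectories with an SVD into a spatial basis (asset-like factor) and temporal coefficients (dynamics-like factor); the matrix layout and rank are illustrative assumptions.

```python
import numpy as np

def factor_trajectories(trajectories, rank=8):
    """SVD-factor (N, T, 3) trajectories into a spatial basis and temporal coefficients.

    Returns (spatial_basis, temporal_coeffs) with shapes (N*3, rank) and (rank, T), so
    trajectories ~= (spatial_basis @ temporal_coeffs) reshaped back to (N, T, 3).
    """
    N, T, _ = trajectories.shape
    X = trajectories.transpose(0, 2, 1).reshape(N * 3, T)   # rows: (point, coord), cols: time
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]                # low-rank asset/dynamics split

# Toy usage: factor random trajectories and reconstruct an approximation.
rng = np.random.default_rng(4)
traj = rng.normal(size=(200, 30, 3))
spatial, temporal = factor_trajectories(traj, rank=8)
recon = (spatial @ temporal).reshape(200, 3, 30).transpose(0, 2, 1)
```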

Trade-offs:

  • Structured, part-based, or canonical representations excel at editability, temporal tracking, and generalization, but may limit geometric fidelity for complex fluid-like scenes.
  • Unstructured (implicit volumetric) representations achieve maximal appearance realism but require per-scene optimization and struggle with temporal consistency or relational computation.
  • Hybrid static/dynamic models balance representation cost and dynamic fidelity (3D-4DGS) (Oh et al., 19 May 2025), as sketched below.
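
As a toy illustration of the hybrid static/dynamic split, the sketch below partitions per-primitive trajectories by thresholding their temporal positional variance; this heuristic is assumed for illustration and is not claimed to be 3D-4DGS's actual criterion.

```python
import numpy as np

def split_static_dynamic(trajectories, threshold=1e-3):
    """Partition per-primitive trajectories into static and dynamic index sets.

    trajectories: (N, T, 3) positions of N primitives over T timesteps.
    A primitive is treated as static if its temporal positional variance is small.
    """
    var = trajectories.var(axis=1).sum(axis=-1)        # (N,) total positional variance
    static_idx = np.flatnonzero(var <= threshold)      # model once, reuse for all t
    dynamic_idx = np.flatnonzero(var > threshold)      # keep the full time-varying model
    return static_idx, dynamic_idx

# Toy usage: 1000 primitives over 30 frames, with only the first 100 actually moving.
rng = np.random.default_rng(3)
traj = np.repeat(rng.normal(size=(1000, 1, 3)), 30, axis=1)
traj[:100] += 0.05 * rng.normal(size=(100, 30, 3))
static_idx, dynamic_idx = split_static_dynamic(traj)
```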

7. Limitations and Open Challenges

Factored 4D representations face several challenges:

  • Data regime sensitivity: dense, multimodal, or category-specific data is often required, and generalization to unseen domains or categories remains limited (Zhao et al., 22 Oct 2025).
  • Physics and interaction: Many current models lack explicit force priors, physical constraints, or robust treatment of complex contacts and agent interactions.
  • Scalability: Full-resolution, temporally dense 4D fields can saturate both memory and compute unless aggressively factored or pruned (Oh et al., 19 May 2025, Karhade et al., 11 Dec 2025).
  • Ambiguity in partial observability: Canonical and factored reconstructions still struggle when input data is sparse, ambiguous, or noisy, though techniques like test-time auto-decoding (LoRD) alleviate this (Jiang et al., 2022).
  • Integration of unstructured and structured priors: Unified representations that combine graph structure, part hierarchy, and unstructured spatial fields remain an open research direction (Zhao et al., 22 Oct 2025).

Conclusion: Factored 4D representations offer a rigorous, modular framework for encoding dynamic scenes by explicitly separating geometry, motion, and interaction. Developments have led to improved accuracy, interpretability, efficiency, and applicability across physical simulation, dynamic perception, and scene understanding. Methodological diversity, robust benchmarking, and integration of strong priors continue to advance this foundational modeling paradigm.
