Deformable 3D Gaussians in Dynamic Scene Modeling

Updated 21 July 2025
  • Deformable 3D Gaussians are explicit 3D primitives that represent dynamic scenes with time-varying geometric and appearance attributes.
  • They leverage learned deformation fields, physical priors, and control structures to accurately model motion and enable efficient real-time rendering.
  • Applications include human avatar animation, medical imaging, robotics, and SLAM, offering improved fidelity and scalable performance in dynamic environments.

Deformable 3D Gaussians are an explicit, high-fidelity representation for dynamic 3D scenes in computer vision, graphics, and scientific visualization, in which the primitives (spatial 3D Gaussians) are endowed with time-varying geometric and/or appearance attributes. These methods, which include frameworks such as ParDy-Human, EndoGaussians, SurgicalGaussian, and SD-GS, enable real-time or near real-time rendering and efficient model adaptation, and support downstream applications such as view synthesis, semantically aware editing, and segmentation in dynamic, time-dependent environments. Deformation of the Gaussians is achieved through learned fields, physical priors, or control structures such as cages or anchor grids, providing explicit control over both motion and geometric/appearance variation across time.

1. Mathematical Representation and Deformation Fundamentals

Deformable 3D Gaussian frameworks model a scene as a collection of anisotropic Gaussians, each defined by a mean position $\mu \in \mathbb{R}^3$, a covariance $\Sigma \in \mathbb{R}^{3 \times 3}$ (often decomposed into a rotation $R$ and scaling $S$ as $\Sigma = R S S^\top R^\top$), an opacity $\sigma$, color or radiance features (e.g., spherical harmonics coefficients), and possibly label or segmentation features $c$.
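
As a concrete reference point, the following sketch builds $\Sigma = R S S^\top R^\top$ from the quaternion-plus-scale factorization most 3DGS implementations use; the variable names are illustrative, not drawn from any cited codebase.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(q, scales):
    """Sigma = R S S^T R^T with S = diag(scales)."""
    R = quat_to_rotmat(q)
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

# One anisotropic Gaussian: position, orientation, per-axis extent.
mu = np.array([0.1, -0.3, 2.0])
Sigma = covariance(np.array([1.0, 0.0, 0.0, 0.0]),  # identity rotation
                   np.array([0.05, 0.02, 0.01]))    # elongated along x
```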

Dynamic scenes are handled by introducing a deformation field $T(\cdot; t)$ (learned or parameterized) so that at time $t$, the canonical Gaussians are mapped to their deformed positions:

$$\mu_i(t) = \mu_i^{\text{canon}} + \Delta\mu_i(t)$$

Often not just the position but also the scale and rotation are updated via learned or analytical transformations (e.g., $R_i(t)$, $S_i(t)$).
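
A minimal sketch of such a deformation field, here a coordinate-time MLP that predicts only position offsets $\Delta\mu_i(t)$; real systems add positional/temporal encodings and typically also predict rotation and scale updates.

```python
import torch
import torch.nn as nn

class DeformField(nn.Module):
    """Toy coordinate-time deformation field T(x; t) -> position offset."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # Delta mu only in this sketch
        )

    def forward(self, mu_canon, t):
        # Condition each canonical center on the (broadcast) timestamp.
        t_col = t.expand(mu_canon.shape[0], 1)
        return self.net(torch.cat([mu_canon, t_col], dim=-1))

mu_canon = torch.randn(1000, 3)              # canonical Gaussian centers
t = torch.tensor([[0.25]])                   # normalized time
mu_t = mu_canon + DeformField()(mu_canon, t)  # deformed centers at time t
```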

Rendering is performed by projecting all deformed Gaussians to the image plane and compositing their contributions via alpha blending:

$$C(p) = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$

with each $\alpha_i$ proportional to the projected Gaussian density and opacity.
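
The compositing rule is easy to state in code. The sketch below evaluates $C(p)$ for one pixel given depth-sorted per-Gaussian colors and opacities; actual renderers implement this in a tile-based GPU rasterizer.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted contributions.

    colors: (N, 3) per-Gaussian RGB at this pixel; alphas: (N,) in [0, 1],
    already sorted front to back along the ray.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return (colors * (alphas * transmittance)[:, None]).sum(axis=0)

# Two overlapping splats: the front one dominates but lets some color through.
print(composite(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                np.array([0.6, 0.8])))
# -> [0.6, 0.0, 0.32]
```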

Deformations may be modeled as functions of coordinate–time (canonical-to-deformed mapping), per-Gaussian latent embeddings, group-based rigid motions, or physical fields that respect object or tissue structure.

2. Deformation Field Architectures and Control Mechanisms

Several architectural paradigms have been employed for learning or specifying deformation fields in deformable 3D Gaussian systems:

  • Coordinate-based Deformation Networks: Early approaches modeled deformation as MLPs taking $(x, y, z, t)$ as input and outputting offsets or transformations, but these can entangle static/dynamic regions and may propagate errors (Bae et al., 4 Apr 2024).
  • Per-Gaussian Embedding-based Fields: More recent systems assign a unique learnable embedding $z_i$ to each Gaussian and use it, along with a temporal embedding $z_t$, as input to a deformation network $\mathcal{F}_\theta(z_i, z_t)$ that outputs parameter updates. This increases spatial-temporal decoupling and allows for fine-grained, individualized motion (Bae et al., 4 Apr 2024); a minimal sketch follows this list.
  • Cage-based and Anchor-grid Deformations: Geometric control structures such as cages (Xie et al., 19 Nov 2024, Tong et al., 17 Apr 2025) or anchor grids (Yao et al., 10 Jul 2025) provide a means for structure-preserving, globally coherent deformation. Each Gaussian's position is updated via interpolation over the cage (e.g., using mean value coordinates), and the covariance is updated using the local Jacobian of the deformation:

    $$\Sigma' = J_f \, R S S^\top R^\top \, J_f^\top$$

  • Physical and Biological Priors: In scene types such as human avatars (Jung et al., 2023) or surgical tissues (Xie et al., 6 Jul 2024), deformation is guided by a parametric model (e.g., SMPL for humans) or constrained by local physical consistency (e.g., isometric or isoparametric losses for tissue deformation (Chen et al., 24 Jan 2024)).
  • Motion-guided and Temporal Attention: Systems such as MotionGS (Zhu et al., 10 Oct 2024) use motion priors—e.g., optical flow decoupled from camera motion—to explicitly guide dynamic deformation, while TimeFormer (Jiang et al., 18 Nov 2024) leverages transformer-based temporal attention to model and propagate complex motion patterns.
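
As referenced above, here is a minimal sketch of a per-Gaussian embedding deformation field in the spirit of $\mathcal{F}_\theta(z_i, z_t)$. The output heads (position offset, quaternion update, log-scale update) are a common but assumed choice, not the exact architecture of the cited work.

```python
import torch
import torch.nn as nn

class PerGaussianDeform(nn.Module):
    """Sketch of a per-Gaussian embedding deformation field F_theta(z_i, z_t).

    Each Gaussian owns a learnable code z_i; each timestep owns a code z_t.
    """
    def __init__(self, n_gaussians, n_frames, dim=32, hidden=128):
        super().__init__()
        self.z_gauss = nn.Embedding(n_gaussians, dim)
        self.z_time = nn.Embedding(n_frames, dim)
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # assumed heads: d_mu, d_rot, d_logscale
        )

    def forward(self, gauss_ids, frame_id):
        z_i = self.z_gauss(gauss_ids)               # (N, dim) per-Gaussian codes
        z_t = self.z_time(frame_id).expand_as(z_i)  # broadcast the time code
        out = self.net(torch.cat([z_i, z_t], dim=-1))
        return out.split([3, 4, 3], dim=-1)

field = PerGaussianDeform(n_gaussians=10_000, n_frames=300)
d_mu, d_rot, d_logs = field(torch.arange(10_000), torch.tensor([42]))
```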

3. Densification, Pruning, and Scalability Techniques

Dynamic and large-scale scenes require careful balancing between representational detail and efficiency:

  • Densification Strategies: Regions with complex motion or geometry are selectively populated with additional Gaussians or anchors using gradient-based criteria. Deformation-aware densification employs per-anchor weighted gradient magnitudes, with weights modulated by estimated deformation (position, scale, rotation change) to adaptively grow anchors in high-dynamic regions and suppress redundancy in static areas (Yao et al., 10 Jul 2025).
  • Pruning and Grouping: To avoid computational overload, several pruning criteria are applied:
    • Temporal Sensitivity Pruning: Quantifies each Gaussian's contribution to reconstruction error over all times; insensitive Gaussians are eliminated (Tu et al., 9 Jun 2025).
    • Annealing Smooth Pruning: Adds temporal noise during pruning score calculation to increase robustness against camera pose inaccuracies (Tu et al., 9 Jun 2025).
    • Motion Clustering (GroupFlow): Gaussians with similar motion trajectories are grouped, and a shared rigid transformation is applied per group rather than per Gaussian (Tu et al., 9 Jun 2025).

The combination of these strategies—selectively densifying in dynamic regions and aggressively pruning/reducing inference load elsewhere—enables high performance even in complex 4D scenes.
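
As a toy illustration of the temporal sensitivity criterion above, the sketch below scores each Gaussian by how much its ablation changes reconstruction error across timesteps and keeps only the most sensitive ones. The scoring here is an assumption for clarity; the method in (Tu et al., 9 Jun 2025) defines its own sensitivity measure.

```python
import numpy as np

def temporal_sensitivity_prune(errors_with, errors_without, keep_ratio=0.5):
    """Keep the Gaussians whose removal most affects error over all times.

    errors_with[t]: reconstruction error at time t with all Gaussians;
    errors_without[t, i]: error at time t with Gaussian i ablated.
    """
    # Aggregate each Gaussian's influence on error across every timestep.
    sensitivity = np.abs(errors_without - errors_with[:, None]).sum(axis=0)
    n_keep = int(keep_ratio * sensitivity.shape[0])
    return np.argsort(sensitivity)[-n_keep:]  # indices of Gaussians to keep

errors_with = np.random.rand(30)           # 30 timesteps (dummy data)
errors_without = np.random.rand(30, 500)   # 500 candidate Gaussians
keep_ids = temporal_sensitivity_prune(errors_with, errors_without)
```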

4. Applications and Contextual Implementations

Deformable 3D Gaussians have demonstrated utility across disparate domains:

  • Human Avatar Animation: ParDy-Human (Jung et al., 2023) combines SMPL-driven body pose deformation with a refinement network for detailed clothing motion, achieving high-fidelity re-posable avatars from limited monocular data.
  • Medical Reconstruction: EndoGaussians (Chen et al., 24 Jan 2024) and SurgicalGaussian (Xie et al., 6 Jul 2024) enable accurate, real-time reconstructions of deformable tissues using monocular images supplemented with depth or event data, explicit geometric initialization, and occlusion-aware training.
  • Occupancy Prediction and Robotics: GaussianFormer3D (Zhao et al., 15 May 2025) fuses LiDAR and camera data into a deformable Gaussian representation, refined by 3D deformable attention to accurately represent environment occupancy for autonomous driving.
  • Efficient SLAM and Mapping: Splat-SLAM (Sandström et al., 26 May 2024) utilizes active deformation to maintain map consistency under changing keyframe poses and depths, with global and local adjustments for robust visual SLAM.
  • Volumetric Segmentation and Tracking: VolSegGS (Yao et al., 16 Jul 2025) leverages deformable 3D Gaussians to embed, segment, and track volumetric scene features in a computationally scalable fashion, supporting real-time interactive analysis of simulation data.
  • Editing and Content Creation: Sketch- and cage-based editing systems (Xie et al., 19 Nov 2024, Tong et al., 17 Apr 2025) support intuitive deformation and animation of static 3DGS models, enabling re-shaping, re-targeting, and fine-grained artistic control.

5. Regularization, Training Schemes, and Temporal Consistency

Stability and high-quality reconstruction are achieved by a suite of regularization and training techniques:

  • Annealing and Smoothing: Training curricula that start with strong smoothing (e.g., via noise annealing or large-scale priors) and gradually anneal to finer details are common to prevent local minima and overfitting to pose inaccuracies (Yang et al., 2023, Tu et al., 9 Jun 2025).
  • Local Smoothness and Neighborhood Constraints: Losses penalizing inconsistent deformations among neighboring Gaussians (e.g., L1 distances in position or covariance before and after deformation, or affinity-based segment smoothing) ensure physical plausibility and avoid temporal artifacts (Xie et al., 6 Jul 2024, Xie et al., 19 Nov 2024, Yao et al., 16 Jul 2025).
  • Temporal Attention and Multiscale Modeling: Transformer-based modules (e.g., TimeFormer (Jiang et al., 18 Nov 2024)) and hierarchical architectures (Yao et al., 10 Jul 2025) enable the capture of complex, non-local motion patterns across time, benefiting scenes with rapid or irregular motion.
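
A minimal sketch of a neighborhood smoothness regularizer of the kind cited above: it penalizes, in L1, changes in the distances to each Gaussian's $k$ nearest canonical neighbors after deformation. The brute-force kNN and unit loss weight are simplifications; exact formulations vary across the cited works.

```python
import torch

def knn_rigidity_loss(mu_canon, mu_deformed, k=8):
    """L1 penalty on inconsistent deformations among neighboring Gaussians."""
    d_canon = torch.cdist(mu_canon, mu_canon)  # (N, N) pairwise distances
    # Indices of the k nearest canonical neighbors (skip self at index 0).
    knn = d_canon.topk(k + 1, largest=False).indices[:, 1:]
    d_def = torch.cdist(mu_deformed, mu_deformed)
    # Neighbor distances should be preserved under locally rigid motion.
    return (d_canon.gather(1, knn) - d_def.gather(1, knn)).abs().mean()

mu_canon = torch.randn(500, 3)
mu_def = mu_canon + 0.01 * torch.randn(500, 3)  # small perturbation
loss = knn_rigidity_loss(mu_canon, mu_def)
```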

6. Quantitative Performance and Empirical Evidence

Recent deformable 3D Gaussian frameworks achieve significant improvements in quantitative and qualitative metrics:

  • Fidelity: In benchmarks on realistic datasets (e.g., N3DV, NeRF-DS), frameworks such as SD-GS (Yao et al., 10 Jul 2025) outperform prior 4D Gaussian methods in PSNR, SSIM, and LPIPS, often with sharper reconstructions and fewer artifacts in high-motion regions.
  • Efficiency: Structured approaches (SD-GS) report model size reductions of 60% or more alongside roughly doubled FPS, and pruning and grouping yield up to $10\times$ rendering acceleration with minimal quality loss (Tu et al., 9 Jun 2025). Real-time or near real-time inference (e.g., >80 FPS) is standard on high-end desktop GPUs.
  • Application-specific Metrics: In autonomous driving (Zhao et al., 15 May 2025), occupancy IoU and detection performance match or exceed dense-grid and voxel-based benchmarks while reducing memory demands. In medical domains (Chen et al., 24 Jan 2024, Xie et al., 6 Jul 2024), high PSNR/SSIM and reliable tracking support practical clinical workflows.

7. Limitations and Future Directions

While deformable 3D Gaussians have advanced real-time, interpretable dynamic scene representation, open challenges remain:

  • Camera Pose Dependence: Methods relying on monocular or noisy pose estimation may suffer artifacts; explicit motion decoupling and pose refinement (Zhu et al., 10 Oct 2024) partially address this, but fully pose-free representations remain a topic for future work.
  • Occlusion and Event-aware Modeling: Integration of event camera data (Xu et al., 25 Nov 2024) and explicit occlusion modeling improves reconstruction of fast motion but requires careful joint threshold learning and dynamic-static decomposition to balance efficiency and accuracy.
  • Scalability and Interaction: As scene and application complexity grow (especially in long sequences or large-scale environments), further innovations in hierarchical, modular, and adaptive representations are anticipated to support both quality and resource efficiency.
  • Generalization and Interactivity: Increasing flexibility in input modality (e.g., supporting text, sketches, point clouds, or real-time user guidance), as demonstrated by cage- and sketch-based systems (Xie et al., 19 Nov 2024, Tong et al., 17 Apr 2025), points toward a broader role for deformable 3D Gaussians in content creation, editing, and interactive visualization.

Collectively, deformable 3D Gaussians have established themselves as a versatile, efficient, and interpretable approach for modeling and rendering dynamic 3D scenes, with a growing set of architectures and algorithms supporting applications across graphics, medical imaging, robotics, and beyond.