
Embedding-Based Per-Gaussian Deformation

Updated 29 June 2026
  • The paper introduces embedding-based per-Gaussian deformation, enabling dynamic 3D scene modeling through per-primitive latent codes that predict precise deformation updates.
  • It utilizes learnable embeddings to drive local, non-rigid motion, outperforming coordinate-based methods in efficiency and reconstruction quality for applications like novel-view synthesis and SLAM.
  • Architectural variants, including Transformer-based pipelines and MLP-driven deformers, integrate temporal, semantic, and spatial signals to achieve high-quality dynamic 3D reconstructions.

Embedding-based per-Gaussian deformation refers to a family of techniques for dynamic 3D scene modeling, where each 3D Gaussian primitive is augmented with a learnable embedding vector. This embedding parameterizes how each Gaussian deforms over time, in response to frame-dependent signals such as temporal embeddings, audio features, semantic context, or user input. Unlike global or coordinate-based deformation schemes, embedding-based methods allocate local, per-primitive latent codes that drive fine-grained non-rigid motion, enable efficient optimization, and support rapid novel-view or video synthesis. The paradigm is foundational in state-of-the-art pipelines for dynamic Gaussian Splatting (3DGS/4DGS), talking head synthesis, dynamic SLAM, and deformable mesh editing.

1. Mathematical Formulation and Core Components

At the core, each 3D Gaussian primitive is defined as

$$G_i(x) = w_i\,\exp\Bigl(-\tfrac12 (x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\Bigr)$$

where $\mu_i\in\mathbb{R}^3$ is the center, $\Sigma_i\in\mathbb{R}^{3\times3}$ is the covariance (often represented as $R_i\,\mathrm{diag}(s_i^2)\,R_i^\top$ for rotation $R_i$ and scales $s_i$), $w_i$ is an opacity or contribution weight, and additional attributes include spherical harmonic color coefficients and geometric/material tags.
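As a minimal sketch of the definition above, the following evaluates a single Gaussian using the factored covariance $R\,\mathrm{diag}(s^2)\,R^\top$. The function name `gaussian_value` and the nested-list matrix layout are illustrative assumptions, not part of any cited method.

```python
import math

def gaussian_value(x, mu, rot, scales, weight):
    """Evaluate G(x) = w * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu)).

    Sigma = R diag(s^2) R^T, hence Sigma^{-1} = R diag(1 / s^2) R^T.
    `rot` is a 3x3 rotation matrix given as a list of rows (assumed layout).
    """
    d = [xi - mi for xi, mi in zip(x, mu)]
    # Express the offset in the Gaussian's local frame: local = R^T d.
    local = [sum(rot[r][c] * d[r] for r in range(3)) for c in range(3)]
    # Squared Mahalanobis distance with axis-aligned scales in the local frame.
    mahal = sum((l / s) ** 2 for l, s in zip(local, scales))
    return weight * math.exp(-0.5 * mahal)

# Usage: a Gaussian rotated 90 degrees about z, so its long axis lies along world y.
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
value = gaussian_value([0.0, 2.0, 0.0], [0.0, 0.0, 0.0], rot_z90, [2.0, 1.0, 1.0], 1.0)
```

Working in the local frame avoids forming or inverting the 3x3 covariance explicitly, which is one reason the rotation-and-scale factorization is convenient.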

Each Gaussian is assigned a learnable embedding $e_i\in\mathbb{R}^d$. This vector acts as a local latent code, encoding semantic, geometric, or appearance priors and serving as input to an MLP or other neural decoder that predicts deformation updates for positions ($\Delta\mu_i$), scales ($\Delta s_i$), rotation ($\Delta R_i$), and sometimes opacity and color coefficients.

The generic update rule, applied either per-frame or per-timestep $t$, is

$$\theta_i(t) = \theta_i + F_\phi(e_i, c_t)$$

with $\theta_i = (\mu_i, s_i, R_i, \dots)$ bundling all attributes and $F_\phi$ a neural network mapping embeddings and temporal/contextual signals $c_t$ to attribute deltas. Some methods decompose this as coarse + fine temporal components (Bae et al., 2024), or fuse context, time, and deformation signals as described in the relevant sections.
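The update rule can be illustrated with a small dependency-free sketch: a two-layer MLP takes a per-Gaussian embedding and a time signal and returns attribute deltas that are added to the canonical attributes. Everything concrete here is an assumption for illustration: the sinusoidal `time_encoding`, the layer sizes, the output split, and the names `deform` and `init_layer`. The cited methods differ in these details.

```python
import math
import random

def linear(weights, bias, vec):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng, scale=0.1):
    """Random weights, zero biases (hypothetical initialization)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def time_encoding(t, n_freqs=2):
    """Hypothetical sinusoidal encoding of a scalar time t in [0, 1]."""
    out = []
    for k in range(n_freqs):
        out += [math.sin(2 ** k * math.pi * t), math.cos(2 ** k * math.pi * t)]
    return out

def deform(embedding, t, params):
    """Map (per-Gaussian embedding, time signal) to attribute deltas."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(w1, b1, embedding + time_encoding(t))]
    out = linear(w2, b2, hidden)
    # Split the 10 outputs into position (3), scale (3), and quaternion (4) deltas.
    return {"d_mu": out[0:3], "d_scale": out[3:6], "d_rot": out[6:10]}

rng = random.Random(0)
emb_dim, hidden_dim = 8, 16
params = (init_layer(emb_dim + 4, hidden_dim, rng), init_layer(hidden_dim, 10, rng))
embedding = [rng.gauss(0.0, 1.0) for _ in range(emb_dim)]  # learnable in practice
canonical_mu = [0.0, 1.0, 2.0]

delta = deform(embedding, t=0.25, params=params)
mu_t = [m + d for m, d in zip(canonical_mu, delta["d_mu"])]  # additive update
```

Because the network is shared across Gaussians and only the embedding differs per primitive, the per-Gaussian cost is the embedding itself rather than a separate network.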

2. Architectural Realizations

Transformer-based Pipelines (LeanGaussian/DIG3D)

Each Gaussian corresponds to a query embedding $q_i$ allocated independently of input pixels or point clouds, breaking one-to-one correspondences (Wu et al., 2024). These queries are refined iteratively through a multi-layer deformable Transformer decoder, which performs:

  • Self-attention on the query set $\{q_i\}$.
  • Deformable cross-attention: each $q_i$ attends to image features at the projection of its current center $\mu_i$ plus learned 2D offsets.
  • A linear "splat-head" network that maps updated queries to parameter increments $\Delta\theta_i$.
  • Quaternion rotation updates for orientation, additive updates for other attributes.
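The last step in the list, quaternion updates for orientation and additive updates elsewhere, can be sketched as follows. Composing the increment by left-multiplication and renormalizing is an assumed convention, and the dictionary layout and the name `apply_increment` are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def normalize(q):
    """Rescale a quaternion to unit length so it stays a valid rotation."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def apply_increment(gaussian, inc):
    """One refinement step: additive for mu/scale/opacity, composed for rotation."""
    return {
        "mu": [m + d for m, d in zip(gaussian["mu"], inc["d_mu"])],
        "scale": [s + d for s, d in zip(gaussian["scale"], inc["d_scale"])],
        "opacity": gaussian["opacity"] + inc["d_opacity"],
        # Assumed convention: the increment rotates the current orientation.
        "rot": normalize(quat_mul(normalize(inc["d_rot"]), gaussian["rot"])),
    }

# Usage: two successive 90-degree turns about z give a 180-degree turn.
h = math.sqrt(0.5)
g0 = {"mu": [0.0, 0.0, 0.0], "scale": [1.0, 1.0, 1.0], "opacity": 0.5,
      "rot": (1.0, 0.0, 0.0, 0.0)}
inc = {"d_mu": [0.1, 0.0, -0.1], "d_scale": [0.0, 0.5, 0.0], "d_opacity": 0.25,
       "d_rot": (h, 0.0, 0.0, h)}
g2 = apply_increment(apply_increment(g0, inc), inc)
```

Rotations are composed rather than added because the sum of two unit quaternions is generally not a unit quaternion.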

This architecture enables efficient single-image to 3DGS reconstruction and high-speed novel-view rendering (e.g., 500–588 FPS for 10k Gaussians) (Wu et al., 2024).

Dynamic Scene Pipelines: Per-Gaussian Deformation Field via Learnable Embeddings

4D Gaussian Splatting

In frameworks for dynamic scene synthesis (Bae et al., 2024, Jiao et al., 21 Mar 2026), a canonical set of $N$ Gaussians is assigned per-Gaussian embeddings $\{e_i\}_{i=1}^{N}$. The deformation at time $t$ is predicted by MLPs fed a combination of the per-Gaussian embedding $e_i$ and a temporal embedding $z_t$:

$$(\Delta\mu_i, \Delta s_i, \Delta R_i, \dots) = F_\phi\bigl([e_i;\, z_t]\bigr)$$

Advanced realizations employ hierarchical temporal embeddings, coarse and fine, fused via Hadamard product for early, robust compositional modeling of slow and fast deformations (Jiao et al., 21 Mar 2026):

$$z_t = z_t^{\mathrm{coarse}} \odot z_t^{\mathrm{fine}}$$

The concatenated $[e_i;\, z_t]$ vector feeds the deformation network to yield parameter deltas for each attribute.
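A minimal sketch of the coarse/fine temporal embedding described above: two learnable grids of different temporal resolution are interpolated at time `t`, fused element-wise (Hadamard product), and concatenated with the per-Gaussian embedding. The evenly spaced keyframe layout, linear interpolation, and all function names are assumptions for illustration; the cited papers may use different grids and interpolation.

```python
def interp_embedding(grid, t):
    """Linearly interpolate a temporal grid (list of vectors, at least 2) at t in [0, 1]."""
    pos = t * (len(grid) - 1)
    lo = min(int(pos), len(grid) - 2)  # clamp so lo + 1 is a valid index at t = 1
    frac = pos - lo
    return [(1 - frac) * a + frac * b for a, b in zip(grid[lo], grid[lo + 1])]

def fused_time_embedding(coarse_grid, fine_grid, t):
    """Hadamard (element-wise) product of coarse and fine temporal embeddings."""
    coarse = interp_embedding(coarse_grid, t)
    fine = interp_embedding(fine_grid, t)
    return [c * f for c, f in zip(coarse, fine)]

def deformation_input(embedding, coarse_grid, fine_grid, t):
    """Concatenate the per-Gaussian embedding with the fused temporal embedding."""
    return list(embedding) + fused_time_embedding(coarse_grid, fine_grid, t)

# Usage: a 2-keyframe coarse grid and a 3-keyframe fine grid, both of width 2.
coarse_grid = [[1.0, 2.0], [3.0, 4.0]]
fine_grid = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
features = deformation_input([0.5], coarse_grid, fine_grid, t=0.5)
```

The coarse grid changes slowly with `t` and the fine grid quickly, so their product can represent slow drift modulated by fast local motion.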

Other systems further decompose deformation into context, temporal, and semantic embeddings, or augment the dynamic field with context from a foundation model (e.g., LSeg for per-Gaussian semantic context in autonomous driving scenes (Song et al., 9 Mar 2025)).

Rate-Distortion Optimized Compression

TED-4DGS employs a learned per-anchor temporal feature and a global temporal codebook, combined bilinearly into a single feature that an MLP decodes into per-anchor attribute updates. Temporal activation masks further control appearance and disappearance, enabling an aggressive rate-distortion tradeoff for dynamic compression (Ho et al., 5 Dec 2025).

Talking Face Synthesis and Expression Modeling

Methods such as EmbedTalk and DEGSTalk allocate per-Gaussian embeddings, which are mapped, possibly after positional encoding, alongside audio and implicit expression/3DMM features, through a small shared MLP. The MLP predicts displacements (and sometimes opacity) per frame (Saggar et al., 8 Mar 2026, Deng et al., 2024). The deformation can be restricted to select attributes (e.g., position and opacity), leaving scale and color fixed, to target plausible facial animation.

3. Learning, Regularization, and Training Objectives

Per-Gaussian embeddings are randomly initialized and jointly optimized with the canonical field and deformation network parameters via photometric, perceptual, and regularization losses over rendered and ground truth images. Losses typically include:

  • $L_1$ and LPIPS for photo-consistency.
  • D-SSIM for structural similarity.
  • Regularizers on covariance scales or opacities to avoid degenerate collapse or floaters.
  • Local smoothness constraints: often a KNN-based weighted $\ell_2$ penalty between neighboring embeddings:

$$\mathcal{L}_{\mathrm{smooth}} = \sum_{i}\sum_{j\in\mathrm{KNN}(i)} w_{ij}\,\lVert e_i - e_j\rVert_2^2$$

where weights $w_{ij}$ decay with spatial distance (Saggar et al., 8 Mar 2026, Jiao et al., 21 Mar 2026).

  • In some SLAM/medical applications, a per-Gaussian deformation probability $p_i$ is learned self-supervised, mediating rigid vs. deformable updates (Shan et al., 19 Feb 2026).
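The KNN-based smoothness penalty from the list above can be sketched as below. The Gaussian kernel used for the distance-decaying weights, the brute-force neighbor search, and the names `knn_indices` and `smoothness_loss` are assumptions for illustration; the cited works only state that the weights decay with spatial distance.

```python
import math

def knn_indices(centers, i, k):
    """Indices of the k nearest neighbors of Gaussian i (brute force)."""
    dists = [(math.dist(centers[i], c), j) for j, c in enumerate(centers) if j != i]
    return [j for _, j in sorted(dists)[:k]]

def smoothness_loss(centers, embeddings, k=2, bandwidth=1.0):
    """Sum over i and j in KNN(i) of w_ij * ||e_i - e_j||^2.

    w_ij = exp(-bandwidth * ||mu_i - mu_j||^2) is an assumed kernel.
    """
    total = 0.0
    for i in range(len(centers)):
        for j in knn_indices(centers, i, k):
            w_ij = math.exp(-bandwidth * math.dist(centers[i], centers[j]) ** 2)
            diff = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += w_ij * diff
    return total

# Usage: three Gaussians on a line; the first has a different embedding.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
embeddings = [[0.0], [1.0], [1.0]]
loss = smoothness_loss(centers, embeddings, k=1)
```

In a real pipeline the neighbor search would use a spatial index and the loss would be computed on a differentiable tensor library; the plain-Python version only shows the structure of the sum.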

Joint training can involve staged optimization—for instance, freezing temporal embeddings initially or using progressive training to stabilize static geometry before deformation learning (Ho et al., 5 Dec 2025).
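One way to express such a staged schedule is a function that reports which parameter groups receive gradient updates at a given step. This is a sketch under stated assumptions: the group names and the `warmup_steps` value are hypothetical, and the cited methods may freeze different subsets.

```python
def trainable_groups(step, warmup_steps=3000):
    """Return which parameter groups are optimized at a training step.

    During warm-up only the canonical (static) Gaussians are trained, so the
    geometry stabilizes before any deformation is learned. `warmup_steps` is
    a hypothetical value, not taken from any cited paper.
    """
    deformation_on = step >= warmup_steps
    return {
        "canonical_gaussians": True,
        "per_gaussian_embeddings": deformation_on,
        "temporal_embeddings": deformation_on,
        "deformation_mlp": deformation_on,
    }

# Usage: query the schedule before and after the warm-up boundary.
early = trainable_groups(step=100)
late = trainable_groups(step=5000)
```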

4. Applications and Experimental Findings

Static and Novel-view Synthesis

LeanGaussian and DIG3D achieve competitive or superior PSNR and SSIM compared to point/pixel-correspondence approaches, with high inference and rendering speed and substantial resource efficiency (Wu et al., 2024).

Dynamic Scene Reconstruction

Per-Gaussian embedding deformation achieves state-of-the-art results for dynamic scene modeling in 3DGS/4DGS. On benchmarks such as Neural 3D Video, Technicolor Light Field, and HyperNeRF, methods such as that of Bae et al. (2024) consistently outperform coordinate-based deformation fields in terms of PSNR, SSIM, and LPIPS, especially in scenes with fine object or region-specific motion.

CoDa-4DGS demonstrates superior dynamic scene reconstruction in autonomous driving, with sharper object boundaries and higher temporal consistency due to contextual and deformation embeddings (Song et al., 9 Mar 2025).

TED-4DGS demonstrates that embedding-based deformation is compatible with high-efficiency rate-distortion optimization, producing compact, flexible dynamic models at the cost of some extra parameters (Ho et al., 5 Dec 2025).

Talking Head Synthesis and Fine-scale Facial Editing

EmbedTalk replaces tri-plane volumetric fields with a small per-Gaussian embedding, halving model size and doubling inference speed while improving PSNR, SSIM, LPIPS, and lip landmark distance (LMD) (Saggar et al., 8 Mar 2026). DEGSTalk extends this to long-hair talking face synthesis with per-Gaussian embeddings plus tri-plane local coding, achieving high PSNR, low LPIPS, and realistic dynamic hair motions (Deng et al., 2024).

SLAM and Medical Imaging

NRGS-SLAM introduces Bayesian self-supervision to learn a per-Gaussian deformation probability and spatio-temporally smooth embedding-based updates, yielding robust monocular non-rigid SLAM for endoscopic scenes (Shan et al., 19 Feb 2026). This achieves substantial improvements in camera pose estimation accuracy and photo-realistic reconstructions over competing methods.

Mesh-based and Editable Deformation

Mesh-based Gaussian splatting frameworks anchor splats to explicit mesh locations, using per-Gaussian SE(3)-and-scale transforms for local and articulated deformation (B, 9 Jul 2025). Embedding-based optimization coupled with explicit physics (XPBD) supports interactive design, VR, and complex non-rigid edits at high frame rates.

5. Methodological Variants and Design Choices

| Design Dimension | Example Implementation | Key Significance |
| --- | --- | --- |
| Embedding dimension | Method-specific $d$ (Saggar et al., 8 Mar 2026, Bae et al., 2024; Jiao et al., 21 Mar 2026) | Controls deformation capacity and model compactness |
| Temporal embedding | Grid/interpolated (Bae et al., 2024, Ho et al., 5 Dec 2025, Jiao et al., 21 Mar 2026) | Multiscale, coarse-to-fine modeling |
| Deformation MLP input | Embedding, temporal, context, and audio features | Flexibly conditions deformation on various signals |
| Regularization | KNN smoothness, opacity, TV (Jiao et al., 21 Mar 2026, Ho et al., 5 Dec 2025) | Prevents overfitting and floaters |
| Optimization schedule | Freezing of deformation/temporal for initial epochs | Stabilizes early-stage learning |
| Deformation targets | All attributes or subset (position, opacity, color) | Avoids implausible deformation artifacts |

Local embedding regularization and careful initialization are critical for preventing isolated artifacts ("floaters"), drift in dynamic scenes, and instability in sparse initializations. Coarse-to-fine temporal coding captures both high-frequency motion and global drift, as shown in both FRoG and Bae et al. (2024). Mesh-anchored embedding further aids structural and physics-aware deformations (B, 9 Jul 2025).

6. Performance Metrics and Empirical Results

Across settings, embedding-based per-Gaussian deformation yields state-of-the-art quantitative results:

  • Novel-view synthesis: LeanGaussian reports a PSNR gain over previous methods (PSNR 25.44/22.36 on ShapeNet/Google Scanned Objects), and near-real-time speed (Wu et al., 2024).
  • Dynamic video: +0.23 to +3.76 dB PSNR gain and an LPIPS reduction of 0.014 to 0.073 versus coordinate-based dynamic 4DGS (Bae et al., 2024).
  • Talking heads: EmbedTalk achieves PSNR=35.19 (vs. 32.5–33.2), LPIPS=0.021 (vs. 0.034), at 61 FPS mobile inference (Saggar et al., 8 Mar 2026).
  • SLAM: NRGS-SLAM reduces pose RMSE by up to 50% (Shan et al., 19 Feb 2026).
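To help read the PSNR figures in the list above, here is a minimal sketch of the standard PSNR definition and of how a dB gain translates into a ratio of mean squared errors. The flat-list image representation and the helper name `mse_ratio_for_gain` are illustrative assumptions.

```python
import math

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR improves by `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)

# Usage: a uniform per-pixel error of 0.1 on a [0, 1] image.
example = psnr([0.5, 0.5], [0.4, 0.6])
small_gain = mse_ratio_for_gain(0.23)
large_gain = mse_ratio_for_gain(3.76)
```

Under this definition, the reported +0.23 dB and +3.76 dB gains correspond to the mean squared error being multiplied by about 0.95 and about 0.42, respectively.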

Ablation studies across works validate that removing per-Gaussian embeddings, discarding embedding regularization, or restricting deformation to only global or coordinate-based fields substantially degrades both fidelity and motion realism (e.g., PSNR drops by 0.5 dB, LMD and SyncNet metrics degrade in talking head applications).

7. Context, Significance, and Outlook

Embedding-based per-Gaussian deformation fundamentally decouples data-driven, local motion modeling from global field parameterization. It enables not only high-fidelity rendering and efficient encoding but also intuitive, controllable, and extensible pipelines for dynamic 3DGS across domains. By leveraging compact, learnable vectors to partition local deformation capacity, these methods outperform tri-plane, coordinate-based, or purely global deformation techniques, particularly in scenes with significant non-rigid or local motion. The paradigm is extensible to articulated object editing, compressed transmission, medical SLAM, and real-time content creation.

Current research directions include hierarchical or multi-scale embedding schemes, context adaptive and cross-modal embedding fusion (e.g., combining semantic, audio, and physical priors), and optimal embedding allocation for rate-distortion tradeoff (Ho et al., 5 Dec 2025). Challenges remain in scaling to ultra-high-resolution scenes, balancing interpretability and generalization, and coupling with explicit constraints or physics, but embedding-based per-Gaussian deformation is established as the leading approach for complex, high-performance dynamic 3D scene representation in the Gaussian Splatting literature (Wu et al., 2024, Bae et al., 2024, Saggar et al., 8 Mar 2026, Song et al., 9 Mar 2025, Ho et al., 5 Dec 2025, Deng et al., 2024, Jiao et al., 21 Mar 2026, Shan et al., 19 Feb 2026, B, 9 Jul 2025).
