Embedding-Based Per-Gaussian Deformation

Updated 29 June 2026

The paper introduces embedding-based per-Gaussian deformation, enabling dynamic 3D scene modeling through per-primitive latent codes that predict precise deformation updates.
It utilizes learnable embeddings to drive local, non-rigid motion, outperforming coordinate-based methods in efficiency and reconstruction quality for applications like novel-view synthesis and SLAM.
Architectural variants, including Transformer-based pipelines and MLP-driven deformers, integrate temporal, semantic, and spatial signals to achieve high-quality dynamic 3D reconstructions.

Embedding-based per-Gaussian deformation refers to a family of techniques for dynamic 3D scene modeling, where each 3D Gaussian primitive is augmented with a learnable embedding vector. This embedding parameterizes how each Gaussian deforms over time, in response to frame-dependent signals such as temporal embeddings, audio features, semantic context, or user input. Unlike global or coordinate-based deformation schemes, embedding-based methods allocate local, per-primitive latent codes that drive fine-grained non-rigid motion, enable efficient optimization, and support rapid novel-view or video synthesis. The paradigm is foundational in state-of-the-art pipelines for dynamic Gaussian Splatting (3DGS/4DGS), talking head synthesis, dynamic SLAM, and deformable mesh editing.

1. Mathematical Formulation and Core Components

At the core, each 3D Gaussian primitive is defined as

$G_i(x) = w_i\;\exp\Bigl(-\tfrac12(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\Bigr)$

where $\mu_i\in\mathbb{R}^3$ is the center, $\Sigma_i\in\mathbb{R}^{3\times3}$ is the covariance (often represented as $R_i\,\mathrm{diag}(s_i^2)\,R_i^\top$ for rotation $R_i$ and scales $s_i$), and $w_i$ is an opacity or contribution weight. Additional attributes include spherical harmonic color coefficients and geometric/material tags.
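As a minimal sketch of the definition above, the following NumPy snippet evaluates a single Gaussian with the factored covariance $R_i\,\mathrm{diag}(s_i^2)\,R_i^\top$. Storing the rotation as a unit quaternion in (w, x, y, z) order is an assumption made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_value(x, mu, quat, scales, weight):
    """Evaluate G_i(x) = w_i * exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    R = quat_to_rotmat(quat)
    cov = R @ np.diag(scales ** 2) @ R.T   # Sigma = R diag(s^2) R^T
    d = x - mu
    # Solve Sigma y = d instead of forming the explicit inverse.
    return weight * np.exp(-0.5 * d @ np.linalg.solve(cov, d))
```

At the center the value equals the weight $w_i$, and one scale length along a principal axis multiplies it by $e^{-1/2}$.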

Each Gaussian is assigned a learnable embedding $e_i\in\mathbb{R}^d$. This vector acts as a local latent code, encoding semantic, geometric, or appearance priors and serving as input to an MLP or other neural decoder that predicts deformation updates for positions ($\Delta\mu_i$), scales ($\Delta s_i$), rotation ($\Delta R_i$), and sometimes opacity and color coefficients.

The generic update rule, applied either per-frame or per-timestep $t$, is

$\theta_i^{(t)} = \theta_i + \Delta\theta_i^{(t)}, \qquad \Delta\theta_i^{(t)} = f_\phi\bigl(e_i, c_t\bigr)$

with $\theta_i = (\mu_i, s_i, R_i, \dots)$ bundling all attributes and $f_\phi$ a neural network mapping embeddings and temporal/contextual signals $c_t$ to attribute deltas. Some methods decompose this as coarse + fine temporal components (Bae et al., 2024), or fuse context, time, and deformation signals as described in the relevant sections.
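The update described above can be sketched with a small shared MLP. This is an illustration under stated assumptions, not any cited method's architecture: the layer sizes, the sinusoidal time features, the single hidden layer, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_EMB, D_TIME, D_HID = 4, 8, 4, 16   # hypothetical sizes
D_OUT = 3 + 3 + 4                       # deltas for position, scale, quaternion

mu = rng.normal(size=(N, 3))            # canonical centers
embeddings = rng.normal(size=(N, D_EMB))  # per-Gaussian latent codes e_i

# Shared deformation MLP weights (one hidden layer, assumed for brevity).
W1 = rng.normal(scale=0.1, size=(D_EMB + D_TIME, D_HID))
b1 = np.zeros(D_HID)
W2 = rng.normal(scale=0.1, size=(D_HID, D_OUT))
b2 = np.zeros(D_OUT)

def time_features(t, n_freq=D_TIME // 2):
    """Sinusoidal encoding of a scalar time t (an assumed temporal signal)."""
    freqs = 2.0 ** np.arange(n_freq)
    return np.concatenate([np.sin(np.pi * freqs * t), np.cos(np.pi * freqs * t)])

def deform(emb, t):
    """Map (e_i, time signal) to per-Gaussian attribute deltas."""
    c = np.broadcast_to(time_features(t), (emb.shape[0], D_TIME))
    h = np.maximum(np.concatenate([emb, c], axis=1) @ W1 + b1, 0.0)  # ReLU
    out = h @ W2 + b2
    return out[:, :3], out[:, 3:6], out[:, 6:]

d_mu, d_scale, d_quat = deform(embeddings, t=0.25)
mu_t = mu + d_mu                        # deformed centers at time t
```

Because the MLP is shared, two Gaussians move differently only through their embeddings, which is what makes the deformation per-primitive.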

2. Architectural Realizations

Transformer-based Pipelines (LeanGaussian/DIG3D)

Each Gaussian corresponds to a query embedding $q_i$ allocated independently of input pixels or point clouds, breaking one-to-one correspondences (Wu et al., 2024). These queries are refined iteratively through a multi-layer deformable Transformer decoder, which performs:

Self-attention on the query set $\{q_i\}$.
Deformable cross-attention: each $q_i$ attends to image features at the projection of its current center $\mu_i$ plus learned 2D offsets.
A linear "splat-head" projection that maps updated queries to parameter increments $(\Delta\mu_i, \Delta s_i, \Delta R_i, \dots)$.
Quaternion rotation updates for orientation and additive updates for other attributes.

This architecture enables efficient single-image to 3DGS reconstruction and high-speed novel-view rendering (e.g., 500–588 FPS for 10k Gaussians) (Wu et al., 2024).
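The last refinement step, quaternion updates for orientation and additive updates elsewhere, might look like the sketch below. Composing the increment by left-multiplication and renormalizing is an assumption; the source does not specify the composition order, and the function names are hypothetical.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions a and b in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def apply_increment(mu, scale, quat, d_mu, d_scale, d_quat):
    """Additive update for center and scale, multiplicative for rotation."""
    new_mu = mu + d_mu
    new_scale = scale + d_scale
    dq = d_quat / np.linalg.norm(d_quat)   # keep the increment a unit quaternion
    new_quat = quat_mul(dq, quat)          # assumed order: increment applied on the left
    new_quat /= np.linalg.norm(new_quat)   # guard against numerical drift
    return new_mu, new_scale, new_quat
```

Updating rotations multiplicatively keeps them on the unit-quaternion manifold, which a plain additive update would not guarantee.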

Dynamic Scene Pipelines: Per-Gaussian Deformation Field via Learnable Embeddings

4D Gaussian Splatting

In frameworks for dynamic scene synthesis (Bae et al., 2024, Jiao et al., 21 Mar 2026), a canonical set of $N$ Gaussians is assigned per-Gaussian embeddings $\{e_i\}_{i=1}^{N}$. The deformation at time $t$ is predicted by MLPs fed a combination of the per-Gaussian embedding $e_i$ and a temporal embedding $z_t$:

$(\Delta\mu_i, \Delta s_i, \Delta R_i, \dots) = f_\phi(e_i, z_t)$

Advanced realizations employ hierarchical temporal embeddings, coarse and fine, fused via Hadamard product for early, robust compositional modeling of slow and fast deformations (Jiao et al., 21 Mar 2026):

$z_t = z_t^{\mathrm{coarse}} \odot z_t^{\mathrm{fine}}$

The concatenated $[e_i;\, z_t]$ vector feeds the deformation network to yield parameter deltas for each attribute.
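A minimal sketch of the coarse/fine fusion, assuming both temporal embeddings are stored as learnable grids sampled by linear interpolation. The grid resolutions, dimensions, and names are hypothetical and are not taken from the cited papers.

```python
import numpy as np

def interp_grid(grid, t):
    """Linearly interpolate a (T, D) grid of learnable vectors at t in [0, 1]."""
    pos = t * (grid.shape[0] - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, grid.shape[0] - 1)
    a = pos - lo
    return (1.0 - a) * grid[lo] + a * grid[hi]

rng = np.random.default_rng(0)
D_T = 6
coarse_grid = rng.normal(size=(5, D_T))    # few time bins: slow motion
fine_grid = rng.normal(size=(50, D_T))     # many time bins: fast motion

def temporal_embedding(t):
    """Fuse coarse and fine embeddings with a Hadamard (elementwise) product."""
    return interp_grid(coarse_grid, t) * interp_grid(fine_grid, t)

def deformation_input(e_i, t):
    """Concatenate the per-Gaussian embedding with the fused temporal embedding."""
    return np.concatenate([e_i, temporal_embedding(t)])

x = deformation_input(np.zeros(8), t=0.3)  # input vector for the deformation MLP
```

The coarse grid changes slowly over the sequence while the fine grid can vary frame to frame, so the product lets the fine code modulate the coarse one.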

Other systems further decompose deformation into context, temporal, and semantic embeddings, or augment the dynamic field with context from a foundation model (e.g., LSeg for per-Gaussian semantic context in autonomous driving scenes (Song et al., 9 Mar 2025)).

Rate-Distortion Optimized Compression

TED-4DGS employs a learned per-anchor temporal feature and a global temporal codebook, which are combined bilinearly and decoded via an MLP into per-anchor attribute updates. Temporal activation masks further control appearance and disappearance, enabling an aggressive rate-distortion tradeoff for dynamic compression (Ho et al., 5 Dec 2025).

Talking Face Synthesis and Expression Modeling

Methods such as EmbedTalk and DEGSTalk allocate per-Gaussian embeddings, which are mapped, possibly after positional encoding, alongside audio and implicit expression/3DMM features, through a small shared MLP. The MLP predicts displacements (and sometimes opacity) per frame (Saggar et al., 8 Mar 2026, Deng et al., 2024). The deformation can be restricted to select attributes (e.g., position and opacity), leaving scale and color fixed, to target plausible facial animation.
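The input assembly and the restricted update described above can be sketched as follows. The frequency count, feature sizes, opacity clipping, and function names are assumptions made for illustration and do not reproduce EmbedTalk or DEGSTalk.

```python
import numpy as np

def positional_encoding(v, n_freq=2):
    """Return [v, sin(2^k v), cos(2^k v)] for k = 0..n_freq-1 along the last axis."""
    parts = [v]
    for k in range(n_freq):
        parts += [np.sin(2.0 ** k * v), np.cos(2.0 ** k * v)]
    return np.concatenate(parts, axis=-1)

def mlp_input(embeddings, audio_feat, expr_feat):
    """Per-Gaussian rows: encoded embedding plus the frame-level audio/expression features."""
    n = embeddings.shape[0]
    frame = np.concatenate([audio_feat, expr_feat])      # shared by all Gaussians
    return np.concatenate(
        [positional_encoding(embeddings), np.broadcast_to(frame, (n, frame.size))],
        axis=1,
    )

def apply_restricted(mu, opacity, d_mu, d_opacity):
    """Update only position and opacity; scale and color stay at canonical values."""
    return mu + d_mu, np.clip(opacity + d_opacity, 0.0, 1.0)

inp = mlp_input(np.zeros((5, 4)), np.ones(3), np.ones(2))
```

Leaving scale and color untouched shrinks the output head of the MLP and removes degrees of freedom that could otherwise produce implausible facial artifacts.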

3. Learning, Regularization, and Training Objectives

Per-Gaussian embeddings are randomly initialized and jointly optimized with the canonical field and deformation network parameters via photometric, perceptual, and regularization losses over rendered and ground truth images. Losses typically include:

$\mathcal{L}_1$ and LPIPS for photo-consistency.
D-SSIM for structural similarity.
Regularizers on covariance scales or opacities to avoid degenerate collapse or floaters.
Local smoothness constraints: often a KNN-based weighted $\ell_2$ penalty between neighboring embeddings:

$\mathcal{L}_{\mathrm{smooth}} = \sum_{i}\sum_{j\in\mathcal{N}_k(i)} w_{ij}\,\lVert e_i - e_j\rVert_2^2$

where $\mathcal{N}_k(i)$ denotes the $k$ nearest neighbors of Gaussian $i$ and the weights $w_{ij}$ decay with spatial distance (Saggar et al., 8 Mar 2026, Jiao et al., 21 Mar 2026).
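A brute-force sketch of such a KNN-weighted smoothness penalty is given below. The Gaussian decay of the weights with squared distance, the bandwidth `sigma`, and the unnormalized sum are assumptions; the cited works may use a different weighting or normalization.

```python
import numpy as np

def knn_smoothness(positions, embeddings, k=3, sigma=1.0):
    """Sum over i and its k nearest neighbors j of w_ij * ||e_i - e_j||^2."""
    # Pairwise squared distances between Gaussian centers, shape (n, n).
    d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                     # exclude self-matches
    nbr = np.argsort(d2, axis=1)[:, :k]              # (n, k) neighbor indices
    nbr_d2 = np.take_along_axis(d2, nbr, axis=1)
    w = np.exp(-nbr_d2 / (2.0 * sigma ** 2))         # assumed decay with distance
    diff = embeddings[:, None, :] - embeddings[nbr]  # (n, k, d)
    return np.sum(w * np.sum(diff ** 2, axis=-1))
```

The O(n^2) distance matrix is only practical for small point sets; a spatial index would replace it at realistic Gaussian counts.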

In some SLAM/medical applications, a per-Gaussian deformation probability $p_i$ is learned in a self-supervised manner, mediating rigid vs. deformable updates (Shan et al., 19 Feb 2026).
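One plausible reading of this gating, shown purely as an illustration, scales each predicted non-rigid displacement by a probability obtained from a learnable logit. The sigmoid parameterization and the function names are assumptions and are not taken from NRGS-SLAM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(mu, logits, d_mu):
    """Blend rigid (no displacement) and deformable behavior per Gaussian."""
    p = sigmoid(logits)                  # (N,) assumed deformation probabilities
    return mu + p[:, None] * d_mu, p     # p -> 0 keeps the Gaussian rigid
```

Gaussians with probability near zero then act as stable anchors, for example for camera tracking, while those near one follow the deformation field.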

Joint training can involve staged optimization, for instance freezing temporal embeddings initially or using progressive training to stabilize static geometry before deformation learning (Ho et al., 5 Dec 2025).
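A staged schedule of this kind can be expressed as a small function that decides which parameter groups receive gradient updates at a given iteration. The group names and the warm-up length are hypothetical.

```python
def trainable_groups(iteration, warmup_iters=3000):
    """Return which (hypothetical) parameter groups are optimized at this iteration."""
    deform_on = iteration >= warmup_iters
    return {
        "canonical_gaussians": True,           # static geometry trains from the start
        "per_gaussian_embeddings": deform_on,  # frozen during warm-up
        "temporal_embeddings": deform_on,      # frozen during warm-up
        "deformation_mlp": deform_on,          # frozen during warm-up
    }
```

In a training loop, the returned flags would typically select which optimizer parameter groups are stepped.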

4. Applications and Experimental Findings

Static and Novel-view Synthesis

LeanGaussian and DIG3D achieve competitive or superior PSNR and SSIM compared to point/pixel-correspondence approaches, with high inference and rendering speed and substantial resource efficiency (Wu et al., 2024).

Dynamic Scene Reconstruction

Per-Gaussian embedding deformation is the state of the art for dynamic scene modeling in 3DGS/4DGS. On benchmarks such as Neural 3D Video, Technicolor Light Field, and HyperNeRF, methods such as that of Bae et al. (2024) consistently outperform coordinate-based deformation fields in terms of PSNR, SSIM, and LPIPS, especially in scenes with fine object or region-specific motion.

CoDa-4DGS demonstrates superior dynamic scene reconstruction in autonomous driving, with sharper object boundaries and higher temporal consistency due to contextual and deformation embeddings (Song et al., 9 Mar 2025).

TED-4DGS demonstrates that embedding-based deformation is compatible with high-efficiency rate-distortion optimization, adding only a limited number of extra parameters and producing compact, flexible dynamic models (Ho et al., 5 Dec 2025).

Talking Head Synthesis and Fine-scale Facial Editing

EmbedTalk replaces tri-plane volumetric fields with a small per-Gaussian embedding, halving model size and doubling inference speed while improving PSNR, SSIM, LPIPS, and lip landmark distance (LMD) (Saggar et al., 8 Mar 2026). DEGSTalk extends this to long-hair talking face synthesis with per-Gaussian embeddings plus tri-plane local coding, achieving high PSNR, low LPIPS, and realistic dynamic hair motion (Deng et al., 2024).

SLAM and Medical Imaging

NRGS-SLAM introduces Bayesian self-supervision to learn a per-Gaussian deformation probability and spatio-temporally smooth embedding-based updates, yielding robust monocular non-rigid SLAM for endoscopic scenes (Shan et al., 19 Feb 2026). This achieves substantial improvements in camera pose estimation accuracy and photo-realistic reconstructions over competing methods.

Mesh-based and Editable Deformation

Mesh-based Gaussian splatting frameworks anchor splats to explicit mesh locations, using per-Gaussian SE(3)-and-scale transforms for local and articulated deformation (B, 9 Jul 2025). Embedding-based optimization coupled with explicit physics (XPBD) supports interactive design, VR, and complex non-rigid edits at high frame rates.

5. Methodological Variants and Design Choices

Design Dimension	Example Implementation	Key Significance
Embedding dimension	Method-specific $d$ (Saggar et al., 8 Mar 2026, Bae et al., 2024; Jiao et al., 21 Mar 2026)	Controls deformation capacity and model compactness
Temporal embedding	Grid/interpolated (Bae et al., 2024, Ho et al., 5 Dec 2025, Jiao et al., 21 Mar 2026)	Multiscale, coarse-to-fine modeling
Deformation MLP input	Embedding, temporal, context, and audio features	Flexibly conditions deformation on various signals
Regularization	KNN smoothness, opacity, TV (Jiao et al., 21 Mar 2026, Ho et al., 5 Dec 2025)	Prevents overfitting and floaters
Optimization schedule	Freezing of deformation/temporal for initial epochs	Stabilizes early-stage learning
Deformation targets	All attributes or subset (position, opacity, color)	Avoids implausible deformation artifacts

Local embedding regularization and careful initialization are critical for preventing isolated artifacts ("floaters"), drift in dynamic scenes, and instability in sparse initializations. Coarse-to-fine temporal coding captures both high-frequency motion and global drift, as shown in both FRoG and Bae et al. (2024). Mesh-anchored embedding further aids structural and physics-aware deformations (B, 9 Jul 2025).

6. Performance Metrics and Empirical Results

Across settings, embedding-based per-Gaussian deformation yields state-of-the-art quantitative results:

Novel-view synthesis: LeanGaussian reports a PSNR gain over previous methods (PSNR 25.44/22.36 on ShapeNet/Google Scanned Objects) and near-real-time speed (Wu et al., 2024).
Dynamic video: +0.23 to +3.76 dB PSNR gain and -0.014 to -0.073 LPIPS reduction versus coordinate-based dynamic 4DGS (Bae et al., 2024).
Talking heads: EmbedTalk achieves PSNR=35.19 (vs. 32.5–33.2), LPIPS=0.021 (vs. 0.034), at 61 FPS mobile inference (Saggar et al., 8 Mar 2026).
SLAM: NRGS-SLAM reduces pose RMSE by up to 50% (Shan et al., 19 Feb 2026).
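To make the dB figures above easier to interpret, the sketch below computes PSNR under its standard definition and converts a PSNR gain into the implied reduction in mean squared error. The `max_val=1.0` default assumes images scaled to [0, 1]; the function names are illustrative.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Since PSNR is logarithmic in MSE, a gain of 10 dB corresponds to a tenfold reduction in mean squared error, and smaller gains scale accordingly.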

Ablation studies across works validate that removing per-Gaussian embeddings, discarding embedding regularization, or restricting deformation to only global or coordinate-based fields substantially degrades both fidelity and motion realism (e.g., PSNR drops by 0.5 dB, LMD and SyncNet metrics degrade in talking head applications).

7. Context, Significance, and Outlook

Embedding-based per-Gaussian deformation fundamentally decouples data-driven, local motion modeling from global field parameterization. It enables not only high-fidelity rendering and efficient encoding but also intuitive, controllable, and extensible pipelines for dynamic 3DGS across domains. By leveraging compact, learnable vectors to partition local deformation capacity, these methods outperform tri-plane, coordinate-based, or purely global deformation techniques, particularly in scenes with significant non-rigid or local motion. The paradigm is extensible to articulated object editing, compressed transmission, medical SLAM, and real-time content creation.

Current research directions include hierarchical or multi-scale embedding schemes, context-adaptive and cross-modal embedding fusion (e.g., combining semantic, audio, and physical priors), and optimal embedding allocation for rate-distortion tradeoff (Ho et al., 5 Dec 2025). Challenges remain in scaling to ultra-high-resolution scenes, balancing interpretability and generalization, and coupling with explicit constraints or physics. Even so, embedding-based per-Gaussian deformation is established as the leading approach for complex, high-performance dynamic 3D scene representation in the Gaussian Splatting literature (Wu et al., 2024, Bae et al., 2024, Saggar et al., 8 Mar 2026, Song et al., 9 Mar 2025, Ho et al., 5 Dec 2025, Deng et al., 2024, Jiao et al., 21 Mar 2026, Shan et al., 19 Feb 2026, B, 9 Jul 2025).