Static-Dynamic 4D Hash Encoding
- Static-dynamic decoupled 4D decomposed hash encoding is a technique that factorizes spatio-temporal features through overlapping 3D hash grids to model static geometry and dynamic motion efficiently.
- The method employs spatial and temporal MLPs with a multi-head decoder to predict deformation parameters, enhancing physical consistency and rendering quality.
- Its integration with physics-informed losses and optical flow supervision results in improved PSNR, SSIM, and LPIPS metrics compared to traditional 4D grids.
Static-dynamic decoupled 4D decomposed hash encoding is a core computational construct designed for efficient and physically consistent representation of spatio-temporal geometry and motion in dynamic 3D Gaussian Splatting systems. It appears as a central technique in Physics-Informed Deformable Gaussian Splatting (PIDG) (Hong et al., 9 Nov 2025), which aims to unify explicit 3D representations with continuum-mechanics principles for the reconstruction and simulation of dynamic materials from monocular video.
1. Formal Definition and Conceptual Overview
In the PIDG framework, the objective is to represent both static and dynamic scene aspects over the 4D domain $(x, y, z, t)$. Unlike monolithic 4D feature grids or solely MLP-based encodings, static-dynamic decoupled 4D decomposed hash encoding factorizes this 4D space via four overlapping 3D hash grids:

$$\mathcal{H} = \{\, H_{xyz},\ H_{xyt},\ H_{xzt},\ H_{yzt} \,\}.$$

This structure enables localized, high-frequency encoding of scene geometry and temporally varying motion fields, while separating static features (via $H_{xyz}$) from dynamically evolving ones (via the three grids involving $t$).
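The four-grid factorization can be sketched as a single-level hash lookup per grid. This is a minimal NumPy illustration, not the paper's implementation: the table size, feature dimension, resolution, and hash constants are illustrative stand-ins, and trilinear blending is omitted.

```python
import numpy as np

TABLE_SIZE = 2 ** 14   # hash table entries per grid (illustrative)
FEAT_DIM = 4           # feature channels per entry (illustrative)
RES = 64               # virtual grid resolution per axis (illustrative)
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

rng = np.random.default_rng(0)
# One learnable table per 3D grid: (x,y,z), (x,y,t), (x,z,t), (y,z,t)
tables = {k: rng.normal(0, 0.01, (TABLE_SIZE, FEAT_DIM))
          for k in ("xyz", "xyt", "xzt", "yzt")}

def hash_index(ijk):
    """Spatial hash of integer 3D coordinates (Instant-NGP-style XOR hash)."""
    ijk = np.asarray(ijk).astype(np.uint64)
    return int((ijk[0] * PRIMES[0]) ^ (ijk[1] * PRIMES[1]) ^ (ijk[2] * PRIMES[2])) % TABLE_SIZE

def lookup(table, p3):
    """Nearest-vertex lookup for a point in [0,1]^3 (trilinear blending omitted)."""
    ijk = np.clip(np.round(p3 * (RES - 1)), 0, RES - 1)
    return table[hash_index(ijk)]

def encode(x, y, z, t):
    """Project a 4D query onto the four overlapping 3D grids and gather features."""
    projections = {"xyz": (x, y, z), "xyt": (x, y, t),
                   "xzt": (x, z, t), "yzt": (y, z, t)}
    return {k: lookup(tables[k], np.array(p)) for k, p in projections.items()}

feats = encode(0.3, 0.7, 0.5, 0.1)
f_static = feats["xyz"]                                                  # static branch
f_dynamic = np.concatenate([feats["xyt"], feats["xzt"], feats["yzt"]])   # temporal branch
```

The static branch yields one feature vector while the temporal branch concatenates three, mirroring the decoupling described above.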
A spatial MLP $\phi_s$ operates on the static feature $f_{xyz}$ to compute an "attention" scalar $\alpha \in (0,1)$ modulating the influence of static geometry. For dynamic features, a temporal MLP $\phi_t$ processes the concatenated features $[f_{xyt}, f_{xzt}, f_{yzt}]$ from the three temporal grids. The output is a weighted combination:

$$f = \alpha \, f_{xyz} + (1 - \alpha)\, \phi_t\big([f_{xyt}, f_{xzt}, f_{yzt}]\big).$$

A multi-head decoder subsequently predicts deformation increments: rotation $\Delta q$ (quaternion), translation $\Delta \mu$, and scale update $\Delta s$, which update the canonical Gaussian parameters $(q, \mu, s)$.
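The fusion-and-decode step above can be sketched as follows. The 256-unit ReLU layer sizes follow the text; the random weights, the sigmoid attention, and the exact $(1-\alpha)$ weighting are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
F_S, F_D, H = 4, 12, 256   # static dim, concatenated temporal dim, hidden width

def mlp(x, w1, b1, w2, b2):
    """Single-hidden-layer ReLU MLP."""
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

# Spatial MLP -> scalar attention; temporal MLP -> dynamic feature
Ws1, bs1 = rng.normal(0, .1, (F_S, H)), np.zeros(H)
Ws2, bs2 = rng.normal(0, .1, (H, 1)), np.zeros(1)
Wt1, bt1 = rng.normal(0, .1, (F_D, H)), np.zeros(H)
Wt2, bt2 = rng.normal(0, .1, (H, F_S)), np.zeros(F_S)

f_s = rng.normal(size=F_S)   # feature from the (x,y,z) grid (stand-in)
f_d = rng.normal(size=F_D)   # concatenated temporal-grid features (stand-in)

alpha = 1.0 / (1.0 + np.exp(-mlp(f_s, Ws1, bs1, Ws2, bs2)))[0]    # attention in (0,1)
f = alpha * f_s + (1 - alpha) * mlp(f_d, Wt1, bt1, Wt2, bt2)      # assumed fusion rule

# Multi-head decoder: per-Gaussian deformation increments
Wd1, bd1 = rng.normal(0, .1, (F_S, H)), np.zeros(H)
Wd2, bd2 = rng.normal(0, .1, (H, 4 + 3 + 3)), np.zeros(10)
out = mlp(f, Wd1, bd1, Wd2, bd2)
dq, dmu, ds = out[:4], out[4:7], out[7:10]   # rotation quat, translation, scale update
```

Splitting one decoder output into three heads keeps the per-Gaussian update cheap while sharing the fused feature.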
This encoding scheme facilitates the separation and differential optimization of static and dynamic regions during training, typically implemented in two stages:
- Densification: all Gaussians are optimized for both static geometry and dynamic deformation.
- Refinement: a dynamic mask freezes static Gaussians (parameter fine-tuning only) and allows dynamic Gaussians to update their deformation fields.
2. Architectural Implementation
Feature extraction, modulation, and parameter update within the encoding paradigm use shallow neural networks:
- Spatial MLP $\phi_s$ and temporal MLP $\phi_t$: each is a single hidden layer (256 units, ReLU).
- Decoder: a lightweight two-layer MLP (256 units each, ReLU) that outputs the deformation increments $(\Delta q, \Delta \mu, \Delta s)$ per Gaussian.
Hash grids are optimized with hierarchical learning rates (hash-grid rates up to 50× the MLP rate) via Adam.
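A minimal sketch of the hierarchical-rate scheme: one Adam step applied to two parameter groups, with the hash-grid group running at 50× the MLP rate. The base rate, the Adam moment coefficients, and the dummy gradient are illustrative assumptions.

```python
import numpy as np

def adam_step(p, g, m, v, lr, t, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first/second moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

base_lr = 1e-3                      # assumed MLP learning rate
groups = {
    "mlp":  {"p": np.ones(4), "lr": base_lr},
    "hash": {"p": np.ones(4), "lr": 50 * base_lr},   # up to 50x higher
}
for g_ in groups.values():
    g_["m"], g_["v"] = np.zeros(4), np.zeros(4)

grad = np.full(4, 0.5)              # same dummy gradient for both groups
for g_ in groups.values():
    g_["p"], g_["m"], g_["v"] = adam_step(g_["p"], grad, g_["m"], g_["v"], g_["lr"], t=1)
```

Because the update direction is identical, the hash-grid parameters move 50× farther per step, which is the point of the hierarchy: coarse MLPs stay stable while high-capacity tables adapt quickly.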
Memory scaling is determined by the number of hash grids, each of size $O(N^3)$; compared to a full 4D grid at $O(N^4)$, this saves significant resources while still capturing high-frequency spatio-temporal detail.
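The savings are easy to quantify. For a per-axis resolution $N$ and feature dimension $F$ (values below are illustrative), four dense 3D grids cost $4N^3F$ entries versus $N^4F$ for one 4D grid, a factor of $N/4$:

```python
# Entry-count comparison: four 3D grids vs. one dense 4D grid.
# N and F are illustrative; hashing bounds the 3D tables even lower.
N, F = 64, 4
decomposed = 4 * N ** 3 * F      # four O(N^3) grids (upper bound; hashing caps this)
dense_4d = N ** 4 * F            # one O(N^4) grid
ratio = dense_4d / decomposed    # savings factor = N / 4
```

At $N = 64$ the decomposition is 16× smaller even before hashing caps each table's size.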
3. Separation of Static and Dynamic Scene Components
The key advantage of the decoupled encoding is rigorous separation of static and dynamic regions within the optimization process. Initially, all Gaussians participate in joint optimization of both static and dynamic fields. After a dynamic mask is computed (often based on magnitude or consistency of deformation vectors), the system "freezes" parameters of static Gaussians, whereas dynamic Gaussians continue updating deformation fields.
This enables more stable learning, better generalization, and artifact-free rendering in static regions, while permitting flexible motion modeling in dynamic zones.
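The refinement-stage freezing can be sketched as gradient masking. The median-based threshold and all tensors here are hypothetical stand-ins; the paper's mask criterion (deformation magnitude or consistency) is only paraphrased.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
deform_grad = rng.normal(size=(n, 3))              # grads w.r.t. deformation params
deform_mag = np.linalg.norm(rng.normal(size=(n, 3)), axis=1)  # per-Gaussian motion

dynamic_mask = deform_mag > np.median(deform_mag)  # assumed threshold rule
masked_grad = deform_grad * dynamic_mask[:, None]  # static Gaussians get zero update
```

Static Gaussians thus receive no deformation updates during refinement, while their appearance parameters can still be fine-tuned separately.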
4. Integration with Physics-Informed Constraints
Static-dynamic decoupled hash encoding is intertwined with PIDG's physics-informed loss, specifically the Cauchy momentum residual

$$r = \rho \frac{D v}{D t} - \nabla \cdot \sigma - \rho\, b,$$

where the residual is computed pointwise from momentum and stress balance: the material acceleration $\rho\, Dv/Dt$ must be balanced by the stress divergence $\nabla \cdot \sigma$ and the body force $\rho\, b$. Hash-encoded features feed into a multi-head MLP, which predicts both the velocity $v$ and the Cauchy stress $\sigma$ for each Gaussian.
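The pointwise momentum-balance penalty can be sketched as below. The density, body force, and the stress-divergence term are supplied stand-ins; in practice the divergence would come from automatic differentiation of the predicted stress field.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
rho = 1.0                                # assumed density
b = np.array([0.0, 0.0, -9.81])          # body force (gravity, assumed)
dv_dt = rng.normal(size=(n, 3))          # material acceleration per particle (stand-in)
div_sigma = rng.normal(size=(n, 3))      # divergence of predicted Cauchy stress (stand-in)

# Cauchy momentum residual r = rho * Dv/Dt - div(sigma) - rho * b, per particle
residual = rho * dv_dt - div_sigma - rho * b
loss_phys = float(np.mean(residual ** 2))  # momentum-balance penalty
```

Driving this residual toward zero pushes the learned velocity and stress fields toward a physically consistent momentum balance.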
This approach supports unified learning of diverse constitutive laws (fluid-like or solid-like) across the scene without manual switching, controlled exclusively by the material field parameters inferred from hash grid embeddings.
5. Supervision via Optical Flow
Supervision is augmented with camera-compensated 2D optical flow and Lagrangian particle flow matching:
- Pretrained 2D flow networks estimate pixelwise motion, used as pseudo-ground-truth.
- For each pixel, the top-$K$ Gaussians (by splatting weight) are identified to compute a Gaussian flow and a velocity flow, both compared with the pseudo-ground-truth $u_{2D}$ via a flow-matching loss of the form $\mathcal{L}_{\text{flow}} = \| u_{\text{Gauss}} - u_{2D} \|_1 + \| u_{\text{vel}} - u_{2D} \|_1$.
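The per-pixel top-$K$ matching can be sketched as follows; $K$, the splatting weights, the per-Gaussian flows, the pseudo-ground-truth, and the L1 penalty form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3                                      # assumed top-K cutoff
weights = rng.random(8)                    # per-Gaussian splatting weights at this pixel
flows_2d = rng.normal(size=(8, 2))         # projected per-Gaussian 2D flows (stand-in)

top = np.argsort(weights)[-K:]             # indices of the top-K contributors
w = weights[top] / weights[top].sum()      # renormalized blending weights
gaussian_flow = (w[:, None] * flows_2d[top]).sum(axis=0)

pseudo_gt = np.array([0.8, -0.2])          # pretrained 2D-flow output (stand-in)
loss_flow = float(np.abs(gaussian_flow - pseudo_gt).sum())  # L1 flow-matching term
```

Restricting supervision to the highest-weight Gaussians keeps the loss focused on the particles that actually determine each pixel's appearance.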
6. Empirical Performance and Evaluation
Experimental results on the PIDG synthetic dataset, D-NeRF synthetic scenes, and real-world HyperNeRF videos demonstrate that static-dynamic decoupled 4D decomposed hash encoding yields consistently superior PSNR, SSIM, and LPIPS metrics versus prior art (Grid4D, MotionGS, GaussianPredict). PIDG attains visibly sharper static details, more coherent motion, and better generalization for future prediction in dynamic scenes.
On the real-world HyperNeRF scenes, the method achieves +0.5 dB PSNR and +0.01 MS-SSIM over the strongest baselines, while isolating dynamic processes with high physical consistency.
7. Limitations, Advantages, and Ongoing Research
Key advantages include:
- Physical consistency and continuum mechanical realism via Cauchy-residual loss and flow matching
- Efficient, scalable memory usage ($O(N^3)$ per grid instead of $O(N^4)$)
- Unified representation for multiple constitutive behaviors
Notable limitations are:
- High computational and memory demands for joint optimization over millions of particles and physics-based losses
- Limited expressiveness for complex material behaviors (nonlinear elastoplasticity, fracture)
- Reliance on accurate optical flow and masks for supervision, which may be error-prone
Planned future work includes real-time inference, richer constitutive modeling (viscoelasticity, plasticity), multi-scale hierarchical renormalization, and multi-view/RGB-D extensions.
This encoding paradigm is foundational for PIDG’s advances in dynamic reconstruction quality and physical generalization from monocular video, formalizing a robust methodology for 4D scene decomposition in physics-informed explicit scene representations (Hong et al., 9 Nov 2025).