SS4D: Structured Spacetime Latents for 4D Modeling

Updated 4 July 2026

SS4D is a native 4D generative model that directly synthesizes dynamic 3D objects from a single video using structured spacetime latents decoded into 3D Gaussian sequences.
It transfers a pre-trained 3D latent generator into the 4D domain by adding temporal alignment, dedicated temporal layers, and a latent compression module to maintain coherence and efficiency.
Benchmark results show that SS4D outperforms prior optimization-based and reconstruction-based methods in fidelity and temporal coherence on ObjaverseDy and Consistent4D and in a DAVIS user study, while running far faster than optimization-based pipelines.

SS4D, short for Structured Spacetime Latents for 4D Generative Modeling, is a native 4D generative model that synthesizes dynamic 3D objects directly from a monocular video. Rather than constructing a 4D representation by optimizing over 3D or video generative models, it is trained directly on 4D data and generates a compressed set of structured spacetime latents that are decoded into a sequence of 3D Gaussians. Its stated objective is to combine high fidelity, temporal coherence, and structural consistency while remaining substantially faster than optimization-based 4D pipelines at inference time (Li et al., 16 Dec 2025).

1. Problem formulation and position within 4D generation

The central task addressed by SS4D is: given a single RGB video $\mathcal{I}=\{I_t\}_{t=1}^{T}$ of a moving object, produce a dynamic 3D asset that can be rendered from novel viewpoints at each time step. In the paper’s terminology, SS4D is a native 4D generative model because it learns a model over 4D content directly, rather than reconstructing a 4D result by per-instance optimization (Li et al., 16 Dec 2025).

This positioning is explicit in its contrast with two prior families of methods. First, SDS-based optimization pipelines such as DreamGaussian4D, Consistent4D, and STAG4D optimize a 4D scene representation with Score Distillation Sampling; the paper characterizes these methods as slow and notes that they can produce over-saturated or unstable results. Second, feed-forward reconstruction pipelines such as L4GM or video-diffusion-based methods reconstruct 4D from synthesized multi-view or multi-frame outputs, but are described as often suffering from noisy geometry or weak spatio-temporal consistency. SS4D is presented as a direct generative alternative to both classes (Li et al., 16 Dec 2025).

A central difficulty is the scarcity of 4D training data relative to image or video corpora. The paper’s response is not to learn a 4D generator from scratch, but to transfer a pre-trained 3D latent generator into spacetime. This design choice is motivated by the claim that limited dynamic 3D data alone may be insufficient to learn strong spatial structure, temporal coherence, and robust long-range motion (Li et al., 16 Dec 2025).

2. Structured spacetime latents and output representation

SS4D extends the structured latent representation of TRELLIS from 3D to 4D. In the 3D formulation, an object is represented as sparse activated voxels with feature vectors

$f = \{(f_i,p_i)\}_{i=1}^{L}, \qquad f_i\in\mathbb{R}^{C},\; p_i\in\{0,1,\dots,N-1\}^3,$

where $p_i$ is the voxel coordinate and $f_i$ is a local feature aggregated from multi-view renderings. These voxel features are encoded by a 3D VAE into structured latents

$z=\mathcal{E}(f)=\{(z_i,p_i)\}_{i=1}^{L}, \qquad z_i\in\mathbb{R}^{D}.$

SS4D generalizes this to structured spacetime latents $Z=\{z_t\}_{t=1}^{T}$ (Li et al., 16 Dec 2025).
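As a rough illustration of this representation, the sketch below stores each frame $z_t$ as a list of active voxel coordinates with aligned latent vectors and compares the sparse token count against a dense grid. The class and function names, the grid size of 64, and the latent dimension of 2 are all assumptions chosen for the example, not values from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[int, int, int]  # voxel coordinate p_i in {0, ..., N-1}^3


@dataclass
class StructuredLatent:
    """One frame z_t: L active voxels, each with a D-dimensional latent."""
    coords: List[Coord]        # p_i, one per active voxel
    feats: List[List[float]]   # z_i in R^D, aligned with coords

    def __post_init__(self) -> None:
        if len(self.coords) != len(self.feats):
            raise ValueError("coords and feats must have the same length")


def total_tokens(sequence: List[StructuredLatent]) -> int:
    """Number of sparse elements in Z = {z_t}, summed over all frames."""
    return sum(len(frame.coords) for frame in sequence)


def dense_cells(num_frames: int, grid_size: int) -> int:
    """Cell count of a dense T x N x N x N grid, for comparison."""
    return num_frames * grid_size ** 3


# Toy sequence: two frames on a hypothetical 64^3 grid with D = 2.
Z = [
    StructuredLatent(coords=[(1, 2, 3), (1, 2, 4)], feats=[[0.1, 0.2], [0.3, 0.4]]),
    StructuredLatent(coords=[(1, 2, 3)], feats=[[0.5, 0.6]]),
]
print(total_tokens(Z), "sparse tokens vs", dense_cells(len(Z), 64), "dense cells")
```

The number of active voxels may differ from frame to frame, which is why each frame carries its own coordinate list.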

The paper describes the resulting latent representation as a sparse set of 4D elements rather than a dense $T\times X\times Y\times Z$ tensor. This preserves the sparse spatial inductive bias of TRELLIS while organizing the latent sequence across time. The model pipeline is correspondingly factorized into coarse structure prediction, latent generation, and decoding: a 4D Flow Transformer predicts a coarse voxel-based spatiotemporal structure $P=\{p_t\}$; a 4D Sparse Flow Transformer generates structured spacetime latents $Z$ conditioned on $P$; and the latent sequence is decoded into a sequence of 3D Gaussians, one set per frame:

$G=\{G_t\}_{t=1}^{T}.$

The paper emphasizes that this Gaussian sequence is the final 4D output (Li et al., 16 Dec 2025).
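The three-stage factorization can be sketched as a simple function composition. Everything below is a hypothetical skeleton: the function names, type aliases, and toy stand-in stages are assumptions made for illustration, and the per-frame length check reflects this summary's reading that one Gaussian set is produced per time step.

```python
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[int, int, int]
Structure = List[List[Coord]]       # P: active voxel coordinates per frame
Latents = List[List[List[float]]]   # Z: one latent vector per active voxel per frame
Gaussians = List[List[dict]]        # one list of Gaussian parameter dicts per frame


def generate_4d(
    video: Sequence[object],
    predict_structure: Callable[[Sequence[object]], Structure],
    generate_latents: Callable[[Sequence[object], Structure], Latents],
    decode: Callable[[Structure, Latents], Gaussians],
) -> Gaussians:
    """Chain the stages: structure P, then latents Z given P, then Gaussians."""
    structure = predict_structure(video)
    latents = generate_latents(video, structure)
    gaussians = decode(structure, latents)
    if len(gaussians) != len(video):
        raise ValueError("expected one Gaussian set per input frame")
    return gaussians


# Toy stand-ins so the sketch runs; the real stages are learned networks.
def toy_structure(video: Sequence[object]) -> Structure:
    return [[(0, 0, 0)] for _ in video]


def toy_latents(video: Sequence[object], structure: Structure) -> Latents:
    return [[[0.0] for _ in frame] for frame in structure]


def toy_decode(structure: Structure, latents: Latents) -> Gaussians:
    return [
        [{"center": c, "latent": z} for c, z in zip(coords, feats)]
        for coords, feats in zip(structure, latents)
    ]


result = generate_4d(["frame0", "frame1"], toy_structure, toy_latents, toy_decode)
print(len(result), "Gaussian sets")
```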

The transfer from 3D to 4D is implemented by fine-tuning TRELLIS’s autoencoder and generator. Four models are named in the pipeline: 4D Structure VAE, 4D Flow Transformer, 4D Sparse VAE, and 4D Sparse Flow Transformer. Among these, the 4D Structure VAE is frozen, while the remaining three models are fine-tuned. This preserves a strong spatial prior while adapting the generator to the temporal dimension (Li et al., 16 Dec 2025).

3. Temporal alignment, temporal layers, and latent compression

Temporal consistency in SS4D is enforced architecturally rather than by an explicit scalar temporal loss. The first mechanism is Temporal Alignment, which inserts temporal reasoning into pre-trained spatial layers. In the transformer, spatial self-attention layers are extended into temporal self-attention layers by rearranging the dimensions of the token tensor, namely the batch size $B$, the number of frames $T$, the attention length $L$, and the feature dimension $C$, so that attention also spans the frame axis. The paper stresses that temporal alignment is applied to both the generator path and the VAE, because a static 3D-trained VAE otherwise introduces flickering during encode/decode of temporally coherent sequences (Li et al., 16 Dec 2025).
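The exact rearrangement is not recoverable from this summary, so the sketch below shows one common pattern from video-diffusion temporal layers as an assumption: a tensor of shape $(B\cdot T)\times L\times C$ is regrouped to $(B\cdot L)\times T\times C$, so that an attention layer operating over the middle axis mixes frames instead of voxels. Nested lists stand in for tensors, the function names are hypothetical, and the pattern presumes that token slot $l$ refers to the same element in every frame.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # nested lists standing in for a 3-D tensor


def spatial_to_temporal(x: Tensor3, batch: int, frames: int) -> Tensor3:
    """Rearrange (B*T, L, C) -> (B*L, T, C).

    Input rows of length L hold the tokens of one frame; output rows of
    length T hold one token slot followed across all frames.
    """
    if len(x) != batch * frames:
        raise ValueError("leading dimension must equal batch * frames")
    length = len(x[0])
    out: Tensor3 = []
    for b in range(batch):
        for l in range(length):
            out.append([x[b * frames + t][l] for t in range(frames)])
    return out


def temporal_to_spatial(y: Tensor3, batch: int, length: int) -> Tensor3:
    """Inverse rearrangement (B*L, T, C) -> (B*T, L, C)."""
    if len(y) != batch * length:
        raise ValueError("leading dimension must equal batch * length")
    frames = len(y[0])
    out: Tensor3 = []
    for b in range(batch):
        for t in range(frames):
            out.append([y[b * length + l][t] for l in range(length)])
    return out


# Toy tensor with B=2, T=3, L=2, C=1; each value encodes its (b, t, l) index.
B, T, L = 2, 3, 2
x = [[[100.0 * b + 10.0 * t + l] for l in range(L)] for b in range(B) for t in range(T)]
y = spatial_to_temporal(x, B, T)
print(y[0])  # slot l=0 of batch item 0, followed across the three frames
```

Because the rearrangement only regroups entries, applying `temporal_to_spatial` afterwards returns the original layout, which is what lets a pre-trained spatial layer and an added temporal pass share the same token tensor.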

A second mechanism consists of dedicated Temporal Layers. These are described as temporal self-attention layers with shifted windows, inspired by Swin Transformer. Their function is to reason across frames without incurring the cost of full attention over all frames. The temporal layer retains the original absolute 3D positional embedding and adds a 1D Rotary Position Embedding (RoPE) along the temporal axis, so that frame order and temporal relationships are represented while preserving spatial reasoning (Li et al., 16 Dec 2025).
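Two ingredients of this description can be illustrated in isolation: partitioning frame indices into shifted windows, and rotating feature pairs by an angle proportional to the frame index (1D RoPE). The sketch below is a minimal stand-alone version under stated assumptions; the window size, shift, and RoPE base of 10000 are illustrative defaults rather than values from the paper, and the masking that Swin-style attention applies to wrapped-around windows is omitted.

```python
import math
from typing import List


def temporal_windows(num_frames: int, window: int, shift: int = 0) -> List[List[int]]:
    """Partition frame indices into attention windows after a cyclic shift."""
    order = [(t + shift) % num_frames for t in range(num_frames)]
    return [order[i:i + window] for i in range(0, num_frames, window)]


def rope_1d(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive feature pairs by angles proportional to `pos`."""
    if len(vec) % 2:
        raise ValueError("feature dimension must be even")
    out: List[float] = []
    for k in range(0, len(vec), 2):
        theta = pos * base ** (-k / len(vec))  # one frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[k] * c - vec[k + 1] * s, vec[k] * s + vec[k + 1] * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


print(temporal_windows(8, 4))           # unshifted windows
print(temporal_windows(8, 4, shift=2))  # shifted windows bridge the old boundaries

q, k = [1.0, 0.5, -0.3, 2.0], [0.2, -1.0, 0.7, 0.4]
# The attention score depends only on the frame offset (2 in both cases).
print(dot(rope_1d(q, 5), rope_1d(k, 3)), dot(rope_1d(q, 9), rope_1d(k, 7)))
```

Alternating between unshifted and shifted partitions lets information propagate across window boundaries over successive layers, while each attention call still sees only `window` frames.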

The third mechanism is CompNet, a latent compression module for long-sequence processing. The paper describes it as a 4D compression and convolutional network built from sparse 3D convolution with spatial downsampling, sparse 1D convolution for cross-frame communication, and a temporal downsampling block that packs two active voxels from the same $(x,y,z)$ position across frames. The process starts from the structured spacetime latents

$Z=\{z_t\}_{t=1}^{T}.$

The temporal length is shortened before later Transformer blocks, then restored by reversing the process with skip connections. The paper states that this factorized 4D compression is the key to efficient training and inference over long video sequences, although it does not provide a detailed asymptotic complexity analysis (Li et al., 16 Dec 2025).
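The temporal packing step can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes that frames are merged in adjacent pairs, that the two features are concatenated, and that a voxel active in only one frame of a pair is zero-filled on the missing side. The function name and the dictionary-based frame type are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
Frame = Dict[Coord, List[float]]  # active voxel coordinate -> latent feature


def pack_frame_pairs(frames: List[Frame], dim: int) -> List[Frame]:
    """Halve the temporal length by merging frames (2k, 2k+1).

    Each voxel active in either frame yields one packed token whose feature
    is the concatenation of both frames' features at that coordinate.
    """
    if len(frames) % 2:
        raise ValueError("expected an even number of frames")
    zeros = [0.0] * dim  # assumed filler for a voxel missing in one frame
    packed: List[Frame] = []
    for k in range(0, len(frames), 2):
        first, second = frames[k], frames[k + 1]
        merged: Frame = {}
        for coord in sorted(set(first) | set(second)):
            merged[coord] = first.get(coord, zeros) + second.get(coord, zeros)
        packed.append(merged)
    return packed


frames = [
    {(0, 0, 0): [1.0], (1, 0, 0): [2.0]},
    {(0, 0, 0): [3.0]},
    {(2, 2, 2): [4.0]},
    {(2, 2, 2): [5.0]},
]
packed = pack_frame_pairs(frames, dim=1)
print(len(frames), "frames ->", len(packed), "packed frames")
```

In this toy case the five original tokens become three packed tokens with doubled feature width, which shortens the sequence seen by later Transformer blocks whenever voxels persist across frames.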

4. Training data, curation, and optimization strategy

The training set is a curated collection of 16,000 animated 3D objects from Objaverse and ObjaverseXL. Samples with poor visual quality or very little motion are filtered out. The paper also modifies TRELLIS’s voxel-feature aggregation procedure: instead of averaging across all rendered views, it uses only views in which a voxel is visible and discards voxels invisible in all views. This is reported to reduce feature noise and shorten the spacetime latent sequence (Li et al., 16 Dec 2025).
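A minimal sketch of visibility-aware aggregation, assuming per-view features have already been sampled for each voxel and that invisibility is marked with `None`. The data layout and function name are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int]
# For each voxel: one entry per rendered view, None where the voxel is not visible.
ViewFeatures = Dict[Coord, List[Optional[List[float]]]]


def aggregate_visible(per_view: ViewFeatures) -> Dict[Coord, List[float]]:
    """Average each voxel's features over the views that see it; drop unseen voxels."""
    out: Dict[Coord, List[float]] = {}
    for coord, views in per_view.items():
        seen = [feat for feat in views if feat is not None]
        if not seen:
            continue  # invisible in all views: discarded
        out[coord] = [sum(vals) / len(seen) for vals in zip(*seen)]
    return out


per_view: ViewFeatures = {
    (0, 0, 0): [[1.0, 2.0], None, [3.0, 4.0]],  # visible in two of three views
    (1, 1, 1): [None, None, None],              # never visible
}
aggregated = aggregate_visible(per_view)
print(aggregated)
```

Dropping never-visible voxels is what shortens the latent sequence, and excluding occluded views keeps unrelated surface colors from being averaged into a voxel's feature.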

Training uses a progressive curriculum over sequence length: first 8-frame sequences, then 16-frame, then 32-frame sequences. To improve robustness to occlusion and motion blur, the conditioning video is augmented with random black masks. The stated rationale is that this helps the model reconstruct hidden or blurred regions more plausibly (Li et al., 16 Dec 2025).
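The curriculum and the masking augmentation can be sketched in a few lines. The stage indexing, the single rectangular mask, and the `max_frac` size limit are assumptions for illustration; the paper's mask shapes, counts, and probabilities are not given in this summary.

```python
import random
from typing import List

Image = List[List[float]]  # H x W grayscale stand-in for an RGB frame


def sequence_length_for_stage(stage: int) -> int:
    """Curriculum from the text: 8 frames, then 16, then 32."""
    if stage < 0:
        raise ValueError("stage must be non-negative")
    return (8, 16, 32)[min(stage, 2)]


def apply_random_black_mask(frame: Image, max_frac: float, rng: random.Random) -> Image:
    """Return a copy of `frame` with one random rectangle set to 0.0 (black)."""
    h, w = len(frame), len(frame[0])
    mask_h = rng.randint(1, max(1, int(h * max_frac)))
    mask_w = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - mask_h)
    left = rng.randint(0, w - mask_w)
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for y in range(top, top + mask_h):
        for x in range(left, left + mask_w):
            out[y][x] = 0.0
    return out


frame = [[1.0] * 8 for _ in range(8)]
masked = apply_random_black_mask(frame, max_frac=0.5, rng=random.Random(0))
print([sequence_length_for_stage(s) for s in range(3)])
print(sum(v == 0.0 for row in masked for v in row), "pixels masked")
```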

The optimization setup is specified as follows: 8× A800 GPUs, AdamW (with the learning rate reported in the paper), FP16 mixed precision, batch size 2 per GPU for the generator and 1 per GPU for the VAE, and total training time of about 7–8 days. Input sequences longer than 36 frames are truncated to the first 36 frames. For synthetic benchmark rendering, the paper uses 32 camera viewpoints, renders up to 32 frames per object at a fixed resolution, and provides a front-view video as the conditioning input for all methods (Li et al., 16 Dec 2025).

The paper does not provide a full term-by-term loss decomposition in the excerpted description. What is specified is a fine-tuning regime over TRELLIS-style components under the data curation, curriculum, masking, and temporal-alignment procedures just described (Li et al., 16 Dec 2025).

5. Evaluation, quantitative results, and ablations

SS4D is evaluated on ObjaverseDy, Consistent4D, and DAVIS. It is compared against DG4D (DreamGaussian4D), Consistent4D, STAG4D, and L4GM. The reported metrics are LPIPS, CLIP-S, PSNR, SSIM, and FVD for synthetic benchmarks, plus a user study on DAVIS for Geometry Quality, Texture Quality, and Motion Coherence (Li et al., 16 Dec 2025).

The paper reports that SS4D achieves the best scores on all reported metrics on ObjaverseDy and again leads on Consistent4D. It is also substantially faster than optimization-based baselines, though slower than the fastest feed-forward baseline.

Setting	SS4D	Baseline results
ObjaverseDy	LPIPS 0.150; CLIP-S 0.932; PSNR 18.09; SSIM 0.842; FVD 465	LPIPS 0.189 (Consistent4D); CLIP-S 0.887 (L4GM); PSNR 15.96 (STAG4D); SSIM 0.821 (STAG4D); FVD 640 (Consistent4D)
Consistent4D	LPIPS 0.149; CLIP-S 0.947; PSNR 18.90; SSIM 0.843; FVD 455	not separately broken out in the excerpt beyond SS4D’s lead
DAVIS user study	Geometry Quality 4.497; Texture Quality 4.413; Motion Coherence 4.527	baselines rated lower, especially STAG4D
Inference time	2 min	DG4D 15 min; Consistent4D 1.5 hr; STAG4D 1 hr; L4GM 3.5 s
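
To put the inference times in the table on a common scale, the short calculation below converts them to seconds and forms ratios. The dictionary and function names are illustrative; the numbers are the ones listed above.

```python
# Inference times from the table, converted to seconds.
INFERENCE_SECONDS = {
    "SS4D": 2 * 60,
    "DG4D": 15 * 60,
    "Consistent4D": 1.5 * 3600,
    "STAG4D": 1 * 3600,
    "L4GM": 3.5,
}


def time_ratio(method: str, reference: str = "SS4D") -> float:
    """How many times longer `method` takes than `reference`."""
    return INFERENCE_SECONDS[method] / INFERENCE_SECONDS[reference]


for name in ("DG4D", "Consistent4D", "STAG4D"):
    print(f"{name} takes {time_ratio(name):.1f}x as long as SS4D")
print(f"SS4D takes {time_ratio('SS4D', 'L4GM'):.1f}x as long as L4GM")
```

On these figures SS4D is 7.5 to 45 times faster than the optimization-based baselines and roughly 34 times slower than L4GM.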

The ablation studies identify temporal alignment in the VAE as a particularly important component. Without temporal alignment, the reported values are PSNR 27.11, Flickering 2.99, and FVD 403.88; with temporal alignment, they become PSNR 30.58, Flickering 2.22, and FVD 157.16. A second ablation compares feature aggregation strategies: Mean aggregation gives PSNR 31.63, Average Length 6605, and Encode Speed 68.4, while Visible aggregation gives PSNR 31.07, Average Length 5261, and Encode Speed 76.3. The paper interprets this as a tradeoff in which visibility-aware aggregation slightly lowers PSNR but shortens sequence length and improves encoding speed (Li et al., 16 Dec 2025).
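The size of these effects is easier to compare as relative changes. The snippet below computes them from the reported numbers; the helper name and dictionary layout are illustrative. Because PSNR is a logarithmic quantity, its absolute difference in dB is printed alongside the percentage.

```python
from typing import Dict, Tuple


def relative_change(before: float, after: float) -> float:
    """Signed relative change, (after - before) / before."""
    return (after - before) / before


# (without, with) temporal alignment in the VAE
VAE_ABLATION: Dict[str, Tuple[float, float]] = {
    "PSNR": (27.11, 30.58),
    "Flickering": (2.99, 2.22),
    "FVD": (403.88, 157.16),
}
# (Mean, Visible) feature aggregation
AGGREGATION: Dict[str, Tuple[float, float]] = {
    "PSNR": (31.63, 31.07),
    "Average Length": (6605.0, 5261.0),
    "Encode Speed": (68.4, 76.3),
}

for title, table in (("temporal alignment", VAE_ABLATION), ("visible aggregation", AGGREGATION)):
    for metric, (before, after) in table.items():
        pct = 100.0 * relative_change(before, after)
        print(f"{title:20s} {metric:15s} {after - before:+9.2f} ({pct:+.1f}%)")
```

Temporal alignment cuts FVD by roughly 61% and the flickering score by about a quarter, while visibility-aware aggregation shortens the average sequence by about 20% at a PSNR cost of 0.56 dB.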

6. Relation to adjacent 4D paradigms and stated limitations

SS4D belongs to a broader 4D-generation landscape that includes text-to-4D optimization, video-to-4D reconstruction, and 4D stylization. A nearby but methodologically different example is 4D-fy, which performs text-to-4D generation by optimizing a hash-grid-based neural radiance field with a three-stage hybrid SDS schedule that alternates 3D-aware SDS, image VSD, and video SDS. Its central claim is that alternating supervision from multiple diffusion priors is necessary to balance appearance, 3D structure, and motion (Bahmani et al., 2023). Another neighboring approach is DS4D, a video-to-4D generation method that explicitly separates dynamic and static content with a Dynamic-Static Feature Decoupling (DSFD) module, fuses cross-view motion evidence with Temporal-Spatial Similarity Fusion (TSSF), and uses a Gaussian-based 4D representation with a Deformation MLP (Yang et al., 12 Feb 2025). For evaluation rather than generation, Style4D-Bench defines a benchmark for 4D stylization with measurements of spatial fidelity, temporal coherence, and multi-view consistency, and its baseline Style4D is built on 4D Gaussian Splatting with per-Gaussian style-aware MLPs (Chen et al., 26 Aug 2025).

Against this background, SS4D’s distinctive claim is that 4D generation can be performed as a native structured latent generative model rather than as an optimization wrapper around 3D or video priors. The paper states that it preserves strong spatial consistency by inheriting TRELLIS’s structured latent space, adds temporal reasoning through Temporal Layers and RoPE, and scales to longer sequences via factorized 4D compression (Li et al., 16 Dec 2025).

The paper also states several limitations. The inherited two-stage pipeline from TRELLIS makes training less efficient than a fully end-to-end design. Because the model is trained mainly on synthetic data, it can produce overly simplified textures on real inputs and does not yet achieve full photorealism. It struggles with transparent or multi-layer objects because it retains only outermost voxels and discards internal structure. It also has difficulty with high-frequency details, which can still flicker, and can fail under rapid motion or heavy motion blur. Within the paper’s own framing, these constraints mark the boundary of current native 4D generative modeling rather than a rejection of the approach itself (Li et al., 16 Dec 2025).