4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time (2506.18890v1)

Published 23 Jun 2025 in cs.CV

Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass with less than 1.5 seconds on a single A100 GPU.

Summary

  • The paper presents 4D-LRM, a Transformer-based model that leverages unified 4D Gaussian Splatting to reconstruct dynamic geometry and appearance from sparse inputs.
  • The methodology achieves state-of-the-art performance with PSNR values over 30 and reconstructs a 24-frame sequence in under 1.5 seconds on an A100 GPU.
  • The results demonstrate robust generalization to novel objects and camera setups, enabling smooth temporal interpolation and high-fidelity 4D asset generation.

4D-LRM: A Large-Scale Transformer for Generalized 4D Space-Time Reconstruction

The paper introduces 4D-LRM, a large-scale, Transformer-based model for 4D (space-time) reconstruction from sparse, unconstrained multi-view and multi-temporal image inputs. The model is designed to predict dynamic object geometry and appearance at arbitrary viewpoints and timestamps, addressing the limitations of prior optimization-based, geometry-based, and generative approaches in terms of efficiency, generalization, and faithfulness.

Model Architecture and Representation

4D-LRM leverages a unified 4D Gaussian Splatting (4DGS) representation, parameterizing dynamic objects as clouds of anisotropic 4D Gaussians. Each Gaussian encodes spatial and temporal means, covariance, color, and opacity, allowing for continuous modeling of both geometry and appearance over time. The model processes input images—each with associated camera pose and timestamp—by patchifying and augmenting them with Plücker ray coordinates and temporal information. These are concatenated and projected into tokens, which are then processed by a deep Transformer.
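
As a rough illustration of this input tokenization, the sketch below concatenates per-pixel RGB, Plücker ray coordinates, and a broadcast timestamp channel, then splits the result into non-overlapping patches that a linear layer would project into Transformer tokens. The patch size, channel layout, and function names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def plucker_rays(origins, directions):
    """Per-pixel Plücker coordinates (d, o x d); inputs are [H, W, 3]."""
    d = F.normalize(directions, dim=-1)
    m = torch.cross(origins, d, dim=-1)          # moment vector o x d
    return torch.cat([d, m], dim=-1)             # [H, W, 6]

def tokenize_view(image, origins, directions, t, patch=8):
    """image: [3, H, W]; origins/directions: [H, W, 3]; t: scalar time in [0, 1]."""
    _, H, W = image.shape
    rays = plucker_rays(origins, directions).permute(2, 0, 1)    # [6, H, W]
    time = torch.full((1, H, W), float(t))                       # broadcast timestamp
    feat = torch.cat([image, rays, time], dim=0)                 # [10, H, W]
    # Split into non-overlapping patch x patch tokens and flatten per token.
    tok = feat.unfold(1, patch, patch).unfold(2, patch, patch)   # [10, H/p, W/p, p, p]
    tok = tok.permute(1, 2, 0, 3, 4).reshape(-1, 10 * patch * patch)
    return tok                                                   # [(H/p)*(W/p), 10*p*p]
```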

The Transformer outputs per-pixel 4D Gaussian parameters, which are subsequently rendered using a differentiable rasterization pipeline. The model optionally supports a set of learnable "free" Gaussian tokens, enabling generative flexibility in scenarios with extremely sparse or monocular inputs.
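
A minimal sketch of such a prediction head is shown below: each output token is unpatchified back to pixels, and every pixel yields one 4D Gaussian. The channel split (spatial and temporal means, 4D scales, rotation parameters, color, opacity) mirrors the 4DGS parameterization described above, but the exact counts and activations here are assumptions.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Illustrative per-pixel 4D Gaussian head; channel counts and activations
    are assumptions, and the 4D rotation parameterization follows 4DGS in spirit."""
    def __init__(self, dim, patch=8, chans=3 + 1 + 4 + 8 + 3 + 1):
        super().__init__()
        self.patch, self.chans = patch, chans
        self.proj = nn.Linear(dim, patch * patch * chans)   # one Gaussian per pixel

    def forward(self, tokens):                               # tokens: [N, dim]
        g = self.proj(tokens).reshape(-1, self.chans)        # [N * p * p, chans]
        xyz_t  = g[:, :4]                                    # spatial + temporal mean
        scales = torch.exp(g[:, 4:8])                        # positive 4D scales
        rots   = g[:, 8:16]                                  # 4D rotation parameters
        rgb    = torch.sigmoid(g[:, 16:19])                  # color in [0, 1]
        alpha  = torch.sigmoid(g[:, 19:20])                  # opacity
        return xyz_t, scales, rots, rgb, alpha
```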

Training and Data

4D-LRM is trained on a large-scale, curated dataset derived from Objaverse4D, containing 32,000 animated objects and 783,000 static objects (the latter augmented with synthetic motion). The training employs curriculum learning, starting at low resolution and scaling up, and utilizes a combination of MSE and perceptual losses. The model is pretrained without free Gaussians and can be fine-tuned for 4D asset generation tasks with additional generative tokens.
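
The loss combination could look roughly like the following sketch, which pairs an MSE term on rendered frames with an LPIPS perceptual term; the perceptual backend and the weighting factor are assumptions rather than values reported in the paper.

```python
import torch
import lpips   # perceptual metric; backend and weight below are assumptions

perc = lpips.LPIPS(net='vgg')

def reconstruction_loss(pred, target, lam=0.5):
    """pred, target: rendered vs. ground-truth frames in [0, 1], shape [B, 3, H, W]."""
    mse = torch.mean((pred - target) ** 2)
    p = perc(pred * 2 - 1, target * 2 - 1).mean()   # LPIPS expects inputs in [-1, 1]
    return mse + lam * p
```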

Experimental Results

Quantitative Performance:

4D-LRM demonstrates strong numerical results across multiple benchmarks and camera configurations. On the Consistent4D and Objaverse4D test sets, the model achieves PSNR values exceeding 30 in favorable settings (e.g., alternating canonical views) and maintains robust performance under challenging conditions such as frame interpolation, rotating cameras, and random input views. Notably, 4D-LRM outperforms per-frame 3D reconstruction baselines (e.g., GS-LRM) even with only one input view per frame, highlighting its ability to share information across both space and time.

Efficiency:

The model reconstructs a 24-frame dynamic object in a single forward pass, taking under 1.5 seconds on one A100 GPU, a significant improvement over optimization-based and diffusion-based generative methods.

Generalization and Interpolation:

4D-LRM generalizes to novel objects and camera setups, and can interpolate missing frames by reallocating Gaussians in the temporal domain. Analysis of the predicted temporal means and variances shows that the model increases the temporal support of Gaussians when input timestamps are missing, facilitating smooth interpolation.
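
Conceptually, each 4D Gaussian contributes to a rendered frame at query time t in proportion to a 1D Gaussian in time, so predicting a larger temporal variance widens a primitive's support and lets it cover timestamps that were never observed. The sketch below illustrates that weighting under a simplifying assumption of no space-time coupling.

```python
import torch

def temporal_weight(t, mu_t, sigma_t):
    """Time modulation of a 4D Gaussian at query time t: a 1D Gaussian in time.
    A larger sigma_t spreads the primitive's contribution over a wider window."""
    return torch.exp(-0.5 * ((t - mu_t) / sigma_t) ** 2)

def effective_opacity(alpha, t, mu_t, sigma_t):
    # opacity used for rendering at time t; the factorized form assumes no
    # space-time coupling, purely for illustration
    return alpha * temporal_weight(t, mu_t, sigma_t)
```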

Ablations and Scaling:

Increasing model size and initializing from pretrained GS-LRM weights both improve performance and convergence. The model exhibits favorable scaling with more input views up to a point, after which performance plateaus or slightly degrades due to Transformer sequence length limitations and representation crowding.

4D Generation:

When fine-tuned with free Gaussians and paired with a diffusion model (e.g., SV3D), 4D-LRM surpasses existing 4D generation baselines in both faithfulness and inference speed, as measured by LPIPS, FVD, and CLIP scores.

Implementation Considerations

  • Computational Requirements: Training at 256×256 resolution requires substantial resources (160 A100 GPUs, multi-day runs). Inference is efficient, but scaling to higher resolutions or longer sequences remains computationally intensive.
  • Differentiable Rendering: The rasterization pipeline is adapted from 3DGS/4DGS, with deferred backpropagation and filtering strategies to maintain efficiency and stability (see the sketch following this list).
  • Data Curation: High-quality 4D data is essential; the authors employ extensive filtering and augmentation to ensure motion consistency and diversity.
  • Failure Modes: The model struggles with highly non-linear motion trajectories, self-occlusion, and abrupt deformations, often resulting in temporal ghosting or residual artifacts. These are attributed to the limitations of the Gaussian representation and under-training.
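
To make the deferred-backpropagation idea concrete, the following is a hedged, generic sketch of the pattern (render the full frame without autograd, cache the image-space gradient, then re-render and backpropagate patch by patch so only one patch's activations are live at a time); the render_fn interface and patch size are assumptions, not the paper's actual pipeline.

```python
import torch

def deferred_backward(render_fn, target, loss_fn, patch=64):
    """render_fn(y0, y1, x0, x1) -> differentiable image crop is an assumed interface."""
    H, W = target.shape[-2:]
    with torch.no_grad():
        full = render_fn(0, H, 0, W)                 # full frame, no graph kept
    full = full.detach().requires_grad_(True)
    loss = loss_fn(full, target)
    loss.backward()                                  # gradient w.r.t. rendered pixels
    grad = full.grad
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            y1, x1 = min(y + patch, H), min(x + patch, W)
            crop = render_fn(y, y1, x, x1)           # re-render with autograd
            crop.backward(grad[..., y:y1, x:x1])     # push cached gradient into params
    return loss.detach()
```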

Implications and Future Directions

Practical Impact:

4D-LRM enables efficient, high-fidelity reconstruction of dynamic objects from sparse, unconstrained inputs, with direct applications in 4D asset generation for games, film, AR/VR, and robotics. Its ability to interpolate missing frames and generalize to novel objects makes it suitable for real-world scenarios with incomplete or irregular data.

Theoretical Contributions:

The unified 4DGS representation and Transformer-based architecture demonstrate that large-scale spatiotemporal pretraining can yield generalizable, efficient 4D models. The model's design supports continuous time and view synthesis, moving beyond the limitations of discrete or per-frame approaches.

Limitations and Open Problems:

  • Long-Context Handling: The current architecture is limited by Transformer sequence length, restricting the number of input views and maximum resolution. Future work should explore hybrid models or memory-efficient architectures for long-context processing.
  • Removing 3D Inductive Bias: 4D-LRM relies on posed images and explicit 4DGS representations. Extending to unposed or in-the-wild data, or adopting architectures with minimal 3D inductive bias, remains an open challenge.
  • Scene-Level Reconstruction: The model is currently object-centric; scaling to full scenes with occlusions and unobserved regions requires new datasets and architectural advances.
  • Representation Expressiveness: The Gaussian kernel's ellipsoidal support is suboptimal for non-linear or branched motion trajectories. More expressive or compositional representations may be needed for complex dynamics.

Speculation on Future Developments

The trajectory of 4D-LRM suggests several promising research directions:

  • High-Resolution and Long-Sequence 4D Models: Advances in memory-efficient Transformers and hybrid architectures could enable processing of hundreds of high-resolution frames and views.
  • Self-Supervised and Unposed 4D Reconstruction: Removing the reliance on posed images would broaden applicability to real-world, unconstrained video data.
  • Integration with Generative Models: Combining 4D-LRM with powerful generative priors (e.g., video diffusion models) could further improve faithfulness and diversity in 4D asset generation.
  • Scene-Level and Embodied AI Applications: Extending 4D-LRM to scene-level modeling would facilitate applications in robotics, world modeling, and embodied AI, where understanding dynamic environments is critical.

In summary, 4D-LRM represents a significant step toward scalable, generalizable, and efficient 4D reconstruction, with strong empirical results and a clear path for future research in both foundational modeling and practical deployment.
