Lyra: Generative 3D Scene Reconstruction

Updated 25 September 2025
  • The paper introduces a self-distillation pipeline that trains a 3DGS decoder using dense multi-view supervision from a video diffusion model.
  • The method achieves state-of-the-art results, with PSNR values above 21 and improved LPIPS scores relative to prior work, indicating high-quality 3D scene generation.
  • The approach enables fast, explicit 3D scene reconstruction for both static and dynamic environments, supporting applications in VR, robotics, and simulation.

Lyra is a self-distillation framework for generative 3D scene reconstruction that distills the implicit 3D knowledge from a camera-controlled video diffusion model into an explicit, efficient 3D Gaussian Splatting (3DGS) representation. Distinct from pipelines relying on captured multi-view real-world imagery, Lyra enables 3D scene synthesis—both static and dynamic—starting from only a text prompt or a single image. The framework is structured around training a 3DGS decoder (student) using dense, multi-view supervision generated by the RGB decoder of a pre-trained video diffusion model (teacher). This approach forgoes the requirement for real multi-view datasets, leveraging the diversity and geometric cues available via high-capacity video generative models.

1. Self-Distillation Pipeline and 3DGS Representation

Lyra’s core innovation is to bridge 2D video generation with 3D scene reconstruction through a cross-modal self-distillation process. The pipeline begins with a camera-conditioned video diffusion model that generates multi-view RGB videos from text prompts or single images along sampled camera trajectories. The RGB decoder (the teacher) produces the target 2D outputs; in parallel, a 3DGS decoder (the student) processes multi-view video latents and outputs explicit 3D Gaussian splat parameters.
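
The teacher-student data flow can be summarized in a minimal sketch, shown below; the module and function names (`video_diffusion`, `rgb_decoder`, `gs_decoder`, `render_3dgs`, `compute_loss`) are illustrative placeholders rather than the paper's actual API.

```python
# Minimal sketch of one self-distillation training step (illustrative only).
# `video_diffusion`, `rgb_decoder`, `gs_decoder`, `render_3dgs`, and
# `compute_loss` are hypothetical stand-ins for the modules described above.
import torch


def distillation_step(prompt, cameras, video_diffusion, rgb_decoder,
                      gs_decoder, render_3dgs, compute_loss, optimizer):
    with torch.no_grad():
        # Teacher: generate multi-view video latents along the sampled camera
        # trajectory, then decode them to RGB frames used as dense 2D supervision.
        latents = video_diffusion.sample(prompt, cameras)
        targets = rgb_decoder(latents)

    # Student: predict explicit 3D Gaussian parameters from the same latents
    # and rasterize them from the same camera poses.
    gaussians = gs_decoder(latents, cameras)
    renders = render_3dgs(gaussians, cameras)

    loss = compute_loss(renders, targets)  # loss terms from Section 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```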

The 3DGS representation models the scene as a collection of anisotropic Gaussians, each parameterized by position (x, y, z), scale (sₓ, s_y, s_z), rotation (q_w, q_x, q_y, q_z), opacity (α), and color (r, g, b). The 3DGS decoder is applied as a transposed 3D convolution over latent tokens, resulting in a scene-encoded volume of per-pixel Gaussian parameters. This explicit structure is highly advantageous for real-time rendering, supporting direct rasterization and efficient hardware-accelerated inference.
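
As an illustration of this per-pixel output, the sketch below splits a 14-channel feature tensor into Gaussian attributes; the channel ordering and activations (exp, sigmoid, quaternion normalization) are assumptions common in 3DGS pipelines, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F


def split_gaussian_params(feats: torch.Tensor):
    """Split a (..., 14) tensor of per-pixel features into 3DGS attributes.

    The channel layout (position, scale, rotation, opacity, color) is assumed
    here for illustration; the paper's actual ordering may differ.
    """
    xyz      = feats[..., 0:3]                       # position (x, y, z)
    scale    = torch.exp(feats[..., 3:6])            # positive scales (s_x, s_y, s_z)
    rotation = F.normalize(feats[..., 6:10], dim=-1) # unit quaternion (q_w, q_x, q_y, q_z)
    opacity  = torch.sigmoid(feats[..., 10:11])      # opacity α in (0, 1)
    color    = torch.sigmoid(feats[..., 11:14])      # RGB color in (0, 1)
    return xyz, scale, rotation, opacity, color
```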

2. Training Strategy and Architectural Details

Training is conducted entirely on synthetic data. For each synthesized scene, several multi-view camera trajectories are sampled using the video diffusion model (such as GEN3C), with each trajectory producing hundreds of frames. The shared latent encoding (size V × L′ × C × h × w) is "patchified" so that the channel dimension aligns with the decoder's hidden size, and is concatenated with Plücker camera embeddings encoding ray geometry.
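
Plücker embeddings encode each pixel's viewing ray as the pair (direction, origin × direction). A minimal version, assuming per-pixel ray origins and directions are already available, is shown below.

```python
import torch
import torch.nn.functional as F


def plucker_embedding(origins: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Per-pixel Plücker ray embedding (d, o x d) with shape (..., 6).

    `origins` and `directions` are (..., 3) ray origins and directions in world
    space; the exact normalization used in the paper may differ from this sketch.
    """
    d = F.normalize(directions, dim=-1)        # unit ray directions
    moment = torch.cross(origins, d, dim=-1)   # moment vector o x d
    return torch.cat([d, moment], dim=-1)
```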

The 3DGS decoder aggregates multi-view information through a hybrid block: one transformer layer (for global information propagation) followed by multiple Mamba-2 layers (for computational efficiency). The final decoder output, G ∈ ℝ^{C×X×Y×Z}, is mapped to the 14 Gaussian parameters via a transposed 3D convolution.
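
A schematic of one such hybrid block is sketched below; layer counts, head counts, and the Mamba-2 layer itself (stubbed with `nn.Identity`) are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """One aggregation block: a transformer layer for global multi-view mixing,
    followed by several Mamba-2-style sequence layers for efficiency.

    The `nn.Identity` stubs stand in for an actual Mamba-2 implementation;
    this sketch only shows the block layout, not the paper's hyperparameters.
    """

    def __init__(self, dim: int, n_mamba: int = 3):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True, norm_first=True
        )
        self.mamba_layers = nn.ModuleList(
            [nn.Identity() for _ in range(n_mamba)]  # replace with Mamba-2 layers
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, number of multi-view latent tokens, dim)
        tokens = self.attn(tokens)        # global information propagation
        for layer in self.mamba_layers:
            tokens = layer(tokens)        # linear-time sequence mixing
        return tokens
```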

The loss function integrates:

  • ℒₘₛₑ: pixel-wise mean squared error between rendered and teacher RGB outputs,
  • ℒₗₚᵢₚₛ: LPIPS perceptual loss for high-frequency realism,
  • ℒ_depth: scale-invariant depth loss encouraging geometric consistency,
  • ℒ_opacity: L₁ penalty on opacity for effective pruning of Gaussians.

The combined loss reads:

\mathcal{L} = \lambda_{mse} \mathcal{L}_{mse} + \lambda_{lpips} \mathcal{L}_{lpips} + \lambda_{depth} \mathcal{L}_{depth} + \lambda_{opacity} \mathcal{L}_{opacity}
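
A sketch of the combined objective is given below. The loss weights, the specific scale-invariant depth formulation, and the source of the depth targets are illustrative assumptions; the `lpips` package is one standard way to compute the perceptual term.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual similarity metric (expects images in [-1, 1])

lpips_fn = lpips.LPIPS(net="vgg")


def scale_invariant_depth_loss(pred, target, eps=1e-6):
    # One common scale-invariant formulation (assumed here for illustration):
    # compare log-depths after removing their mean offset.
    diff = torch.log(pred + eps) - torch.log(target + eps)
    return (diff - diff.mean()).pow(2).mean()


def total_loss(render_rgb, teacher_rgb, render_depth, teacher_depth, opacity,
               w_mse=1.0, w_lpips=0.5, w_depth=0.1, w_opacity=0.01):
    # Weights and the depth-target source are illustrative, not from the paper.
    l_mse = F.mse_loss(render_rgb, teacher_rgb)
    l_lpips = lpips_fn(render_rgb, teacher_rgb).mean()
    l_depth = scale_invariant_depth_loss(render_depth, teacher_depth)
    l_opacity = opacity.abs().mean()  # L1 penalty that encourages pruning
    return (w_mse * l_mse + w_lpips * l_lpips
            + w_depth * l_depth + w_opacity * l_opacity)
```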

By propagating dense teacher supervision across synthetic multi-view scenes, Lyra’s 3DGS decoder overcomes view sparsity and captures detailed geometric and appearance cues.

3. Inference Workflow and Rendering

At inference, Lyra is purely feedforward: a text prompt or a single image is used to sample a camera trajectory, and the pre-trained video diffusion model generates the corresponding latent volume. The learned 3DGS decoder then synthesizes the full set of explicit 3D Gaussians, forming a complete scene representation. Rendering from arbitrary viewpoints is achieved by compositing colors and opacities along rays using the 3DGS compositing equation:

C(p) = \sum_{i} c_i \alpha'_i \prod_{j=1}^{i-1} (1 - \alpha'_j)

where α′ᵢ is the projected opacity of Gaussian i at pixel p, computed as:

\alpha'_i = \alpha_i \cdot \exp\left(-\frac{1}{2}(p - \mu'_i)^\top (\Sigma'_i)^{-1} (p - \mu'_i)\right)

for projected centroid μ′ᵢ and projected covariance Σ′ᵢ.
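
The compositing equation translates directly into code. The sketch below evaluates a single pixel, assuming the Gaussians have already been projected to 2D and sorted front-to-back, as a real rasterizer would do per tile.

```python
import numpy as np


def composite_pixel(p, colors, alphas, means_2d, covs_2d):
    """Alpha-composite sorted Gaussians at pixel p (front-to-back).

    colors: (N, 3), alphas: (N,), means_2d: (N, 2), covs_2d: (N, 2, 2).
    Gaussians are assumed already projected to the image plane and sorted by depth.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c_i, a_i, mu_i, cov_i in zip(colors, alphas, means_2d, covs_2d):
        d = p - mu_i
        # Projected opacity: alpha_i * exp(-0.5 * d^T Sigma'^{-1} d)
        a_proj = a_i * np.exp(-0.5 * d @ np.linalg.inv(cov_i) @ d)
        out += c_i * a_proj * transmittance
        transmittance *= (1.0 - a_proj)
    return out
```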

This design provides explicit geometric control, low-latency rendering, and scalability to interactive and simulation environments.

4. Dynamic 3D Scene Generation (4D)

Lyra extends its framework to dynamic scene reconstruction from monocular videos by incorporating temporal conditioning in the 3DGS decoder. Source and target timecodes (raw and sinusoidal embeddings) are concatenated with the spatial latents, then re-encoded by the RGB encoder. Dynamic data augmentation—reversing video time and providing bidirectional supervision—ensures temporal consistency and robust learning of 4D spatiotemporal structure.
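
One possible form of this time conditioning is sketched below: a sinusoidal embedding of the source and target timecodes is broadcast over the spatial dimensions and concatenated with the latent channels. The embedding dimension and frequency schedule are assumptions, not values from the paper, and the raw timecode can be concatenated alongside the sinusoidal features.

```python
import torch


def sinusoidal_time_embedding(t: torch.Tensor, dim: int = 32) -> torch.Tensor:
    """Sinusoidal embedding of scalar timecodes, mapping shape (B,) to (B, dim).

    A standard frequency schedule is assumed here; the paper may use a
    different one and additionally concatenates the raw timecode.
    """
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / max(half - 1, 1)))
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


def add_time_conditioning(latents: torch.Tensor, t_src: torch.Tensor,
                          t_tgt: torch.Tensor) -> torch.Tensor:
    # latents: (B, C, H, W); broadcast source/target time embeddings over the
    # spatial dimensions and concatenate along the channel axis before re-encoding.
    emb = torch.cat([sinusoidal_time_embedding(t_src),
                     sinusoidal_time_embedding(t_tgt)], dim=-1)  # (B, 2*dim)
    emb = emb[:, :, None, None].expand(-1, -1, latents.shape[2], latents.shape[3])
    return torch.cat([latents, emb], dim=1)
```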

During training, the loss is applied at each timestep, and the augmented architecture produces temporally coherent explicit 3D scenes suitable for dynamic rendering, robotics simulation, or virtual reality content.

5. Quantitative and Qualitative Results

Lyra achieves state-of-the-art metrics for static and dynamic 3D scene generation, as measured on standardized benchmarks for real-to-3D generative quality. Reported PSNR values exceed 21, with LPIPS scores lower than prior works (e.g., Wonderland, BTimer, ZeroNVS, Bolt3D). For dynamic scenes, combined PSNR, SSIM, and perceptual scores also demonstrate significant improvements over baselines where existing static reconstructions are coupled with video diffusion models in a non-distilled, uncoordinated fashion.

Qualitative renderings exhibit notably fewer artifacts, greater view consistency, and superior geometry recovery than alternative camera-conditioned methods, particularly in occluded regions and under free-form camera motion.

6. Applications and System Integration

Lyra’s explicit, real-time 3DGS output is directly compatible with graphics simulation platforms. The framework’s outputs have been demonstrated in NVIDIA Isaac Sim, supporting robotics navigation scenarios where scene consistency and interaction are essential. Potential applications extend to:

  • Game development (rapid virtual environment synthesis)
  • Robotics and autonomous driving simulation (navigation, mapping)
  • VR/AR content production
  • Synthetic data generation for downstream perception models

This workflow addresses scalability for training and deployment by removing the need for labor-intensive real-world multi-view capture and manual scene curation.

7. Limitations and Future Prospects

A notable limitation of Lyra is the reliance on the generative capacity of the video diffusion model for scene diversity and realism; generalization to highly out-of-distribution domains may require advances in base model architectures. The compressed latent space training regime, while efficient, may bottleneck reconstruction detail compared to future higher-capacity models.

Prospective research may pursue:

  • Integration of physical constraints or sensor representations for robotics-centric applications
  • Expansion to higher-resolution or larger-scale scenes via hierarchical or multi-resolution 3DGS decoders
  • Joint training with conditional text/image/video diffusion backbones for improved control and cross-modal editability
  • Explicit control of scene objects, animations, and semantics via disentangled Gaussian splats parameterization

Summary Table: Lyra System Modules

| Component | Input | Output |
|-----------|-------|--------|
| Video diffusion model | Text/image, camera trajectory | RGB video latents |
| RGB decoder | Video latents | Target images (teacher output) |
| 3DGS decoder | Video latents, camera embeddings | 14D per-pixel Gaussian features |
| Rendering engine | Gaussian features | Real-time RGB novel views |

In summary, Lyra demonstrates an end-to-end generative 3D reconstruction system: by distilling implicit 3D cues from a camera-conditioned video diffusion model into an explicit, efficient, and geometrically consistent 3DGS representation, it advances the state of virtual environment synthesis and enables scalable, real-time 3D scene generation from minimal supervision sources (Bahmani et al., 23 Sep 2025).
