
Matrix-Pano: Omnidirectional 3D World Generation

Updated 13 August 2025
  • Matrix-Pano is a large-scale synthetic panoramic dataset providing complete 360° coverage with ground-truth geometry and rich multimodal annotations.
  • It supports trajectory-guided panoramic video diffusion through mesh-based conditioning, and 3D world reconstruction through dual (optimization-based and feed-forward) pipelines.
  • The dataset boosts 360° video synthesis research by offering high-resolution frames, precise camera poses, depth maps, and text annotations for robust multimodal learning.

The Matrix-Pano dataset is a large-scale synthetic collection introduced to support state-of-the-art omnidirectional explorable 3D world generation, particularly within the Matrix-3D framework (Yang et al., 11 Aug 2025). It comprises 116,000 high-fidelity panoramic video sequences, each annotated with precise camera trajectory information, depth maps, and multimodal text. This dataset fills a critical gap in panoramic video research by coupling exhaustive viewpoint coverage with ground-truth geometry, enabling both conditional panoramic video generation and 3D reconstruction. Matrix-Pano’s comprehensive annotations and panoramic scope make it foundational for training and assessing models in 360° video synthesis and wide-coverage 3D world reconstruction.

1. Dataset Composition

Matrix-Pano consists of 116,000 synthetically rendered panoramic video sequences. Each sequence is composed of high-resolution frames (reported as 1024×2048 pixels per panorama) covering the complete 360° × 180° sphere. Every video sequence is paired with:

  • Camera trajectory annotations, describing the exact pose for each frame in the video sequence.
  • Depth maps per frame, capturing metric spatial layout of the scene.
  • Text annotations, enabling multimodal conditioning for generative models.

The data was rendered synthetically to ensure both geometric accuracy and annotation completeness. Depth is encoded per pixel, facilitating dense scene geometry recovery.

Property             | Specification                 | Annotation Type
---------------------|-------------------------------|---------------------
Number of sequences  | 116,000                       | Video/Trajectory
Frame resolution     | 1024 × 2048                   | Image/Depth
Ground-truth data    | Camera pose, depth map, text  | Geometric, Semantic

This structured combination of imagery and metadata enables robust training for tasks entailing panoramic video generation and 3D world lifting.
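
For concreteness, the sketch below shows one way a single Matrix-Pano sequence could be represented in memory for training. The field names, shapes, and validation logic are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PanoSequence:
    """One Matrix-Pano video sequence (hypothetical in-memory layout)."""
    frames: np.ndarray        # (N_f, 1024, 2048, 3) equirectangular RGB panoramas
    depths: np.ndarray        # (N_f, 1024, 2048) per-pixel metric depth maps
    cam_to_world: np.ndarray  # (N_f, 4, 4) per-frame camera-to-world pose matrices
    caption: str              # text annotation used for multimodal conditioning


def validate(seq: PanoSequence) -> None:
    """Check that imagery, depth, and trajectory annotations are consistent."""
    n_frames = seq.frames.shape[0]
    assert seq.frames.shape[1:] == (1024, 2048, 3)
    assert seq.depths.shape == (n_frames, 1024, 2048)
    assert seq.cam_to_world.shape == (n_frames, 4, 4)
```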

2. Purpose and Usage

Matrix-Pano was designed to advance the two primary components of Matrix-3D:

  • Trajectory-guided panoramic video diffusion: The dataset enables training and evaluating video diffusion models that use mesh-based trajectory conditioning.
  • 3D world reconstruction: Through its precise depth and camera trajectory annotations, Matrix-Pano supports both feed-forward and optimization-based 3D scene lifting from 2D panoramic video content.

Prior panoramic datasets have lacked full panoramic coverage, camera pose metadata, or sufficient geometric ground truth, constraining the scope and fidelity of generated scenes. By addressing these limitations, Matrix-Pano facilitates wide-coverage, geometrically consistent video synthesis and accurate scene geometry recovery for applications such as virtual tours, robotic vision, and immersive content creation.

3. Technical Details

Panoramic Video Generation

The trajectory-guided video diffusion model leverages scene mesh renderings as conditioning inputs. The synthetic pipeline proceeds as follows:

  1. From an input panorama and corresponding depth map, a polygonal mesh is constructed using world-coordinate projection of dense depth.
  2. Vertices with sharp depth discontinuities are masked as “invisible” to correctly handle occlusions.
  3. For each trajectory (of N_f frames), rendered mesh images and binary masks are produced, serving as video conditions.
  4. The video generation model—a causal 3D VAE and diffusion transformer—uses a flow-matching objective:

L(\theta) = \mathbb{E}_{z_0, z_1, c, s, t} \left[ \| u_\theta(z_t, c, s, t) - v_t \|_2^2 \right]

where z_t is a latent interpolated between z_0 (noise) and z_1 (encoded video), and v_t = z_1 - z_0 is the ground-truth velocity.
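
As an illustration of this objective, the following sketch implements a single flow-matching training loss in PyTorch. The denoiser u_theta, its argument order, the uniform timestep sampling, and the linear interpolation path are assumptions made for the example and need not match the Matrix-3D implementation.

```python
import torch


def flow_matching_loss(u_theta, z1, c, s):
    """One flow-matching training step: regress v_t = z1 - z0 from z_t.

    u_theta : denoiser taking (z_t, c, s, t) and predicting a velocity
    z1      : encoded video latents, shape (B, ...)
    c, s    : conditioning inputs (e.g., mesh renderings/masks, text embeddings)
    """
    z0 = torch.randn_like(z1)                        # noise endpoint of the path
    t = torch.rand(z1.shape[0], device=z1.device)    # uniform timesteps in [0, 1]
    t_b = t.view(-1, *([1] * (z1.dim() - 1)))        # broadcast t over latent dims
    z_t = (1.0 - t_b) * z0 + t_b * z1                # linear interpolation between endpoints
    v_t = z1 - z0                                    # ground-truth velocity
    pred = u_theta(z_t, c, s, t)                     # predicted velocity u_theta(z_t, c, s, t)
    return ((pred - v_t) ** 2).mean()                # squared L2 error, averaged over the batch
```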

3D World Reconstruction

Two distinct pipelines are supported:

  • Optimization-based: Keyframes (every five frames) are extracted, cropped into 12 perspective images, and super-resolved. A 3D Gaussian Splatting representation is then optimized against them using an L1 loss between rendered and ground-truth crops.
  • Feed-forward Transformer-based: Video latents and spherical Plücker pose embeddings are patchified, concatenated, and processed by transformer blocks. A DPT head regresses 3D Gaussian attributes—RGB color, scale, rotation (quaternions), opacity, and depth—via a two-stage process:

    • Stage 1: Predict metric depth and RGB:

      \mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{Smooth-L1}}(\hat{D}, D) + \lambda_1 \mathcal{L}_{\text{L1}}(\hat{H}, H)

    • Stage 2: Depth frozen; remaining attributes refined with MSE and LPIPS losses.
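
A minimal sketch of this two-stage supervision, assuming PyTorch tensors for predicted/ground-truth depth (D_hat, D) and RGB (H_hat, H). The loss weights and the external LPIPS callable are placeholders rather than the paper's actual settings.

```python
import torch.nn.functional as F


def stage1_loss(D_hat, D, H_hat, H, lambda_1=1.0):
    """Stage 1: Smooth-L1 on predicted metric depth plus L1 on predicted RGB."""
    return F.smooth_l1_loss(D_hat, D) + lambda_1 * F.l1_loss(H_hat, H)


def stage2_loss(render, target, lpips_fn, lambda_lpips=0.5):
    """Stage 2: depth frozen; refine the remaining Gaussian attributes
    with an MSE term plus a perceptual LPIPS term on rendered views."""
    return F.mse_loss(render, target) + lambda_lpips * lpips_fn(render, target).mean()
```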

Mesh renderings (as opposed to point clouds) are used for view guidance to reduce Moiré artifacts and improve occlusion handling.
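
To make steps 1–2 of the conditioning pipeline concrete, the sketch below unprojects an equirectangular depth map to world-space points and flags pixels near sharp depth discontinuities. The spherical parameterization and the discontinuity threshold are assumptions; the actual meshing and occlusion rules in Matrix-3D may differ.

```python
import numpy as np


def unproject_equirect_depth(depth, cam_to_world):
    """Lift an equirectangular depth map (H, W) to world-space 3D points.

    Each pixel is mapped to a ray direction on the unit sphere via its
    longitude/latitude, scaled by metric depth, and transformed by the
    camera-to-world pose (4x4 matrix).
    """
    H, W = depth.shape
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi   # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi   # latitude, +pi/2 at the top row
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),             # unit ray directions (camera frame)
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    pts_cam = dirs * depth[..., None]                        # scale rays by metric depth
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    pts_world = pts_h @ cam_to_world.T                       # row-vector points times M^T
    return pts_world[..., :3]


def occlusion_mask(depth, rel_thresh=0.1):
    """Flag pixels whose depth jumps sharply relative to a neighbor as 'invisible'."""
    dz_x = np.abs(np.diff(depth, axis=1, prepend=depth[:, :1]))
    dz_y = np.abs(np.diff(depth, axis=0, prepend=depth[:1, :]))
    return (dz_x > rel_thresh * depth) | (dz_y > rel_thresh * depth)
```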

4. Performance Metrics

Matrix-Pano underpins several quantitative evaluations:

  • Panoramic video synthesis: Metrics include FID (Fréchet Inception Distance), FVD (Fréchet Video Distance), PSNR, SSIM, and LPIPS for visual and temporal fidelity.
  • Trajectory controllability: Rotation error (R_err) and translation error (T_err) measure alignment between predicted and ground-truth camera paths, with Matrix-3D outperforming baselines (lower errors, e.g., R_err ≈ 0.03–0.04, T_err ≈ 0.03–0.04).
  • 3D reconstruction: On cropped perspective images from reconstructed worlds, PSNR, LPIPS, and SSIM are reported; the optimization-based pipeline yields superior quality (e.g., PSNR ≈ 27.6, LPIPS ≈ 0.294, SSIM ≈ 0.816), while the feed-forward approach is much faster (inference time on the order of 10 seconds versus several minutes).

These metrics establish the dataset’s relevance for both photorealism and spatial consistency.
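
For reference, the sketch below computes two of the simpler quantities involved: PSNR between image crops and a rotation/translation error between poses. It uses geodesic rotation angle and Euclidean translation distance, which may differ from the exact R_err / T_err definitions used in the paper's evaluation.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)


def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Geodesic rotation error (radians) and Euclidean translation error."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    r_err = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    t_err = np.linalg.norm(t_pred - t_gt)
    return r_err, t_err
```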

5. Innovations and Contributions

Matrix-Pano introduces a panoramic video benchmark unprecedented in its scale and annotation detail:

  • Dataset innovation: First panoramic video dataset with exhaustive trajectory, depth, and text annotation, supporting wide-coverage generative modeling and accurate 3D lifting.
  • Trajectory-guided mesh conditioning: Conditioning on mesh renderings (rather than point clouds) reduces artifacts and enhances geometric accuracy in synthesized videos.
  • Dual reconstruction pipelines: Offers high-fidelity optimization-based splatting alongside fast transformer-based depth/attribute regression across the full panorama.
  • Integrated multimodal learning: Enables joint training for visual, geometric, and text-based scene synthesis, furthering spatial intelligence research.

6. Practical Impact and Future Directions

Matrix-Pano allows exploration of new capabilities, including panoramic-guided virtual world generation, digital content creation, robotic navigation, and immersive VR applications. The combination of trajectory, pose, and depth annotations makes it suitable for deploying controllable virtual cameras and robust real-time scene understanding frameworks.

A plausible implication is that, as future research extends Matrix-Pano with dynamic content or more diverse simulated scenes, applications in autonomous driving, AR/VR, and agent-based simulation will be further enhanced—driven by more complex scene generation and richer multimodal learning.

7. Relationship to Other Panoramic Datasets

Matrix-Pano represents a significant step beyond earlier high-resolution panoramic datasets focused on object detection (Yang et al., 2018), semantic segmentation (Xu et al., 2019), and planar reconstruction (Sun et al., 2021). Unlike those, Matrix-Pano specifically addresses video-level annotation, mesh-conditioned generation, and high-resolution geometry for 3D world lifting. Its contribution is thus foundational for omnidirectional scene synthesis and reconstruction research, as well as for benchmarking future generative models in spatial intelligence.