Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Published 14 May 2026 in cs.CV and cs.AI | (2605.14984v1)

Abstract: Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.


Authors (9)

Summary

The paper proposes a geometry-first framework that uses gravity-based density loss, monocular depth priors, and spatial tokens to improve street-level 3D scene reconstruction.
The method reduces DSM RMSE from 6.76m to 5.20m and FID from ~40 to 19.2, demonstrating significant geometric and photorealistic improvements.
The approach enables applications in unsupervised DSM estimation, digital twin creation, and multi-view video synthesis from single satellite imagery.

Authoritative Summary of "Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image" (2605.14984)

Overview and Motivation

Sat3DGen addresses the challenge of generating detailed, semantically rich street-level 3D urban scenes from a single satellite image. Prior methods fall into two categories. Geometry-colorization pipelines focus on building-centric extrusions and offer limited semantic coverage. Proxy-based feed-forward methods jointly learn geometry and appearance but routinely suffer from geometric artifacts, instability, and semantic inconsistencies, largely because cross-view supervision between satellite and street-level imagery is sparse and misaligned. Sat3DGen takes a geometry-first approach within the proxy-based framework, integrating targeted geometric priors and new supervision strategies to overcome these limitations.

Methodology

Sat3DGen is a feed-forward image-to-3D framework. A frozen ViT-based DINO-v3 encoder tokenizes the input overhead image, and the resulting tokens are decoded into a high-resolution tri-plane NeRF latent. The model introduces several novel components:

Gravity-Based Density Variation Loss: Biases volumetric density in line with gravity, encouraging denser aggregations near the ground and progressively sparser density with altitude. This mitigates floating artifacts and structurally implausible reconstructions.
Monocular Relative-Depth Prior: Pseudo labels from pretrained monocular depth models regularize depth prediction in the satellite view, resolving ambiguities in rooftop geometry and compensating for elevations that are underconstrained because of viewpoint sparsity.
Spatial Tokens: The canonical token grid is augmented with peripheral, learnable spatial tokens, which extends the effective field of view beyond the strict crop of the satellite image. This stabilizes peripheral geometry and prevents boundary artifacts caused by cross-view footprint mismatch.
Perspective View Training: The model is supervised jointly with direct panorama renderings and sampled perspective crops, which broadens the set of effective supervisory viewpoints and improves photometric consistency.
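To make the gravity-based density variation loss from the list above more concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which this summary does not give. It assumes a dense density grid with the vertical axis first and ground level at index 0, and it penalizes any increase in density with height. The function name `gravity_density_loss` and the grid layout are illustrative assumptions.

```python
import numpy as np

def gravity_density_loss(density: np.ndarray) -> float:
    """Penalize density that grows with height.

    density: array of shape (Z, Y, X); index 0 along axis 0 is assumed
    to be ground level (a hypothetical layout, not taken from the paper).
    """
    upward_change = np.diff(density, axis=0)      # sigma[z+1] - sigma[z]
    violations = np.maximum(upward_change, 0.0)   # keep only increases with height
    return float(violations.mean())

# A column that is solid near the ground and empty above incurs no penalty.
grounded = np.array([5.0, 5.0, 0.0, 0.0]).reshape(4, 1, 1)
# A blob floating above empty space is penalized.
floating = np.array([0.0, 0.0, 5.0, 0.0]).reshape(4, 1, 1)

loss_grounded = gravity_density_loss(grounded)
loss_floating = gravity_density_loss(floating)
```

A hard monotonicity penalty like this would also discourage legitimate overhangs such as tree canopies or bridges, so a practical version would presumably be weighted softly against the rendering losses.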

The model also supports conditioning on a global illumination code for realistic rendering under diverse lighting, and it generates panoramic skies using a dedicated, directionally consistent spherical feature branch.
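The monocular relative-depth prior listed among the components can only constrain depth up to an unknown scale and offset, because relative-depth models do not predict metric values. One common way to use such pseudo labels is a scale-and-shift-invariant loss, sketched below. This is a generic technique and an assumption about how the prior might be applied, not the paper's stated loss. The function name, the least-squares alignment, and the choice to align in depth space rather than inverse-depth space are all illustrative.

```python
import numpy as np

def scale_shift_invariant_depth_loss(rendered, pseudo, mask=None):
    """Mean absolute error after aligning pseudo depth to rendered depth.

    rendered: depth rendered from the 3D representation, shape (H, W)
    pseudo:   relative depth from a pretrained monocular model, shape (H, W)
    mask:     optional boolean array marking valid pixels
    """
    r = np.asarray(rendered, dtype=float).ravel()
    p = np.asarray(pseudo, dtype=float).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        r, p = r[keep], p[keep]
    # Solve min over (s, t) of || s * p + t - r ||^2 in closed form.
    design = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, r, rcond=None)
    aligned = scale * p + shift
    return float(np.mean(np.abs(aligned - r)))

pseudo = np.array([[0.1, 0.4], [0.7, 1.0]])
consistent = 3.0 * pseudo + 2.0                  # same shape up to scale and shift
distorted = np.array([[2.3, 9.0], [4.1, 5.0]])   # no affine relation to pseudo

loss_consistent = scale_shift_invariant_depth_loss(consistent, pseudo)
loss_distorted = scale_shift_invariant_depth_loss(distorted, pseudo)
```

The loss vanishes whenever the rendered depth is an affine function of the pseudo label, so it constrains the shape of rooftops and ground without asserting absolute heights.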

Quantitative and Qualitative Evaluation

Sat3DGen is evaluated on both geometric and photorealistic fidelity:

On the VIGOR-OOD Seattle evaluation, it reduces geometric RMSE of reconstructed Digital Surface Models (DSM) from 6.76m (Sat2Density++) to 5.20m, and MAE from 4.72m to 3.47m. The fraction of surface points with <2.5m error increases from 49.7% to 62.7%.
For photorealism, as measured by Fréchet Inception Distance (FID), Sat3DGen outperforms all baselines, achieving FID 19.2 (vs. ~40 for Sat2Density++) and KID 0.014. These results are obtained without any dedicated image-quality modules; the improvements are attributed to enhanced geometric modeling.
Semantic faithfulness, as measured by DINO-based feature similarity, also shows superior alignment compared to canonical models and prior works.
Qualitatively, reconstructions show smoother ground planes, coherent peripheral geometry, planar roofs with correct inclination, connected facades, and detailed non-building urban features (e.g., crosswalks, vegetated medians, curbs). These address prior deficiencies where non-building elements were absent or distorted.
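The three geometric figures reported above (RMSE, MAE, and the fraction of points with error below 2.5m) can be computed from a predicted and a reference DSM as in the following sketch. The function name `dsm_metrics` and the handling of missing values are assumptions. Any height alignment applied between prediction and ground truth before scoring is not specified in this summary and is left out.

```python
import numpy as np

def dsm_metrics(pred, gt, threshold_m=2.5):
    """Compare two height maps (in meters) over pixels where both are finite."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.isfinite(pred) & np.isfinite(gt)   # skip no-data pixels (NaN/inf)
    err = np.abs(pred[valid] - gt[valid])
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(err)),
        "frac_within": float(np.mean(err < threshold_m)),
    }

pred = np.array([[10.0, 0.0], [5.0, np.nan]])
gt = np.array([[13.0, 0.0], [1.0, 2.0]])
metrics = dsm_metrics(pred, gt)   # errors on the valid pixels are 3, 0, and 4 meters
```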

Ablation Study

A systematic ablation demonstrates the necessity of the introduced modules:

Removing any of Gravity-based Loss, Spatial Tokens, or Depth Prior degrades FID and raises geometric RMSE, with the gravity-based loss being most critical for photorealism and the depth prior and spatial tokens crucial for geometric accuracy.
Perspective training yields further gains in both FID (to 19.2) and RMSE (to 5.20m), synergistically complementing the aforementioned priors.
Alternative regularization (e.g., Total Variation) is empirically less effective, leading to over-smoothing and weakened structural guidance.
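One reason a generic smoothness term gives weaker structural guidance is that total variation is direction-agnostic: it penalizes every density edge equally and cannot tell mass resting on the ground from mass hanging in the air. The sketch below illustrates this with a standard anisotropic TV on a small density grid. Which TV variant the ablation used is not stated in this summary, so this form, the function names, and the grid layout (vertical axis first, ground at index 0) are assumptions.

```python
import numpy as np

def total_variation_3d(density: np.ndarray) -> float:
    """Anisotropic TV: mean absolute neighbor difference, summed over all axes."""
    return float(sum(np.abs(np.diff(density, axis=a)).mean()
                     for a in range(density.ndim)))

def column_volume(profile):
    """Repeat one vertical density profile over a 2x2 footprint -> shape (Z, 2, 2)."""
    column = np.asarray(profile, dtype=float).reshape(-1, 1, 1)
    return np.tile(column, (1, 2, 2))

grounded = column_volume([5.0, 5.0, 0.0, 0.0])   # mass resting on the ground
hanging = column_volume([0.0, 0.0, 5.0, 5.0])    # same mass suspended above empty space

tv_grounded = total_variation_3d(grounded)
tv_hanging = total_variation_3d(hanging)
```

Both volumes receive the same TV penalty even though only the first is physically plausible, whereas a gravity-aware term can separate the two cases.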

Applications and Broader Implications

Sat3DGen demonstrates the versatility of its assets for:

Unsupervised single-image DSM estimation: Without explicit depth supervision, the NeRF representation supports metric reconstruction from monocular input.
Semantic map-to-3D synthesis: Semantic 2D maps (e.g., OpenStreetMap) are first converted into satellite-style images by conditional image generation and then lifted to 3D by Sat3DGen, which is useful for digital twin creation and spatial planning.
Large-scale mesh generation: Through tiled inference, Sat3DGen generates seamless large-area digital meshes from high-resolution satellite mosaics.
Multi-camera, viewpoint-consistent video synthesis: The NeRF backbone enables surround-view and arbitrary trajectory video rendering from a single input, supporting simulation, virtual tourism, and AR/VR content pipelines.
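For the DSM estimation application, one standard way to read a height map out of a density field is to volume-render the expected surface height along vertical rays cast from above. The sketch below does this on a dense voxel grid. It is an illustration under stated assumptions, not the paper's procedure: the function name, the grid layout (vertical axis first, ground at index 0), and the metric voxel height are all hypothetical.

```python
import numpy as np

def dsm_from_density(density: np.ndarray, voxel_height_m: float) -> np.ndarray:
    """Render a top-down height map from a density grid of shape (Z, Y, X)."""
    top_down = density[::-1]                            # first slice = highest voxels
    alpha = 1.0 - np.exp(-top_down * voxel_height_m)    # opacity of each voxel
    # Transmittance reaching each voxel: product of (1 - alpha) over voxels above it.
    passed = np.cumprod(1.0 - alpha, axis=0)
    trans = np.concatenate([np.ones_like(passed[:1]), passed[:-1]], axis=0)
    weights = trans * alpha                             # standard volume-rendering weights
    z_count = density.shape[0]
    # Height of each voxel's top face, ordered from the top slice downward.
    heights = (np.arange(z_count)[::-1] + 1.0) * voxel_height_m
    # Rays that hit nothing keep height 0, i.e. the assumed ground plane.
    return np.tensordot(heights, weights, axes=(0, 0))

density = np.zeros((8, 2, 2))
density[:3, 0, 0] = 50.0          # an opaque block, three voxels tall, in one column
dsm = dsm_from_density(density, voxel_height_m=2.0)
```

With 2m voxels, the occupied column renders to a height of about 6m and the empty columns to 0m. Whether the result is truly metric depends on the scene scale being known, which the sketch simply assumes.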

From a theoretical viewpoint, Sat3DGen substantiates that scene-level 3D structure with high geometric and semantic fidelity can be learned from cross-view urban imagery, provided sufficiently strong geometric priors and diversified supervisory signals are available—even when explicit 3D or dense metric data are unavailable. This relaxes the data constraints for scalable, automated urban modeling.

Limitations and Future Directions

Scene pose alignment is limited by imperfect real-world camera extrinsics.
The model's assumptions (e.g., locally flat ground, architectural distributions) can result in degraded reconstructions in atypical landscapes or landmarks absent from the training data.
Evaluation is bounded by available benchmarks (VIGOR), and generalization to non-urban or hilly areas remains open. Integration with multi-modal sensing (e.g., SAR/DEM fusion) and proactive pose estimation could close these gaps.
The dataset's coverage and supervision density fundamentally constrain the model's capacity to resolve ambiguous vertical structures; future work may explore self-supervised or generative augmentation schemes.

Conclusion

Sat3DGen systematically advances state-of-the-art street-level 3D scene reconstruction from satellite imagery, delivering both substantial numerical improvements (notably, RMSE and FID reductions) and tangible qualitative gains in semantic coverage and geometric plausibility. The model's architecture, loss formulation, and supervision strategy collectively demonstrate that geometry-centric priors are essential to mitigating the limitations imposed by extreme viewpoint gaps and sparse cross-view supervision. Sat3DGen provides a robust foundation for scalable, automated 3D urban modeling and supports a range of practical downstream applications while highlighting opportunities for further research in data alignment, generalization, and cross-modal scene synthesis.
