Sat2RealCity: 3D Urban Generation from Satellite Imagery
- Sat2RealCity is a framework for synthesizing geometry-aware, city-scale 3D urban environments using satellite imagery and OSM priors.
- It integrates multi-modal features and pretrained generative models to control both urban geometry and visual appearance with high fidelity.
- The framework achieves superior performance in geometric precision and style consistency, enabling applications in urban planning and simulation.
Sat2RealCity refers to a geometry-aware and appearance-controllable 3D urban generation framework that directly synthesizes explicit, city-scale 3D content from real-world satellite imagery. This paradigm enables interpretable and visually realistic reconstructions of large urban environments without requiring massive, prohibitively expensive annotated 3D city datasets. The core innovation lies in leveraging OpenStreetMap (OSM) priors, rich multi-modal features, and pretrained building-level generative models to bridge the gap between top-down remote sensing and high-fidelity 3D city scenes (Kang et al., 14 Nov 2025).
1. Motivation and Problem Formulation
Traditional city-scale 3D generation faces two chronic bottlenecks: the lack of high-quality, large-scale annotated 3D urban assets for supervised learning, and structural ambiguity in inferring full 3D geometry from inherently limited 2D satellite or Digital Surface Model (DSM) cues. Methods relying solely on height maps or semantic segmentation yield results decoupled from real-world appearance and struggle to generalize to previously unseen styles and geographies. Furthermore, fully neural rendering approaches exhibit memory and scalability constraints when moving beyond individual blocks to multi-square-kilometer scenes (Hua et al., 6 Jul 2025, Kang et al., 14 Nov 2025).
Sat2RealCity addresses these limitations by (a) anchoring geometry reconstruction to interpretable OSM priors (footprints and heights), (b) controlling appearance via textually guided multi-view representations, and (c) harnessing pretrained knowledge from 3D object generative modeling for compositional building assembly. The result is city-scale output that is both geographically faithful and visually realistic.
2. Framework Architecture
Sat2RealCity decomposes 3D urban reconstruction into three core modules:
- OSM-based spatial priors inject interpretable geometric constraints by constructing coarse volumetric proxies for every building via extrusion of OSM footprints to registered heights.
- Appearance-guided controllable modeling fuses top-view satellite feature embeddings and synthesized frontal style references to produce cohesive façade and roof textures via cross-attention within transformer blocks.
- MLLM-powered semantic guidance clusters per-building features and uses multimodal large language models (MLLMs) to generate textual descriptions and corresponding style reference images, enabling fine-grained and contextually accurate appearance synthesis.
The system operates at the building instance level, aggregating building-wise geometry and appearance into a seamless, well-aligned 3D city scene. Key architectural steps are summarized in Algorithm 1 of the original publication (Kang et al., 14 Nov 2025):
- Extract DINOv3-based top-view embeddings and heights for each building from the satellite image.
- Cluster embeddings to group visually and geometrically similar buildings using HDBSCAN.
- For each cluster, use an MLLM (e.g., Qwen3-VL) to produce style summaries, rendered to frontal view reference images via a T2I (Text-to-Image) model.
- For each building, extrude its OSM footprint, encode geometry through a Sparse-Structure VAE, and linearly interpolate with Gaussian noise to inject diversity.
- Cross-attend top-view and frontal-style features in TRELLIS’s diffusion blocks to disentangle geometry from style and generate mesh, NeRF, or 3D Gaussian representations.
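The control flow of the steps above can be sketched as a short orchestration loop. This is a minimal skeleton under stated assumptions, not the authors' implementation: `Building`, `generate_city`, and the three callables (`describe_cluster`, `render_style`, `generate_building`) are hypothetical stand-ins for the MLLM, the text-to-image model, and the TRELLIS-based generator, and cluster labels are assumed to come from an earlier clustering step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Building:
    """Hypothetical per-building record used only for this sketch."""
    building_id: str
    footprint: list  # list of (x, y) footprint vertices
    height: float    # registered building height
    cluster: int     # label assigned by the clustering step


def generate_city(buildings, describe_cluster, render_style, generate_building):
    """Group buildings by cluster, build one style reference per cluster,
    then generate every building with its cluster's style reference."""
    groups = defaultdict(list)
    for building in buildings:
        groups[building.cluster].append(building)

    assets = {}
    for members in groups.values():
        # One textual summary and one style image per cluster (assumed stubs).
        description = describe_cluster(members)
        style_image = render_style(description)
        for building in members:
            # Geometry comes from the building itself, style from its cluster.
            assets[building.building_id] = generate_building(building, style_image)
    return assets
```

Computing the description and style image once per cluster, rather than once per building, is what lets all members of a cluster share a single style reference in this sketch.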
3. OSM-based Spatial Priors and Geometry Synthesis
OSM-based priors ensure that generated geometry both “snaps” to the real-world spatial layout and remains interpretable. For each building, the OSM footprint is extruded to its registered height to form a volumetric proxy, which a pretrained Sparse-Structure VAE then encodes into a latent code. After normalization, this latent is fused with Gaussian noise via cosine interpolation, where the interpolation weight controls the balance between determinism and generative stochasticity. The geometric prior is injected at generation time and acts as a hard constraint: geometry strictly adheres to the OSM proxy, guaranteeing precise alignment with real-world data (Kang et al., 14 Nov 2025).
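The two operations described here, footprint extrusion and noise blending, can be illustrated with a small standard-library sketch. Everything in it is an assumption made for illustration: the function names, the voxel resolution, the unit-cube coordinate frame, and the exact blend formulas are not taken from the paper, and the blend is applied to a plain list standing in for the VAE latent. Because the source describes the blend both as linear and as cosine interpolation, both variants are shown.

```python
import math
import random


def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test of a 2D point against a polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def extrude_footprint(polygon, height, resolution=8, extent=1.0):
    """Return the set of occupied (i, j, k) voxels of a footprint extruded
    vertically to `height`, sampled at voxel centres in [0, extent)^3."""
    cell = extent / resolution
    top = min(resolution, math.ceil(height / cell))
    occupied = set()
    for i in range(resolution):
        for j in range(resolution):
            cx, cy = (i + 0.5) * cell, (j + 0.5) * cell
            if point_in_polygon(cx, cy, polygon):
                for k in range(top):
                    occupied.add((i, j, k))
    return occupied


def blend_with_noise(latent, weight, rng, schedule="cosine"):
    """Blend a prior latent with Gaussian noise.
    weight=0 returns the prior unchanged; weight=1 is (almost) pure noise."""
    if schedule == "cosine":
        a = math.cos(weight * math.pi / 2)
        b = math.sin(weight * math.pi / 2)
    elif schedule == "linear":
        a, b = 1.0 - weight, weight
    else:
        raise ValueError("unknown schedule: %r" % schedule)
    return [a * z + b * rng.gauss(0.0, 1.0) for z in latent]


# Example: a square footprint occupying the central quarter of the tile.
square = [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]
proxy = extrude_footprint(square, height=0.5)
noisy = blend_with_noise([0.5, -1.0, 2.0], weight=0.3, rng=random.Random(0))
```

In this sketch a weight near 0 reproduces the proxy latent almost exactly, while larger weights give the generator more freedom at the cost of fidelity to the prior.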
4. Appearance Control and Semantic Guidance
To achieve fine-grained and regionally consistent style, buildings are grouped into clusters that are similar in both appearance and geometry. For each cluster, an MLLM generates a compact textual description, which a text-to-image generator renders into a frontal style image. Cross-attention streams within each TRELLIS block attend to both the per-building top-view feature and the cluster style feature.
The interleaving of structure (top-view) and style (appearance) pathways allows geometric fidelity and stylistic consistency to be enforced simultaneously, particularly in separating roof textures from façade details. This approach resolves previous limitations wherein virtual cities lacked visually plausible appearance or suffered from inconsistent styling across neighborhoods (Kang et al., 14 Nov 2025).
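The dual-conditioning idea can be made concrete with a toy cross-attention block. This is a simplified sketch, not the TRELLIS block itself: it uses a single head, identity projections instead of learned query/key/value matrices, and it assumes the top-view stream is applied before the style stream, an ordering the source does not specify. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention with identity projections.
    `queries` and `context` are lists of equal-dimension float vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def dual_condition_block(tokens, top_view_tokens, style_tokens):
    """Residual cross-attention to top-view features, then to style features."""
    attended = cross_attend(tokens, top_view_tokens)
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    attended = cross_attend(tokens, style_tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


# Example: one latent token conditioned on one top-view and one style token.
updated = dual_condition_block([[1.0, 0.0]], [[0.5, 0.5]], [[0.0, 2.0]])
```

Keeping the two conditioning signals in separate attention streams, rather than concatenating them into one context, is what allows each stream to be weighted or swapped independently in this sketch.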
5. Training Regime and Objective Functions
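Sat2RealCity builds on TRELLIS’s two-stage diffusion backbone (SS-Flow for structure, SLAT-Flow for latent appearance) and employs Conditional Flow Matching (CFM) as its sole fine-tuning loss: $\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\,\lVert v_\theta(x_t, t) - (\epsilon - x_0) \rVert_2^2$ with $x_t = (1 - t)\,x_0 + t\,\epsilon$, where $x_0$ is a clean latent, $\epsilon$ is Gaussian noise, $t$ is the time step, and $v_\theta$ is the predicted velocity field. Fine-tuning occurs on 11.5K high-quality 3D AIGC building instances. The only new tunable parameter is the interpolation weight of the OSM prior. In practice, no explicit spatial or style reconstruction losses are needed; the CFM loss suffices for high-fidelity joint geometry–appearance modeling (Kang et al., 14 Nov 2025).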
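A conditional flow matching loss can be sketched in a few lines. The version below assumes the rectified-flow parameterization used by TRELLIS, in which a clean latent x0 and Gaussian noise eps are mixed as x_t = (1 - t) * x0 + t * eps and the regression target is the velocity eps - x0. The function name, the per-element averaging, and the omission of conditioning inputs are simplifications made for illustration.

```python
import random


def cfm_loss(velocity_model, x0_batch, rng):
    """Monte Carlo estimate of the conditional flow matching loss.

    velocity_model(x_t, t) must return a list with the same length as x_t.
    x0_batch is a list of clean latents, each a list of floats.
    """
    total, count = 0.0, 0
    for x0 in x0_batch:
        t = rng.random()                                  # time step in [0, 1)
        eps = [rng.gauss(0.0, 1.0) for _ in x0]           # Gaussian noise sample
        x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
        target = [e - a for a, e in zip(x0, eps)]         # velocity eps - x0
        pred = velocity_model(x_t, t)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
        count += len(x0)
    return total / count


# Example: a model that always predicts zero velocity has a positive loss.
zero_model = lambda x_t, t: [0.0 for _ in x_t]
loss = cfm_loss(zero_model, [[1.0, 2.0], [0.5, -0.5]], random.Random(0))
```

As a sanity check, when every clean latent is the zero vector the exact velocity is x_t / t, so a model returning that value drives the loss to (numerically) zero.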
6. Empirical Evaluation
Performance is measured along three axes: geometry, appearance, and regional stylistic consistency.
- Geometry (vs. OSM-derived “ground-truth” point clouds):
  - Chamfer Distance (lower is better): Sat2RealCity achieves CD ≈ 0.0118, outperforming TRELLIS-MV (0.0184).
  - F-score at a 1 m threshold (higher is better): Sat2RealCity reaches F ≈ 0.8554; the best baseline reaches F ≈ 0.7929.
- Appearance:
  - CLIP-Score (vs. Google Earth renderings): Sat2RealCity scores 0.8563; the next best method, TRELLIS-MV-S, scores 0.8120.
- Regional stylistic consistency:
  - The score is defined as the product of an IoU term and a CLIP term.
  - Sat2RealCity: ≈ 0.7885
  - Best baseline: ≈ 0.3086
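For readers unfamiliar with the geometry metrics listed above, the following sketch computes a Chamfer Distance and a thresholded F-score between two small point clouds. Conventions differ between papers (squared versus unsquared distances, sums versus means), and the source does not state which one it uses, so this version, based on mean unsquared nearest-neighbour distances, is an assumption. The brute-force search is only suitable for small clouds.

```python
import math


def nearest_distances(source, target):
    """For each point in `source`, the Euclidean distance to its nearest
    neighbour in `target` (brute force, O(len(source) * len(target)))."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean nearest-neighbour distance from
    pred to gt plus the same quantity from gt to pred."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision (pred points within `threshold` of gt)
    and recall (gt points within `threshold` of pred)."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in d_pred) / len(d_pred)
    recall = sum(d <= threshold for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example with two tiny clouds; coordinates are in metres by assumption.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, threshold=1.0)
```

In this example both predicted points lie within 1 m of the reference cloud (precision 1.0), but only one of the two reference points is covered (recall 0.5), which is why the F-score is below 1 even though every predicted point is accurate.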
Qualitative evaluation confirms precisely aligned roofs and façade textures, with cohesive style across neighborhoods and far fewer texture seams or rotation artifacts than prior work (Kang et al., 14 Nov 2025).
7. Context, Impact, and Future Directions
Sat2RealCity sits within an active field of research on city-scale generative modeling from remote sensing and sparse urban data. It builds on and extends approaches such as Sat2City, which pioneered full 3D latent-diffusion-based city generation from height maps but was validated only on synthetic data and relied on dense voxel grids with synthetic DSM supervision (Hua et al., 6 Jul 2025). Sat2Scene applies diffusion to colorize pre-defined geometries but cannot refine or control urban appearance at the instance level (Li et al., 2024), while video synthesis models such as CityRAG focus on spatial trajectory video generation and do not address explicit 3D structure (Chou et al., 21 Apr 2026). In contrast, Sat2RealCity couples interpretable urban geometry and appearance control with off-the-shelf geospatial and visual data.
By leveraging real OSM geometry, MLLM-driven appearance synthesis, and pretrained generative models for building-level composition, Sat2RealCity enables city-scale 3D assets suitable for downstream applications in digital twins, interactive urban planning, virtual tourism, and simulation environments. Future directions include expanding to more diverse urban morphologies, integrating richer multimodal context (e.g., LiDAR or high-frequency remote sensing), and constructing large-scale real-world satellite-to-3D datasets for further improvement in structural realism and generalization (Kang et al., 14 Nov 2025).
References:
- "Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery" (Kang et al., 14 Nov 2025)
- "Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion" (Hua et al., 6 Jul 2025)
- "Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion" (Li et al., 2024)
- "CityRAG: Stepping Into a City via Spatially-Grounded Video Generation" (Chou et al., 21 Apr 2026)