Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Published 19 May 2026 in cs.CV | (2605.19656v1)

Abstract: We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

Summary

  • The paper introduces a feed-forward approach that predicts pixel-aligned Gaussian splats from both ground-level and satellite images.
  • It aligns ground and bird's-eye feature representations with cross-attention, which improves scene coverage and novel-view synthesis compared to ground imagery alone.
  • The authors introduce a new benchmark for novel-view synthesis with georeferenced imagery; potential uses include scalable scene reconstruction for urban planning and autonomous navigation.

Overview of Cross-View Splatter

The paper "Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images" (2605.19656) introduces a method for novel-view synthesis that combines georeferenced ground images with satellite imagery. Reconstruction from ground-level photographs alone is limited by camera coverage, so the method adds orthorectified satellite views as a geometric prior. Cross-View Splatter predicts Gaussian splats from both perspectives in a single feed-forward pass and places them in a unified coordinate system, which improves scene coverage and view extrapolation.

Methodology and Contributions

The summary highlights three contributions:

  1. Feed-forward Gaussian Splats: The model predicts Gaussian splats for both ground-level perspective images and orthorectified satellite views. Because both sets share one 3D coordinate frame, it covers more of the scene and supports novel-view synthesis better than models restricted to ground imagery.
  2. Unified Feature Space: Cross-attention encodes ground and satellite viewpoints into a unified feature space, aligning features between ground-level images and bird's-eye views. The design builds on pre-trained 3D reconstruction models.
  3. Training Data and Benchmark: The authors curated georeferenced datasets and paired satellite-terrain data for training, and introduce a new benchmark for novel-view synthesis with georeferenced imagery that allows comparison to prior state-of-the-art methods.
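
To make the cross-attention idea concrete, here is a minimal sketch in plain Python. It assumes single-head scaled dot-product attention with a residual connection and no learned projections, normalization, or MLP layers, so it is an illustration of the mechanism rather than the paper's architecture. The function names and the token layout (lists of float vectors) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Update each query token with an attention-weighted mix of context tokens."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product similarity between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        # Residual connection: keep the query and add what it gathered.
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated

def bidirectional_cross_attention(ground_tokens, sat_tokens):
    """Ground tokens attend to satellite tokens and vice versa (one layer)."""
    new_ground = cross_attend(ground_tokens, sat_tokens)
    new_sat = cross_attend(sat_tokens, ground_tokens)
    return new_ground, new_sat

ground = [[1.0, 0.0], [0.0, 1.0]]          # toy ground-view tokens
satellite = [[0.5, 0.5]]                   # toy satellite-view token
new_ground, new_sat = bidirectional_cross_attention(ground, satellite)
```

Both directions read from the original tokens, so neither update depends on the order in which they are computed.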

Experimental Results

The experiments show that Cross-View Splatter outperforms several existing methods for novel-view synthesis, particularly with sparse input views and challenging outdoor scenes. Quantitatively, it reports better PSNR, SSIM, and LPIPS scores than the baselines (higher is better for PSNR and SSIM, lower for LPIPS), indicating more accurate and perceptually consistent renderings.
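
As a reminder of what the simplest of these metrics measures, the sketch below computes PSNR between a rendered image and a target image. It assumes grayscale images stored as nested lists with values in [0, 1]; the paper's exact evaluation protocol (resolution, color handling, masking) is not described here, and the function name is hypothetical.

```python
import math

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    flat_r = [v for row in rendered for v in row]
    flat_t = [v for row in target for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_r, flat_t)) / len(flat_r)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
score = psnr([[0.0, 0.0]], [[0.1, 0.1]])
```

SSIM and LPIPS are more involved (local structure statistics and a learned network, respectively) and are normally taken from an existing library rather than reimplemented.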

The method was evaluated on test scenes derived from established datasets such as Tanks and Temples and DL3DV-Benchmark, and the results confirm that adding satellite imagery helps view extrapolation. Visualizations show sharper, more reliable depth maps and better filling of occluded and unobserved regions in urban and other outdoor scenes.

Implications and Future Work

Cross-View Splatter points toward scalable and efficient scene reconstruction for applications in urban planning, autonomous navigation, and augmented reality. Because it works from georeferenced ground photos plus publicly available satellite data, it reduces the dense ground capture typically needed for 3D reconstruction at scale.

Future work could explore integration with other image modalities and explicit handling of imagery from different geographic or seasonal contexts. Because satellite imagery varies in resolution and capture conditions, adaptive techniques and more robust data normalization may improve accuracy across environments.

Conclusion

The paper advances feed-forward view synthesis by showing that satellite data can compensate for limited ground-level imagery. By extending 3D Gaussian splatting to combine perspective ground views with orthographic satellite views, Cross-View Splatter relaxes the coverage constraints of ground-only reconstruction and offers a useful starting point for later work on large-scale outdoor scenes.

Explain it Like I'm 14

What this paper is about

This paper introduces Cross-View Splatter, a fast AI method that builds a 3D scene from two kinds of pictures of the same place:

  • normal photos taken on the ground (like a phone photo) with GPS,
  • a top‑down satellite image (like you see in online maps).

The goal is to make new, realistic views of the scene from angles you didn’t originally capture, using both the ground photos and the satellite view together.

The main questions the authors ask

  • Can we combine easy‑to‑get satellite images with a few GPS‑tagged ground photos to make better 3D models of outdoor places?
  • Can we do this quickly, in a single pass of a neural network (called “feed‑forward”), without slow per‑scene tuning?
  • Can satellite images help “fill in” areas the ground photos don’t see well, so new views look more complete?

How the method works, in simple terms

Think of the 3D scene as being made from thousands of tiny, soft, colored dots floating in space. The paper calls these dots “Gaussian splats.” When you look at them from a camera viewpoint and blend them together, they form a realistic image of the scene.
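
For readers who code, the "blend them together" step can be sketched for a single pixel as front-to-back alpha compositing. This is a simplified illustration: it assumes each splat already has a depth, an opacity at this pixel, and a color, and it ignores how a real Gaussian splatting renderer projects and rasterizes the blobs. The function name and tuple layout are hypothetical.

```python
def composite_pixel(splats):
    """Blend splats covering one pixel, nearest first.

    splats: list of (depth, alpha, (r, g, b)) tuples.
    Returns the blended color and the leftover transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for depth, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha
        color = [c + weight * channel for c, channel in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance

# A half-transparent red dot in front of an opaque blue dot.
pixel, leftover = composite_pixel([
    (2.0, 1.0, (0.0, 0.0, 1.0)),  # far, opaque blue
    (1.0, 0.5, (1.0, 0.0, 0.0)),  # near, half-transparent red
])
```

Here the near red dot contributes half of the final color and the blue dot behind it fills in the rest.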

Here’s the idea, step by step:

  • Input:
    • One or more ground photos with GPS and the direction the camera was pointing.
    • One top‑down satellite image of the same area.
  • Two “brains,” one per view:
    • A ground branch looks at the ground photos and estimates how far things are (a depth map) and where the camera was.
    • A satellite branch looks at the top‑down image and estimates a height map (how high the ground or buildings are at each pixel).
  • Cross‑talk:
    • The two branches share information using a mechanism called cross‑attention. You can imagine it like two friends comparing notes: the ground friend has close-up details, the satellite friend sees the overall layout from above.
  • Building the 3D dots:
    • From the ground depth, the model places colored dots in 3D using normal perspective (how your eyes see).
    • From the satellite height map, it places dots using an orthographic “from above” setup (like a perfect map with no tilt).
  • Unifying everything:
    • Both sets of dots are put into the same 3D coordinate system (so they line up).
    • The model then renders these dots to make new pictures from viewpoints you choose.
  • Training help:
    • During training, they use public elevation/terrain data (Digital Elevation Models) to teach the satellite branch what heights look like.
    • At test time, you only need the ground photo(s) with GPS and the satellite image—no extra data.
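
The two "building the 3D dots" steps above can be sketched as follows. This is a minimal illustration, assuming a pinhole camera for the ground view and a north-up satellite tile with square pixels; the axis conventions, parameter names, and function names are assumptions and are not taken from the paper. The ground points come out in the camera's own frame and would still need the camera pose applied to land in the shared frame.

```python
def backproject_perspective(depth, fx, fy, cx, cy):
    """Lift a ground-view depth map to camera-frame 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # A pixel's ray spreads out with distance, so x and y scale with depth.
            points.append(((u - cx) / fx * z, (v - cy) / fy * z, z))
    return points

def backproject_orthographic(height, meters_per_pixel):
    """Lift a top-down height map to 3D points (x east, y north, z up).

    The origin is the tile center; image rows are assumed to run north to south.
    """
    rows, cols = len(height), len(height[0])
    points = []
    for r, row in enumerate(height):
        for c, h in enumerate(row):
            # Orthographic: horizontal position depends only on the pixel index.
            x = (c + 0.5 - cols / 2) * meters_per_pixel
            y = (rows / 2 - (r + 0.5)) * meters_per_pixel
            points.append((x, y, h))
    return points

ground_points = backproject_perspective([[2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
sat_points = backproject_orthographic([[1.0, 2.0], [3.0, 4.0]], meters_per_pixel=2.0)
```

The contrast between the two functions is the point: in the perspective case the horizontal offset grows with depth, while in the orthographic case height never changes where a pixel sits on the map.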

Helpful translations of technical terms:

  • Feed‑forward: the network makes its prediction in one go, without slow trial-and-error loops.
  • Gaussian splats: tiny, soft blobs (like many dots of spray paint) whose sizes, colors, and opacities can be blended to draw a 3D scene.
  • Orthorectified satellite image: a top‑down map image corrected so everything is measured “straight down,” without the slanted look of perspective.
  • 3DoF vs 6DoF: 3 degrees of freedom (3DoF) here means latitude, longitude, and heading (where you are and which way you face). 6 degrees of freedom (6DoF) is the camera's full 3D position and full 3D rotation.
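
As a small illustration of how a GPS location and heading can be turned into map coordinates, the sketch below converts latitude/longitude into east/north offsets in meters around a reference point and turns a compass heading into a direction vector. It assumes a spherical Earth with an equirectangular approximation (reasonable only over small areas) and a heading measured clockwise from north; none of these choices, nor the function names, come from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def latlon_to_local_en(lat, lon, ref_lat, ref_lon):
    """Approximate east/north offsets in meters from a reference point."""
    east = math.radians(lon - ref_lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    north = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return east, north

def heading_to_forward(heading_deg):
    """Unit (east, north) vector for a compass heading, clockwise from north."""
    h = math.radians(heading_deg)
    return math.sin(h), math.cos(h)

# Moving 0.001 degrees of latitude north is roughly 111 meters.
east, north = latlon_to_local_en(37.001, -122.0, 37.0, -122.0)
forward = heading_to_forward(90.0)  # facing east
```

These three numbers (east, north, heading) are the 3DoF pose; altitude, pitch, and roll are what a full 6DoF pose would add.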

What they found and why it matters

Main results:

  • Using the satellite view together with ground photos gives better scene coverage and more accurate new views than using ground photos alone.
  • The biggest gains appear when the input photos don’t overlap much with the target view (for example, you’re trying to see around a corner or farther away). The satellite image helps the model “extrapolate” the global layout (roads, building footprints, open areas).
  • On their new outdoor benchmarks (built by aligning well-known datasets to satellite maps), their method often beats other fast, feed‑forward 3D methods and even competes with slower, per‑scene optimization methods.
  • The method produces sharper, more accurate depth and more reliable renderings of big outdoor scenes.

Why this is important:

  • Collecting lots of ground photos for big areas (like a whole neighborhood) is hard and expensive. Satellite images are easy to access and cover large regions. Combining them makes scalable 3D mapping more practical.

What this could lead to

  • Better city‑scale 3D maps for games, AR, and navigation, built faster and with fewer ground photos.
  • More robust outdoor 3D understanding for robots and self‑driving cars, especially when camera coverage is sparse.
  • Tools that can quickly generate plausible views of places from new angles, useful for planning, inspection, or education.

In short, Cross‑View Splatter shows that mixing a bird’s‑eye view (for the big picture) with ground photos (for close detail) is a powerful, fast way to create 3D scenes and new viewpoints in the real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, formulated to be specific and actionable for future research.

  • Sensitivity to georeferencing errors: The method assumes accurate GPS and heading (3DoF) for ground images and treats the reference image as zero altitude; the impact of GPS noise, heading misalignment, magnetic declination, and pitch/roll errors on reconstruction quality and BEV alignment is not quantified or mitigated.
  • Absolute metric accuracy: Predictions are expressed in a normalized coordinate frame via per-batch l2 scaling; there is no evaluation of absolute metric fidelity (e.g., height/elevation accuracy versus DEMs, metric distances in meters, or alignment to GIS coordinates).
  • Orthographic projection assumptions: Satellite views are modeled as perfectly orthographic with fixed rsat; the consequences of orthorectification artifacts, residual parallax, off-nadir angles, terrain-induced distortions, and provider-specific preprocessing are not studied.
  • Domain shift across satellite providers and conditions: Robustness to variations in satellite resolution, compression, seasonality, weather, time-of-day, and different mapping services (Google, Azure, Esri) is not evaluated; no ablation on rsat or provider choice.
  • Limited spatial extent and tiling: Inference uses a single 512×512 satellite tile with a fixed 244 m spatial extent; scalability to multi-tile city-scale coverage, tile stitching, continuity across tiles, and handling of boundary artifacts remains unexplored.
  • Merging strategy for ground and BEV splats: The union G_combined = G_ground ∪ G_sat lacks an explicit mechanism for duplicate suppression, conflict resolution, cross-view occlusion reasoning, or enforcing geometric consistency beyond a combined rendering loss.
  • Uncertainty handling: Confidence maps (for depth and height) are used only as training weights; there is no uncertainty propagation to rendering, calibration of confidence estimates, or risk-aware selection/weighting of splats at inference.
  • Camera estimation robustness: Ground truth camera poses/intrinsics are used during training, but inference relies on regressors; the paper does not report camera pose/intrinsics accuracy, sensitivity analyses, or how pose errors translate to view-synthesis degradation.
  • Unknown-pose multi-image inputs: The model claims support for multiple ground images with unknown 6DoF poses, but there is no dedicated evaluation under fully pose-free inputs or comparison to pose-free baselines in that setting.
  • Dynamic scene elements: Handling of moving objects (vehicles, pedestrians), vegetation motion, or water surfaces is not addressed; datasets and losses largely assume static scenes, raising questions about artifacts and temporal inconsistency in real-world captures.
  • Sky segmentation dependence: Sky regularization relies on an off-the-shelf segmenter; failure cases, domain generalization of sky masks, parameter sensitivity (e.g., threshold T), and impacts on non-sky bright regions are not analyzed.
  • Evaluation scope and metrics: Benchmarks focus on PSNR/SSIM/LPIPS; there is no quantitative assessment of 3D geometry quality (e.g., point-to-surface error, Chamfer/IoU, elevation RMSE), BEV alignment error, or semantic consistency (roads/building footprints).
  • Benchmark construction and reproducibility: The new geolocalized benchmark requires manual alignment of COLMAP reconstructions to satellite imagery; the potential subjectivity, annotation error, and inter-annotator variability are not measured, and licensing constraints prevent releasing satellite imagery, complicating reproducibility.
  • Fairness and coverage of baselines: Some baselines require multi-view input or use ground-truth intrinsics; the impact of these differences on fairness is not fully normalized, and key aerial/BEV-capable baselines (e.g., off-nadir multi-view approaches) are not compared due to data availability.
  • Computational performance: Training setup is reported, but inference time, memory footprint (tokens, gsplat rasterization), scalability to longer sequences or higher resolutions, and deployment feasibility on edge devices are not documented.
  • Satellite height supervision quality: Training relies on public DEM/lidar with varying coverage, resolution, and vertical datums; there is no analysis of label noise, datum alignment to the assumed zero altitude, or the method’s robustness where DEMs are missing or outdated.
  • Tall structures and vertical surfaces: The BEV height-map and orthographic assumptions provide weak cues for building facades, overhangs, and vertical discontinuities; limitations on reconstructing vertical geometry and occluded backfaces from satellite are not characterized.
  • Cross-view feature alignment design: The bidirectional Attnmeta (12 layers) choice is not ablated; alternative architectures (e.g., explicit cross-view geometric priors, deformable attention, separate satellite encoders) and their trade-offs are unexplored.
  • Multi-temporal/multi-angle satellite data: The approach uses a single orthorectified satellite image; potential gains from multi-temporal imagery (seasonal differences), off-nadir multi-angle views, or RPC-based satellite camera models remain an open avenue.
  • Semantic priors: Leveraging map semantics (roads, building footprints, land use) to improve cross-view alignment and occlusion reasoning is not integrated or evaluated.
  • Ground camera calibration: Effects of lens distortion, rolling shutter, and inaccurate intrinsics on splat backprojection are not analyzed; the camera head’s ability to compensate for real-world calibration variability is unclear.
  • Large-baseline extrapolation failure modes: Although stratified PSNR suggests gains at low overlap, there is no qualitative/quantitative catalog of failure patterns (e.g., stretching, ghosting, color bleeding) or strategies to detect and avoid unreliable extrapolation.
  • Rendering consistency across perspectives: Differences in Gaussian size behavior between perspective and orthographic projections are acknowledged but not enforced via explicit consistency constraints (e.g., scale-aware priors), leaving potential artifacts in BEV renders.
  • GIS integration: The paper does not demonstrate exporting splats to geospatial coordinates (e.g., EPSG codes), assessing alignment against map layers, or enabling downstream GIS/AR applications that require metric-accurate geo-registration.
  • API and legal constraints: Practical deployment issues—API rate limits, costs, caching strategies, and legal/ToS constraints of satellite providers—are not discussed, yet they may materially affect scalability and reproducibility.
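
To put the tile figures quoted in the list above in perspective, the short calculation below derives the ground resolution of a 512×512 tile covering 244 m and estimates how many such tiles a larger square area would need. It assumes non-overlapping square tiles, which is a simplification; a real multi-tile pipeline would likely need overlap for stitching. The function names are hypothetical.

```python
import math

def meters_per_pixel(extent_m, tile_px):
    """Ground distance covered by one pixel of a square tile."""
    return extent_m / tile_px

def tiles_needed(area_side_m, extent_m):
    """Number of non-overlapping square tiles covering a square area."""
    per_side = math.ceil(area_side_m / extent_m)
    return per_side ** 2

resolution = meters_per_pixel(244, 512)   # about 0.48 m per pixel
tile_count = tiles_needed(1000, 244)      # tiles for a 1 km x 1 km area
```

At roughly half a meter per pixel, a single tile cannot resolve thin structures, and covering even one square kilometer already takes 25 tiles under these assumptions.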

Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging the paper’s feed-forward, cross-view Gaussian splatting with GPS/heading-tagged ground photos and orthorectified satellite imagery accessible via public APIs.

  • Rapid outdoor AR occlusion and anchoring
    • Sectors: Software, AR/VR, Location-based entertainment
    • What it enables: Generate coarse, occlusion-aware 3D proxies of outdoor scenes from a few smartphone photos plus satellite tiles to improve AR content placement, occlusion, and physics.
    • Tools/workflow: Mobile SDK or edge service that (1) reads GPS and heading from phone images, (2) fetches the BEV tile via Maps API, (3) runs Cross-View Splatter to produce a 3D Gaussian scene, (4) exports to an AR engine (e.g., converting to mesh or using a 3DGS renderer).
    • Assumptions/dependencies: Accurate GPS/heading, availability and licensing of satellite tiles; coarse geometry (sub-meter satellite resolution) limits fine occlusion; GPU (or cloud) inference for 3DGS rendering.
  • Location-based game content scaffolding
    • Sectors: Gaming, AR/XR
    • What it enables: Near-instant scene proxies for new play areas to spawn, occlude, and route content without full photogrammetry passes.
    • Tools/workflow: Batch process player-submitted geotagged photos; unify with satellite tiles; deploy 3DGS assets to the game map; optionally downsample to lightweight splats/mesh.
    • Assumptions/dependencies: Terms of service for satellite providers; robustness to season/illumination shifts; not suitable for safety-critical navigation.
  • Field documentation for AEC predesign and site walkthroughs
    • Sectors: Architecture, Engineering, Construction (AEC); Real estate
    • What it enables: Coarse 3D site context from a handful of ground photos to plan viewpoints, access, staging areas, and crane/path constraints.
    • Tools/workflow: GIS/BIM plugin that ingests geotagged photos, queries BEV tiles, runs the model, exports a mesh/point-splat layer for Revit/Civil 3D/QGIS/ArcGIS.
    • Assumptions/dependencies: Height accuracy is approximate (relative to locally defined zero altitude); orthorectified tiles suppress parallax—do not expect fine facade fidelity.
  • Disaster response triage maps from sparse ground evidence
    • Sectors: Public safety, Humanitarian response, Policy (emergency management)
    • What it enables: Quickly reconstruct 3D context around damage photos to assess access routes, line-of-sight, and rough obstruction patterns when full surveys are infeasible.
    • Tools/workflow: Web app where responders upload geotagged images; server fetches satellite tiles and produces a 3DGS snapshot for planning.
    • Assumptions/dependencies: Not for precise measurements; satellite recency/coverage varies; GPS errors in dense urban areas can misalign the BEV tile.
  • Robotics/drone navigation priors in GNSS-available environments
    • Sectors: Robotics, Drones/UAV
    • What it enables: Coarse BEV height maps and 3D proxies to seed SLAM, global planning, or simulation in outdoor scenes before detailed mapping.
    • Tools/workflow: ROS node that consumes geotagged frames, produces 3DGS/height prior for local planners/state estimators.
    • Assumptions/dependencies: Prior is coarse and should be treated as initialization; heading errors degrade alignment; not a replacement for on-board mapping.
  • Telecom pre-qualification for line-of-sight and siting
    • Sectors: Telecom (5G/FR2, FWA)
    • What it enables: Rapid, approximate LOS checks and site screening using predicted height maps and combined ground/BEV splats.
    • Tools/workflow: Planning tool that queries satellite tiles for candidate sites and ingests a few street-level photos to refine local geometry.
    • Assumptions/dependencies: Use only for pre-qualification—accuracy insufficient for final RF design; foliage/season changes impact reliability.
  • VFX/previz location proxies
    • Sectors: Media/Film
    • What it enables: Generate quick 3D stand-ins of outdoor locations from scouting photos for blocking and set extension planning.
    • Tools/workflow: DCC plugin (Blender/Maya/Unreal) to import 3DGS and optionally bake to mesh.
    • Assumptions/dependencies: Photoreal textures may be limited; consider downstream refinement or re-texturing; licensing of satellite base layers.
  • GIS gap-filling and community mapping
    • Sectors: Geospatial, Civic tech
    • What it enables: Fill coverage gaps in street-level data by fusing sparse user captures with BEV imagery to obtain coarse 3D layers.
    • Tools/workflow: Batch pipeline in a GIS server that builds 3D layers per tile using crowdsourced photos.
    • Assumptions/dependencies: Legal use of map tiles; DEMs not needed at inference, but local terrain extremes may challenge height estimation.
  • Academic benchmarking and model research
    • Sectors: Academia (computer vision, remote sensing)
    • What it enables: Immediate use of the curated georeferenced datasets, training code, and the new benchmark to study cross-view fusion and 3DGS learning.
    • Tools/workflow: Reproduce training with provided data-prep; evaluate on the geolocalized Tanks & Temples and DL3DV splits; extend cross-attention designs.
    • Assumptions/dependencies: Satellite imagery must be re-queried due to licensing; compute/GPU for training and inference.
  • Hobbyist outdoor scene capture
    • Sectors: Daily life, Creator tools
    • What it enables: Create shareable 3D walkarounds of parks, plazas, or trailheads from a few phone photos.
    • Tools/workflow: Mobile app that auto-fetches BEV tile and runs inference in the cloud; shares 3DGS or baked mesh on the web.
    • Assumptions/dependencies: Requires connectivity; outputs are coarse; privacy considerations for public captures.

Long-Term Applications

The following use cases require further research, scaling, integration, or policy/standards development before broad deployment.

  • City-scale, low-touch digital twins from crowd and satellite
    • Sectors: Smart cities, Urban planning, Mapping platforms
    • What it could enable: Continuous 3D updates of urban areas by fusing opportunistic geotagged photos with satellite priors, reducing dependence on dedicated surveys.
    • Tools/workflow: Persistent “AR cloud”/map service that streams tile-aligned 3DGS layers and periodically densifies to meshes; change detection over time.
    • Assumptions/dependencies: Robust handling of GPS/heading drift, dynamic scenes, and seasonal shifts; scalable deduplication/merging; sustainable licensing of global BEV data.
  • HD map bootstrapping and maintenance for automated driving
    • Sectors: Automotive, Robotics
    • What it could enable: Fast bootstraps for lane-level mapping and continuous maintenance using consumer-grade captures plus BEV priors.
    • Tools/workflow: Cross-view 3D priors feed into downstream semantic extractors (lanes, curbs, signs) and map conflation tools.
    • Assumptions/dependencies: Current geometry is too coarse for lane-grade precision; needs tighter metric guarantees, semantics, temporal consistency, and rigorous validation.
  • Disaster digital twins and rapid damage quantification
    • Sectors: Emergency management, Insurance
    • What it could enable: Near-real-time city-block reconstructions with uncertainty maps to prioritize response and estimate losses.
    • Tools/workflow: Fuse UAV, satellite, and sparse ground images; propagate uncertainty from confidence heads; integrate with risk models.
    • Assumptions/dependencies: Requires robust multi-sensor fusion and calibrated uncertainty; access to up-to-date satellite/UAV data under emergency-use licenses.
  • Outdoor AR cloud with persistent, multi-user alignment
    • Sectors: AR/XR, Telecommunications
    • What it could enable: Shared, spatially anchored content with scene proxies everywhere, refined on-demand by user scans.
    • Tools/workflow: On-device relocalization to BEV-aligned splats; incremental updates merged in the cloud; edge rendering of 3DGS.
    • Assumptions/dependencies: Standards for map layers and privacy; low-latency edge compute; handling of moving objects and occluders.
  • Generative completion and photorealistic texturing of 3DGS
    • Sectors: Media, Gaming, Visualization
    • What it could enable: Fill unobserved regions with learned priors while preserving geometric consistency from BEV-ground fusion.
    • Tools/workflow: Couple 3DGS with 3D-aware diffusion for texture completion; quality controls to prevent hallucination where accuracy matters.
    • Assumptions/dependencies: Safety constraints for non-hallucination in critical uses; strong priors across diverse geographies.
  • Telecom-grade propagation modeling and siting optimization
    • Sectors: Telecom
    • What it could enable: Replace expensive surveys with hybrid models that convert splats to watertight meshes and material classes for RF simulation.
    • Tools/workflow: Semantic enrichment (building/vegetation/material labels), mesh conversion, calibration against drive-test data.
    • Assumptions/dependencies: Requires high-fidelity height/material estimates; integration with propagation engines; regulatory compliance.
  • Environmental/climate risk assessment at parcel-to-district scale
    • Sectors: Climate tech, Insurance, Public policy
    • What it could enable: Rapid 3D built-environment context for flood, wind, and heat-risk models in data-sparse regions.
    • Tools/workflow: Align to DEMs, add hydrology layers; convert splats to elevation surfaces and roughness metrics per block.
    • Assumptions/dependencies: Needs calibrated metric scale and uncertainties; robust treatment of vegetation and overhangs.
  • Standards for geospatial 3DGS interchange and OGC integration
    • Sectors: Policy, Standards bodies, Geospatial software
    • What it could enable: Interoperable “Gaussian layers” alongside tiles/point clouds for streaming outdoor proxies.
    • Tools/workflow: Define CRS, scale normalization, and metadata (confidence, timestamp) for splat layers; reference implementations in GIS.
    • Assumptions/dependencies: Community consensus, performance on commodity hardware, and clear IP/licensing for derivative products.
  • Privacy-preserving, on-device cross-view reconstruction
    • Sectors: Consumer tech, Privacy
    • What it could enable: Local inference with no photo upload; only encrypted 3D proxies shared.
    • Tools/workflow: Efficient model distillation for mobile NPUs; federated updates; differential privacy for shared map layers.
    • Assumptions/dependencies: Model compression and on-device 3DGS rendering; clear privacy policies and user consent.
  • Global-scale mapping in denied or low-data areas
    • Sectors: Defense, Humanitarian mapping
    • What it could enable: Produce workable 3D context from limited ground captures plus publicly available satellite tiles.
    • Tools/workflow: Low-bandwidth capture kits; offline inference pipelines; uncertainty-aware outputs.
    • Assumptions/dependencies: Legal/ethical constraints; resilience to degraded GPS; robustness to novel terrains.

Notes on Feasibility and Dependencies (cross-cutting)

  • Data access and licensing: The approach relies on orthorectified satellite tiles (e.g., Google Maps, Azure Maps, Esri World Imagery). Commercial terms often restrict derivative products; open alternatives or enterprise licenses may be required for production.
  • Geolocation quality: GPS and heading are assumed known for aligning BEV to ground images. Smartphone heading can be noisy; magnetometer errors degrade alignment and coverage.
  • Metric fidelity: The method uses scene normalization and satellite pixel-to-meter metadata; absolute accuracy may vary. Treat outputs as coarse geometry unless calibrated to surveyed control points or high-quality DEMs.
  • Resolution limits: BEV tiles (e.g., ~0.5 m/pixel) cap the recoverable detail; thin structures and detailed facades will be approximate.
  • Domain shift: Weather, season, and regional style differences can reduce quality; additional fine-tuning per geography may be needed.
  • Compute: Real-time rendering of 3DGS is GPU-friendly but still requires capable hardware or an edge/cloud pipeline for mobile use.
  • Safety-critical use: Not recommended without rigorous validation and uncertainty modeling; current benchmarks indicate improved coverage/extrapolation but not survey-grade accuracy.

Glossary

  • 3D foundation model: A large pre-trained model that provides general 3D geometry priors for downstream tasks. "3D foundation models are not trained on satellite imagery and therefore struggle in our cross-view setting."
  • 3D Gaussian splatting: A rendering method that represents scenes as collections of 3D Gaussian primitives and renders them efficiently. "can be rendered via 3D Gaussian splatting [40]"
  • 3DoF: Three degrees of freedom specifying position or orientation (here: latitude, longitude, heading) without full 3D pose. "Georeferenced imagery refers to images tagged with 3DoF pose information"
  • 6DoF: Six degrees of freedom describing a camera’s 3D position and orientation. "supports multiple input images with unknown 6DoF poses."
  • Alpha-blending: A technique to composite colors by weighting with transparency along depth order. "use depth-sorted alpha-blending of colors based on accumulated transmittance at each pixel location."
  • Backprojection: Mapping per-pixel depths from an image back into 3D space to recover point locations. "We backproject splats from ground images using perspective projection"
  • BEV (bird's-eye view): A top-down view of a scene that removes perspective effects to show layout from above. "From a bird's-eye view (BEV), large-scale features such as roads and building footprints are more clearly delineated"
  • COLMAP: A widely used SfM/MVS software for 3D reconstruction and camera estimation from images. "COLMAP [68] reconstructions"
  • Confidence-weighted loss: A training loss that weights errors by predicted confidence to reduce the impact of noisy targets. "a confidence weighted depth loss"
  • Cross-attention: An attention mechanism where one set of tokens attends to another to fuse information across views/modalities. "bidirectional cross-attention layers."
  • Digital Elevation Model (DEM): A raster representation of terrain elevations used for height and geometry supervision. "Digital Elevation Models (DEMs)"
  • DINOv2: A self-supervised vision backbone used here to extract patch tokens for geometry transformers. "Images I_ground ∈ ℝ^{H×W} are encoded using DINOv2 [56]"
  • DPT (Dense Prediction Transformer): A transformer-based head for dense predictions such as depth, height, or Gaussian attributes. "with DPT [63] heads."
  • Georeferenced imagery: Images associated with geographic coordinates and orientation enabling alignment with maps. "Georeferenced imagery refers to images tagged with 3DoF pose information"
  • GPS-tagged: Images with embedded GPS metadata used for geographic alignment. "GPS-tagged ground level images"
  • Height map: A per-pixel map of elevations relative to a reference frame used to place 3D elements. "regress a height map relative to I_ground"
  • IMU: Inertial Measurement Unit providing motion and orientation data to supplement GPS and camera info. "available in most devices with an IMU [1]"
  • Intrinsics (camera intrinsics): Parameters defining a camera’s internal geometry (e.g., focal length, principal point). "perspective camera intrinsics K_i"
  • LPIPS: A learned perceptual image similarity metric used to evaluate or supervise render quality. "We report PSNR, SSIM [89], and LPIPS (VGG-net) [102] metrics."
  • Multi-plane images (MPI): A layered scene representation with multiple depth-aligned planes for view synthesis. "multi-plane images (MPI) [105]"
  • Multi-view stereo (MVS): A class of methods that reconstruct 3D geometry from multiple calibrated views. "multi-view stereo (MVS) [24, 30, 66]"
  • Off-nadir imagery: Images captured at an angle away from the vertical (nadir), providing parallax for 3D recovery. "off-nadir, non-orthorectified imagery"
  • Orthoimagery: Aerial or satellite imagery geometrically corrected to have uniform scale and no perspective distortion. "Orthoimagery poses a fundamental challenge."
  • Orthographic projection: A projection model with parallel rays that removes perspective, often used for BEV/satellite views. "assumes an orthographic projection model"
  • Orthorectification: The process of removing perspective effects and terrain-induced distortions from imagery. "Orthorectification removes perspective effects and parallax"
  • Orthorectified satellite imagery: Satellite images corrected to a map-accurate, top-down view with constant scale. "orthorectified satellite imagery"
  • Parallax: Apparent displacement of objects due to viewpoint change, providing depth cues in perspective images. "Orthorectification removes perspective effects and parallax"
  • Perspective projection: A projection model where rays converge at a camera center, producing depth-dependent scaling. "using perspective projection"
  • PSNR: Peak Signal-to-Noise Ratio, a pixel-wise fidelity metric for image reconstruction quality. "We report PSNR, SSIM [89], and LPIPS (VGG-net) [102] metrics."
  • Rational Polynomial Camera (RPC): A camera model commonly used for satellite sensors that maps ground coordinates (latitude, longitude, height) to image coordinates using ratios of polynomials. "recover geometry and BEV camera models (RPCs)."
  • Rasterization: Converting geometric primitives (e.g., Gaussians) into pixels on an image grid for rendering. "splatting-based rasterization."
  • Ray tracing: Rendering technique that traces rays through a 3D representation to compute depths or colors. "voxel grids for ray-traced coarse ground-level depth."
  • Scene scale normalization: Normalizing predicted depths/poses to a common scale to stabilize training and inference. "Scene scale normalization."
  • Sky regularization: Losses designed to handle sky regions lacking reliable depth cues by enforcing plausible depth/opacity. "Sky regularization."
  • Spherical harmonics: A basis for representing view-dependent color on the sphere, used to encode Gaussian colors. "represented via order N_sh = 1 spherical harmonics."
  • SSIM: Structural Similarity Index, a perceptual metric evaluating structural similarity between images. "We report PSNR, SSIM [89], and LPIPS (VGG-net) [102] metrics."
  • Structure-from-Motion (SfM): Recovering camera poses and sparse 3D structure from overlapping images. "structure-from-motion (SfM) [8, 32, 57, 68]"
  • Transmittance: The fraction of light not yet absorbed/occluded along a ray, used in alpha compositing. "accumulated transmittance"
  • Tri-plane representation: A compact 3D feature representation using three orthogonal feature planes for rendering. "tri-plane representation [4]"
  • ViT (Vision Transformer): A transformer architecture operating on image patches for vision tasks. "a ViT [15] encoder-decoder"
  • Volume rendering: Rendering by integrating densities/colors along rays through a continuous 3D field. "NeRF-like volume rendering for view-synthesis."
  • Voxel grid: A 3D grid of volumetric cells used to store geometry or density for rendering or depth estimation. "voxel grids for ray-traced coarse ground-level depth."
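
The Alpha-blending and Transmittance entries above describe one procedure: sort contributions by depth, then weight each color by its opacity times the light that has not yet been blocked. The following is a minimal single-pixel sketch of that idea, not the paper's renderer. The function name `composite_front_to_back` and the `(depth, rgb, alpha)` sample layout are illustrative assumptions; a real splatting rasterizer would derive each alpha from a projected 2D Gaussian.

```python
def composite_front_to_back(samples):
    """Blend (depth, rgb, alpha) samples for one pixel, nearest sample first.

    Hypothetical helper for illustration only; the sample layout is an assumption.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked along the ray
    for _, rgb, alpha in sorted(samples, key=lambda s: s[0]):
        weight = transmittance * alpha  # contribution of this sample
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha  # light remaining behind this sample
    return color, transmittance


samples = [
    (5.0, (0.0, 0.0, 1.0), 1.0),  # far, fully opaque blue
    (2.0, (1.0, 0.0, 0.0), 0.5),  # near, half-transparent red
]
pixel, remaining = composite_front_to_back(samples)
print(pixel, remaining)
```

Because the near red sample is only half opaque, half of the light passes through to the blue sample behind it, and the opaque blue sample then drives the remaining transmittance to zero.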

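The Backprojection, Perspective projection, and Orthographic projection entries differ mainly in how a pixel is lifted into 3D. The sketch below contrasts the two under stated assumptions: depth is measured along the optical axis, the BEV image has a constant ground resolution `metres_per_pixel`, and all function and parameter names are hypothetical rather than taken from the paper.

```python
def backproject_perspective(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth into a camera-frame 3D point.

    Rays converge at the camera center, so the lateral offset grows with depth.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def backproject_orthographic(u, v, height, metres_per_pixel, cx, cy):
    """Lift BEV pixel (u, v) with a height value into a 3D point.

    Rays are parallel, so the lateral offset does not depend on height.
    """
    x = (u - cx) * metres_per_pixel
    y = (v - cy) * metres_per_pixel
    return (x, y, height)


# Same pixel, two depths: the perspective point moves sideways as depth grows.
near = backproject_perspective(570, 240, 10.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
far = backproject_perspective(570, 240, 20.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Same pixel, two heights: the orthographic point keeps its ground position.
low = backproject_orthographic(120, 100, 3.0, metres_per_pixel=0.5, cx=100.0, cy=100.0)
high = backproject_orthographic(120, 100, 30.0, metres_per_pixel=0.5, cx=100.0, cy=100.0)
print(near, far, low, high)
```

The orthographic case shows why the glossary notes that orthorectification removes parallax: changing the height leaves the ground-plane position unchanged, so a single top-down image carries no depth cue of that kind.
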