
Cross-View Splatter in 3D Scene Representation

Updated 31 May 2026
  • The paper introduces Cross-View Splatter, which fuses ground and satellite imagery via per-pixel 3D Gaussian splats to overcome incomplete 3D coverage in outdoor scenes.
  • It employs a dual transformer architecture with bidirectional cross-attention to align local ground details with global geometric priors from aerial views, boosting extrapolation robustness.
  • The method is validated on benchmarks like Tanks & Temples, achieving higher PSNR and improved scene completeness compared to prior feed-forward models.

Cross-View Splatter is a family of methodologies in 3D computer vision and graphics that enable feed-forward novel-view synthesis and scene representation by fusing imagery from widely varying perspectives, most notably ground-level photographs combined with overhead imagery such as satellite or drone views. The fusion relies on per-pixel 3D Gaussian splat parameterizations and cross-perspective feature alignment. Driven by the need to capture large-scale outdoor environments with minimal camera coverage, these methods fill the geometric and texture “holes” in ground imagery with the global priors afforded by satellite or aerial views, aligning all observations within a unified 3D world coordinate frame via explicit georeferencing and network-level cross-attention. Recent advances in this area have yielded substantial improvements in scene completeness, geometric fidelity, and extrapolation robustness compared to single-perspective feed-forward representations.

1. Motivation and Challenges of Cross-View Fusion

Traditional feed-forward novel-view synthesis models, trained exclusively on perspective ground-level images, are fundamentally limited by the difficulty of acquiring densely sampled ground images in large outdoor scenes. Such models exhibit incomplete 3D coverage (resulting in missing façades, rooftops, or obscured terrain), poor extrapolation to wide baselines, and heightened susceptibility to ambiguities in dense urban or vegetated environments. Satellite or orthorectified aerial imagery mitigates these weaknesses by providing an essentially complete, globally accessible bird’s-eye view with explicit geometric content (e.g., building footprints, roads, vegetation structure). The fusion of these cross-perspective data sources, in particular aligning and combining orthographic satellite tiles with GPS-tagged, georeferenced ground images, forms the basis for the cross-view splatter paradigm. This fusion allows the system to predict scene representations that combine the fine-grained, view-dependent detail from the ground with the geometric and coverage prior from overhead (Turkulainen et al., 19 May 2026).

2. Pixel-Aligned Gaussian Representation in a Unified 3D Frame

Cross-View Splatter represents outdoor scenes as the union of two sets of Gaussian splats, each associated with either a ground or satellite input pixel. Each splat is parameterized by a 3D mean position $p_j \in \mathbb{R}^3$, a covariance matrix $\Sigma_j \in \mathbb{R}^{3\times3}$, an opacity $o_j \in [0,1]$, and a spherical-harmonic color coefficient vector $c_j$. For each ground image, perspective projection and predicted depth $d_\text{ground}(u,v)$ are used to back-project each pixel into world coordinates:

$$p_j = T_i \left[ K_i^{-1} \cdot (u,v,1)^\top \cdot d_\text{ground}(u,v) \right]_{xyz}$$
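The back-projection above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that $K_i$ is a 3×3 pinhole intrinsics matrix, that $T_i$ is a 4×4 camera-to-world transform, and that the predicted depth is measured along the camera z-axis. The function name `backproject_pixel` and all numeric values are hypothetical.

```python
import numpy as np

def backproject_pixel(u, v, depth, K, T_cam_to_world):
    """Lift pixel (u, v) with a z-depth to a 3D point in world coordinates."""
    pixel_h = np.array([u, v, 1.0])                  # homogeneous pixel coordinate
    point_cam = (np.linalg.inv(K) @ pixel_h) * depth  # point in the camera frame
    point_cam_h = np.append(point_cam, 1.0)          # homogeneous 3D point
    point_world_h = T_cam_to_world @ point_cam_h     # move into the world frame
    return point_world_h[:3]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical pose: no rotation, camera centre at (10, 0, 2) in the world frame.
T = np.eye(4)
T[:3, 3] = [10.0, 0.0, 2.0]

centre_point = backproject_pixel(320.0, 240.0, 5.0, K, T)
offset_point = backproject_pixel(420.0, 240.0, 5.0, K, T)
```

With these values the principal-point pixel lands 5 m in front of the camera centre along the optical axis, and the pixel 100 px to the right lands 1 m to the side of it, which is a quick sanity check on the intrinsics.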

For satellite pixels, an orthographic mapping positions each splat in a meter-accurate grid, with heights inferred by a dedicated head $h_\text{sat}(u,v)$. Both sets are rendered via depth-sorted alpha-blend compositing, ensuring radiometric coherence (Turkulainen et al., 19 May 2026). This fused representation enables synthesis of novel views from arbitrary perspectives, with the 3D splat set always defined in a single, georeferenced world coordinate system.
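As a rough sketch of the two steps just described, the following NumPy code places a satellite pixel on a metric grid and composites depth-sorted contributions along a single ray. It is a simplification under stated assumptions: the tile is assumed to be north-up with `v` increasing southward, `origin_xy` and `metres_per_pixel` are hypothetical tile metadata, and each splat's contribution to the ray is reduced to a single alpha value (a full 3D Gaussian splatting renderer also evaluates the projected Gaussian footprint per pixel).

```python
import numpy as np

def sat_pixel_to_world(u, v, height, origin_xy, metres_per_pixel):
    """Orthographic mapping of tile pixel (u, v) to a world point (x, y, z).

    Assumes a north-up tile whose top-left corner sits at origin_xy.
    """
    x = origin_xy[0] + u * metres_per_pixel
    y = origin_xy[1] - v * metres_per_pixel   # image rows grow southward
    return np.array([x, y, height])

def composite_front_to_back(depths, alphas, colors):
    """Alpha-blend contributions along one ray, nearest first."""
    depths = np.asarray(depths, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    transmittance = 1.0                       # light not yet absorbed
    pixel = np.zeros(colors.shape[1])
    for i in np.argsort(depths):
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
    return pixel

# Hypothetical tile: 0.5 m per pixel, top-left corner at (100, 200) m.
splat_position = sat_pixel_to_world(10, 20, 3.5, (100.0, 200.0), 0.5)

# Two contributions on one ray: a blue splat at depth 2 and a red splat at depth 1.
blended = composite_front_to_back(
    depths=[2.0, 1.0],
    alphas=[0.5, 0.5],
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
)
```

Because the sort is by depth rather than by source, ground and satellite splats can be mixed freely in the same ray once both live in the same world frame.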

3. Network Architecture and Cross-View Feature Alignment

At the architectural level, Cross-View Splatter deploys dual transformer-based branches, one per modality, using a backbone such as VGGT that is capable of ingesting DINOv2 embeddings. The ground branch regresses per-pixel depth, camera intrinsics, and splat parameters, while the satellite branch predicts per-pixel heights and analogous splat attributes. A crucial module, Attn_meta, injects up to $L=12$ layers of bidirectional cross-attention between ground and satellite tokens, forcing explicit alignment between ground-level structure and the global geometric prior from BEV imagery. This cross-attention is empirically critical; disabling it reduces PSNR by up to 1 dB and degrades scene coverage, especially in low-overlap scenarios (Turkulainen et al., 19 May 2026). During fusion, the outputs of the two branches are merged to form a unified Gaussian set $G = G_\text{ground} \cup G_\text{satellite}$.
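To make the idea of bidirectional cross-attention concrete, here is a single-head NumPy sketch in which ground tokens attend to satellite tokens and vice versa. It is illustrative only and does not reproduce Attn_meta: the residual connection, the symmetric update from pre-update tokens, the absence of layer normalization and multiple heads, and every name and dimension below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, Wq, Wk, Wv):
    """Update `queries` tokens with information gathered from `context` tokens."""
    q = queries @ Wq
    k = context @ Wk
    v = context @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_queries, n_context)
    return queries + weights @ v                      # residual update

def bidirectional_block(ground, sat, params_ground, params_sat):
    """One layer: each modality attends to the other's pre-update tokens."""
    ground_out = cross_attend(ground, sat, *params_ground)
    sat_out = cross_attend(sat, ground, *params_sat)
    return ground_out, sat_out

rng = np.random.default_rng(0)
dim = 8
ground_tokens = rng.normal(size=(5, dim))   # 5 hypothetical ground tokens
sat_tokens = rng.normal(size=(7, dim))      # 7 hypothetical satellite tokens
params_g = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
params_s = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]

ground_out, sat_out = bidirectional_block(ground_tokens, sat_tokens, params_g, params_s)
```

Stacking several such blocks lets each modality refine its tokens repeatedly using the other. The token counts of the two branches do not need to match, since attention only requires a shared feature dimension.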

4. Training Data, Losses, and Optimization

Cross-View Splatter is trained on curated georeferenced datasets containing both ground imagery (with GPS and heading data) and paired satellite tiles (with known geographic region and resolution). Key data sources include Metropolis (driving + street-view), VIGOR (panorama cutouts + satellite), MapFree, VKITTI2, and DL3DV. The loss function is a weighted sum incorporating:

  • Camera pose and intrinsic regression losses ($L_\text{cam}$)
  • Ground depth and satellite height prediction losses
  • Depth consistency between rendered Gaussians and predicted depths
  • RGB fidelity for both ground images and re-rendered satellite perspectives
  • Joint ground-satellite rendering consistency
  • BEV reprojection loss
  • Sky regularization
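The weighted sum over these terms can be sketched as a small helper. The term names, the loss values, and the weights below are all placeholders chosen for illustration; the source does not give the actual weights.

```python
def total_loss(terms, weights):
    """Combine named loss terms into one scalar using per-term weights."""
    unknown = set(terms) - set(weights)
    if unknown:
        raise KeyError(f"no weight configured for: {sorted(unknown)}")
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical names mirroring the list above; every weight is a placeholder.
WEIGHTS = {
    "cam": 1.0,
    "depth_ground": 1.0,
    "height_sat": 1.0,
    "depth_consistency": 0.5,
    "rgb_ground": 1.0,
    "rgb_sat": 1.0,
    "rgb_joint": 1.0,
    "bev_reprojection": 0.5,
    "sky": 0.25,
}

# Placeholder per-term values, as a training step might produce them.
example_terms = {
    "cam": 0.5,
    "depth_consistency": 2.0,
    "sky": 4.0,
}

loss = total_loss(example_terms, WEIGHTS)
```

Raising on an unweighted term, rather than silently skipping it, makes it harder to drop a loss from training by mistyping its name.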

Optimization employs AdamW (weight decay 0.05) with a batch size of 10 on A100 GPUs, and model weights are typically initialized from a pretrained AnySplat. Most ground-branch transformer layers are frozen; only cross-attention and output heads are fine-tuned (Turkulainen et al., 19 May 2026).
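The selective fine-tuning scheme can be illustrated with a toy AdamW step in NumPy that skips frozen parameters. This is a sketch under assumptions rather than the training code: the parameter names and the `is_trainable` rule are hypothetical, the learning rate is a placeholder rather than the paper's value, and only the weight decay of 0.05 is taken from the text.

```python
import numpy as np

def is_trainable(name):
    # Hypothetical naming scheme: only cross-attention layers and output heads train.
    return "cross_attn" in name or name.startswith("head.")

def adamw_step(param, grad, m, v, t, lr, weight_decay=0.05,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update with decoupled weight decay; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param * (1 - lr * weight_decay) - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

params = {
    "ground.block0.attn": np.array([1.0]),    # frozen backbone layer
    "fusion.cross_attn0": np.array([1.0]),    # fine-tuned
    "head.depth": np.array([1.0]),            # fine-tuned
}
grads = {name: np.array([2.0]) for name in params}
state = {name: (np.zeros(1), np.zeros(1)) for name in params}
LR = 0.1  # placeholder, not the value used in the paper

for name in params:
    if not is_trainable(name):
        continue                              # frozen parameters are never updated
    m, v = state[name]
    params[name], m, v = adamw_step(params[name], grads[name], m, v, t=1, lr=LR)
    state[name] = (m, v)
```

In a deep-learning framework the same effect is usually achieved by disabling gradients on the frozen layers and passing only the remaining parameters to the optimizer.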

5. Quantitative Evaluation and Benchmarking

Evaluation is conducted on geolocalized benchmarks, such as Tanks & Temples (10 scenes) and DL3DV-Benchmark (40 scenes), using sparse-view splits (1–3 context views). Reference baselines include Splatfacto, MVSplat/DepthSplat, AnySplat, NoPoSplat, and Sat2Density. In the most challenging 1-view regime on Tanks & Temples, Cross-View Splatter achieves a PSNR of 11.33 (SSIM=0.274, LPIPS=0.631), the highest PSNR among the compared methods, although AnySplat retains a higher SSIM (0.357) and the ground-only variant a slightly lower LPIPS (0.619). The relative gain is largest when ground-image overlap is lowest (up to +3 dB PSNR for IoU < 0.15). Results for Tanks & Temples with 1 context view:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Splatfacto | 8.24 | 0.279 | 0.688 |
| AnySplat | 7.48 | 0.357 | 0.648 |
| Ours (ground) | 9.00 | 0.259 | 0.619 |
| Ours (sat only) | 8.53 | 0.339 | 0.705 |
| Ours (combined) | 11.33 | 0.274 | 0.631 |
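For readers unfamiliar with the headline metric, the following sketch shows how PSNR is conventionally computed and how the PSNR gaps in the table translate into decibel differences. It assumes images scaled to [0, 1]; the evaluation protocol actually used (image range, cropping, masking) is not specified here, and the helper names are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in the range [0, max_val]."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")                   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR definition to recover the root-mean-square error."""
    return max_val * 10.0 ** (-psnr_db / 20.0)

# Synthetic check: a constant error of 0.1 on a [0, 1] image gives 20 dB.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
check_db = psnr(pred, target)

# PSNR values copied from the table above (Tanks & Temples, 1 context view).
table_psnr = {"Ours (ground)": 9.00, "Ours (sat only)": 8.53, "Ours (combined)": 11.33}
gain_over_ground = table_psnr["Ours (combined)"] - table_psnr["Ours (ground)"]
combined_rmse = rmse_from_psnr(table_psnr["Ours (combined)"])
```

The combined model's margin over the ground-only variant is 2.33 dB. Under the [0, 1] assumption, 11.33 dB corresponds to an RMSE of about 0.27, which underlines how hard the 1-view regime is in absolute terms even for the best method in the table.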

Qualitatively, the method recovers rooftops and terrain missed by ground-only 3DGS and handles large-baseline extrapolation across perspectives with fewer artifacts (Turkulainen et al., 19 May 2026).

6. Ablation Studies and Architectural Insights

Ablation studies demonstrate the necessity of cross-attention (Attn_meta) and the satellite branch. Removing Attn_meta decreases PSNR by about 1 dB and reduces the ability to propagate geometric priors from BEV to street-level detail. Loss-term ablations show that joint color reconstruction, Gaussian-to-depth consistency, and sky regularization each provide measurable improvements, with the full model yielding the highest PSNR (18.63) on the Metropolis driving benchmark. The greatest impact is observed in the smallest spatial overlap bins, where fused ground+satellite supervision is most valuable (Turkulainen et al., 19 May 2026).

7. Limitations, Open Problems, and Future Directions

Current limitations include reliance on accurate GPS and heading data, variable satellite tile quality (resolution, seasonality, lighting), and inability to hallucinate structure unseen by both modalities (e.g., interiors, deep recesses). Further, Cross-View Splatter is only applicable to outdoor, overhang-sparse scenes. Ongoing research aims at integrating multi-temporal/multi-spectral satellite data, joint optimization with per-scene refinement, generative completion of unseen regions (e.g., via diffusion models), extension to additional priors such as UAV or LIDAR, and learned georeferencing layers to mitigate GPS noise. Addressing scene dynamics and integrating with object-based or temporal registration remains open (Turkulainen et al., 19 May 2026).


Cross-View Splatter thus marks a significant advance at the intersection of feed-forward 3D Gaussian splatting, georeferenced remote sensing, and novel-view synthesis, by leveraging network-level cross-attention between dramatically different camera modalities to achieve robust scene reconstruction and high-fidelity extrapolation in data-sparse outdoor environments.

References (1)
