
Cross-Sensor View Synthesis

Updated 3 March 2026
  • Cross-sensor view synthesis generates data in a target sensor modality from observations captured by a different source modality, addressing geometric discrepancies, calibration gaps, and modality-specific texture differences.
  • Recent models leverage dual-branch diffusion architectures, unified 3D feature spaces, and hybrid GAN-diffusion cascades to align geometry and appearance across heterogeneous sensors.
  • Evaluations on benchmarks such as Para-Lane and CVUSA demonstrate gains in metrics like PSNR, SSIM, and FID, though challenges remain with dynamic scenes and low-texture, noisy sensor inputs.

Cross-sensor view synthesis refers to the generation of images, videos, or structured data in one sensor modality or viewpoint from data captured in another, often fundamentally different, sensor modality or viewpoint. It generalizes classical novel view synthesis (NVS) by operating across distinct sensing domains: for example, synthesizing a street-level LiDAR scan from camera imagery, or producing a thermal view from RGB. This task introduces compounded challenges in geometry, radiometry, and calibration due to heterogeneous sensor characteristics, varying spatial resolutions, differing fields of view, and the lack of shared coordinate systems or pixel correspondences. Advances in this domain directly impact autonomous driving, multi-modal mapping, remote sensing, and machine perception by enabling multi-sensor simulation, robust sensor fusion, and vision under adverse conditions.
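
To make the inputs and outputs concrete, here is a minimal interface sketch of the task as just described: source-modality observations with optional calibration in, target-modality data out. All names (`SourceObservation`, `synthesize`) are hypothetical and purely illustrative; none of the cited papers defines this API.

```python
# Illustrative sketch only: a minimal interface for cross-sensor view synthesis.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SourceObservation:
    data: np.ndarray                  # e.g. RGB frames (H, W, 3) or a LiDAR range image
    modality: str                     # "rgb", "lidar", "thermal", "sar", ...
    intrinsics: Optional[np.ndarray]  # 3x3 K matrix; often unavailable in practice
    pose: Optional[np.ndarray]        # 4x4 world-from-sensor transform; optional prior

def synthesize(source: SourceObservation,
               target_modality: str,
               target_pose: Optional[np.ndarray] = None) -> np.ndarray:
    """Generate target-modality data (e.g. a thermal view from RGB input).

    Calibration and pose priors are optional arguments, reflecting the
    calibration-free settings discussed in the text; the body is
    model-specific (diffusion, GAN, 3D Gaussian Splatting, ...).
    """
    raise NotImplementedError
```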

1. Cross-Sensor View Synthesis: Problem Setting and Challenges

Cross-sensor view synthesis encompasses tasks where the goal is to generate data of a target modality (such as LiDAR, SAR, thermal, or street-view RGB) from observations in a source modality that may have fundamentally different spatial, spectral, or geometric characteristics. Canonical settings include RGB-to-LiDAR, satellite-to-streetview, and RGB-to-thermal synthesis. Problem inputs generally consist of one or more images or data frames from the source modality and optional—or, in many practical scenarios, unavailable—auxiliary information such as calibration, depth maps, or pose priors (Xie et al., 2024, Berian et al., 16 Jan 2025, Wu et al., 27 Feb 2026).

The central challenges in cross-sensor view synthesis include:

  • Heterogeneous Geometric Mapping: Different modalities capture complementary but non-identical geometric priors; e.g., LiDAR offers direct metric structure, while SAR and EO are fundamentally different projections (Berian et al., 16 Jan 2025).
  • Unknown or Incomplete Calibration: Sensors may lack shared intrinsic/extrinsic calibration, precluding simple re-projection (sketched after this list) and compounding pixel misalignment (Wu et al., 27 Feb 2026).
  • Modality Gap and Texture Ambiguity: Spectral bands (e.g., NIR, thermal) may lack textural or semantic structures present in the RGB domain (Wu et al., 27 Feb 2026, Berian et al., 16 Jan 2025).
  • Sparse, Noisy, or Partial Observations: Certain sensors (LiDAR, SAR) provide sparse or noisy returns, complicating direct volumetric fusion, and dynamic scenes can further disrupt static assumptions (Ni et al., 21 Feb 2025, Wu et al., 27 Feb 2026).
  • Scale and Realism: Achieving multi-sensor generalization and photo-realistic synthesis at city or dataset scale (autonomous driving, remote sensing) remains a bottleneck (Ni et al., 21 Feb 2025, Bajbaa et al., 29 Sep 2025).
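
To make the calibration point concrete, the sketch below shows the direct re-projection that shared calibration enables: LiDAR points mapped into a camera image via intrinsics K and an extrinsic transform. When these matrices are unknown, as in the calibration-free setting of (Wu et al., 27 Feb 2026), this mapping is unavailable and must be replaced by learned cross-modality matching. The helper is illustrative, not code from any cited paper.

```python
import numpy as np

def project_lidar_to_image(points_lidar: np.ndarray,  # (N, 3) metric XYZ in LiDAR frame
                           K: np.ndarray,             # (3, 3) camera intrinsics
                           T_cam_lidar: np.ndarray    # (4, 4) camera-from-LiDAR extrinsics
                           ) -> np.ndarray:
    """Return (N, 2) pixel coordinates; points behind the camera become NaN."""
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])   # (N, 4) homogeneous points
    cam = (T_cam_lidar @ homo.T).T[:, :3]               # (N, 3) in camera frame
    uv = (K @ cam.T).T                                  # perspective projection
    valid = uv[:, 2] > 1e-6                             # keep points in front of camera
    px = np.full((n, 2), np.nan)
    px[valid] = uv[valid, :2] / uv[valid, 2:3]          # divide by depth
    return px
```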

A comparison of principal modalities and associated challenges is summarized below:

| Source Modality | Target Modality | Geometric Alignment | Texture Gap |
|-----------------|-----------------|---------------------|-------------|
| RGB | Thermal | Perspective ≠ wavelength | High |
| EO (satellite) | SAR/LiDAR | Viewpoint + sensor differences | Moderate-High |
| Camera | LiDAR | 3D correspondence | Low |

2. Algorithmic Foundations and Model Architectures

Modern cross-sensor view synthesis leverages several distinct architectural paradigms:

  1. Dual-branch Diffusion/Generative Models: These models maintain parallel branches for different modalities, coupling their generation via cross-attention or mutual constraints so that both appearance and geometry stay aligned (a minimal coupling sketch follows this list). X-Drive uses two parallel latent diffusion branches, one for range (LiDAR) and one for multi-view images, incorporating cross-modality epipolar feature transforms at each UNet block (Xie et al., 2024). Likewise, "Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation" utilizes parallel diffusion U-Nets with cross-branch attention instillation, enforcing spatial correspondence (Kwak et al., 13 Jun 2025).
  2. Unified 3D Feature Spaces and Geometry-Aware Encoding: CrossModalityDiffusion encodes multi-modal inputs into a shared 3D geometry-aware feature volume, enabling the network to render modality-agnostic latent features from any target viewpoint. Volumetric rendering is then used to generate conditioning features for modality-specific diffusion decoders (Berian et al., 16 Jan 2025).
  3. Hybrid GAN-Diffusion Cascade: For satellite-to-street-level translation, hybrid pipelines combine diffusion models (delivering fine detail and semantic plausibility) with conditional GANs (addressing structural and viewpoint fidelity), often followed by explicit fusion modules (Bajbaa et al., 29 Sep 2025).
  4. Calibration- and Depth-Free Matching and Densification: Recent work has removed the need for explicit multi-sensor calibration by extracting sparse cross-modality matches (via transformer-based matchers), densifying via confidence-aware propagation networks, and reconstructing consistent scene geometry via modality-augmented 3D Gaussian Splatting (3DGS) (Wu et al., 27 Feb 2026).
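
As a rough illustration of the dual-branch coupling in paradigm 1, the PyTorch sketch below lets one modality branch attend to the other branch's latent tokens through a residual connection. This is a generic cross-attention block under assumed shapes and names, not X-Drive's actual implementation.

```python
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """One branch's tokens attend to the other branch's tokens (hypothetical)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_self: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # x_self:  (B, N, C) tokens of this branch (e.g. range-image latents)
        # x_other: (B, M, C) tokens of the other branch (e.g. camera latents)
        out, _ = self.attn(query=self.norm(x_self), key=x_other, value=x_other)
        return x_self + out  # residual coupling, applied at each UNet block

# Toy usage: couple LiDAR-range and image latents of matching channel width.
lidar_tokens = torch.randn(2, 256, 128)
image_tokens = torch.randn(2, 1024, 128)
coupled = CrossBranchAttention(dim=128)(lidar_tokens, image_tokens)
```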

3. Geometric and Cross-Modality Alignment Mechanisms

Ensuring geometric and modality alignment in cross-sensor synthesis is non-trivial:

  • Epipolar Geometry for Cross-Modality: X-Drive employs epipolar-feature sampling along geometric correspondences to transfer local features between LiDAR and camera views, without explicit depth estimation at inference (Xie et al., 2024); a simplified sampling sketch follows this list.
  • Cross-Modal Attention Instillation: In diffusion-based multi-task settings, one branch's attention maps (e.g., from the image branch) are injected into the parallel geometry branch, aligning the two branches' spatial focus for improved geometric consistency (Kwak et al., 13 Jun 2025).
  • Mesh and Point-Cloud Conditioning: Warped and meshed 3D point-clouds, possibly filtered by surface normals (for occlusion culling), are used to guide both image and geometry synthesis (Kwak et al., 13 Jun 2025).
  • Sparse-to-Dense Confidence Propagation: Calibration-free pipelines propagate high-confidence sparse matches via dynamic spatial propagation networks, with self-matching filters to enforce unimodal concentration in the densified maps (Wu et al., 27 Feb 2026).
  • Unified 3D Scene Representations: Cross-sensor Gaussian Splatting and NeRF-style methods enforce that the same 3D primitives explain both modalities, further supporting multi-view photometric consistency (Ni et al., 21 Feb 2025, Wu et al., 27 Feb 2026).
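
The following sketch illustrates the epipolar sampling idea from the first bullet under simplified assumptions: given a fundamental matrix F mapping view-A pixels to epipolar lines in view B, features are gathered along the line with nearest-neighbor indexing and no depth estimate. In practice the sampled features would be aggregated (e.g., via attention) and the degenerate epipole case handled; the helper and its conventions are hypothetical.

```python
import numpy as np

def sample_along_epipolar_line(feat_b: np.ndarray,  # (H, W, C) feature map of view B
                               F: np.ndarray,       # (3, 3) fundamental matrix, A -> B
                               x_a: tuple,          # (u, v) query pixel in view A
                               num_samples: int = 32) -> np.ndarray:
    H, W, _ = feat_b.shape
    u, v = x_a
    a, b, c = F @ np.array([u, v, 1.0])  # epipolar line: a*u' + b*v' + c = 0
    us = np.linspace(0, W - 1, num_samples)
    if abs(b) > 1e-8:
        vs = -(a * us + c) / b           # solve the line equation for v'
    else:                                # near-vertical line: fix u', sweep v'
        us = np.full(num_samples, np.clip(-c / a, 0, W - 1))
        vs = np.linspace(0, H - 1, num_samples)
    ui = np.clip(np.round(us).astype(int), 0, W - 1)
    vi = np.clip(np.round(vs).astype(int), 0, H - 1)
    return feat_b[vi, ui]                # (num_samples, C) features to aggregate
```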

4. Benchmark Datasets and Evaluation Protocols

Cross-sensor view synthesis research frequently leverages or introduces specialized benchmarks:

  • Para-Lane: Real-world multi-lane dataset for evaluating multi-sensor (RGB, multi-fisheye, LiDAR) NVS in driving; it includes precise camera-to-LiDAR pose registration and tracks generalization across lateral viewpoint shifts (Ni et al., 21 Feb 2025).
  • CVUSA: Standard for satellite–street-view cross-domain synthesis. Used to benchmark hybrid approaches integrating diffusion and GAN branches (Bajbaa et al., 29 Sep 2025).
  • RGB-X: Recent efforts produce unpaired RGB–thermal, RGB–NIR, RGB–SAR video streams, emphasizing calibration-free synthesis and diverse, uncalibrated sensing scenarios (Wu et al., 27 Feb 2026).
  • ShapeNet-based Synthetic Multimodal Data: Enables controlled evaluation for EO, SAR, and LiDAR via simulation, as in CrossModalityDiffusion (Berian et al., 16 Jan 2025).

Metrics span image-quality measures such as PSNR, SSIM, and FID; photometric alignment measures such as NID; and point-cloud fidelity metrics for geometric modalities such as LiDAR.
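
As a concrete reference, the snippet below computes PSNR and SSIM with scikit-image on a synthesized/reference pair (random placeholder arrays here); FID additionally requires a pretrained Inception network (e.g., torchmetrics' FrechetInceptionDistance) and is omitted for brevity.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = np.random.rand(256, 256, 3).astype(np.float32)  # synthesized view (placeholder)
gt = np.random.rand(256, 256, 3).astype(np.float32)    # reference view (placeholder)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```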

5. Experimental Findings and Comparative Performance

Key quantitative and qualitative findings include:

  • Dual-branch latent diffusion architectures with explicit cross-attention modules (epipolar or learned) outperform independent or decoupled generative models on cross-modality and geometric alignment as measured by both image and point cloud metrics (Xie et al., 2024, Kwak et al., 13 Jun 2025).
  • Confidence-aware densification and self-matching filtering significantly increase PSNR (by 0.5–4 dB) and improve structural quality versus baseline homography or translation-only approaches in calibration-free pipelines (Wu et al., 27 Feb 2026).
  • Hybrid GAN–diffusion frameworks deliver improved SSIM (+2.18%) and FID (–2.68%) over GAN-only approaches in satellite-to-street-view translation, with more accurate geometric layout and finer texture detail (Bajbaa et al., 29 Sep 2025).
  • Precise extrinsic calibration (as achieved in Para-Lane) reduces photometric misalignment (NID-Loss ~10.5→4.1), directly enhancing NVS training and cross-sensor fusion consistency (Ni et al., 21 Feb 2025).
  • Larger errors and artifacts persist with increased lateral extrapolation (“cross-lane” synthesis), particularly when unscanned scene regions are present or dynamic elements are not jointly modeled (Ni et al., 21 Feb 2025).

6. Limitations, Open Challenges, and Future Directions

Reported limitations and future priorities include:

  • Calibration and Scene Dynamics: Calibration-free methods struggle with dynamic scenes, as current 3D scene representations like Gaussian Splatting assume static geometry (Wu et al., 27 Feb 2026). Future extensions include layered or time-varying 3D models.
  • Noisy and Low-Texture Modalities: Cross-modal matching and feature propagation fail when the target modality is visually homogeneous or heavily corrupted by noise (as in low-resolution thermal), identifying a need for joint denoising and robust feature extraction (Wu et al., 27 Feb 2026).
  • Inference Efficiency: Dual-branch diffusion pipelines are computationally expensive due to tandem denoising steps; shared backbones or earlier fusion points may yield acceleration (Kwak et al., 13 Jun 2025).
  • Generalization and Benchmark Diversity: Realism and domain shift (e.g., synthetic → real, narrow → wide FOV) remain open bottlenecks. Expanding and diversifying datasets (e.g., Para-Lane’s progressive release) are essential for future advances (Ni et al., 21 Feb 2025).
  • Broader Sensor Modalities: Methodologies generalize to other sensor pairs (e.g., RGB–LiDAR, thermal–LiDAR) by substituting modality-specific UNets and cross-attention strategies, yet demonstration across physically disparate sensors remains limited (Kwak et al., 13 Jun 2025, Berian et al., 16 Jan 2025).

Future research directions emphasize integration of scene flow for dynamic environments, generalized cross-modal architectures for arbitrary sensor types, joint optimization of denoising and scene completion, and scalable, self-supervised approaches for large-scale, cross-domain NVS (Xie et al., 2024, Wu et al., 27 Feb 2026).
