RadarSFD: Single-Frame Radar-to-LiDAR Reconstruction

Updated 29 September 2025
  • RadarSFD is a conditional latent diffusion framework that reconstructs LiDAR-like point clouds from a single mmWave radar frame without motion or synthetic aperture, enabling dense 3D perception.
  • The method integrates a frozen VAE, a pretrained diffusion backbone, and channel-wise latent conditioning to preserve spatial alignment and enhance geometric consistency.
  • It achieves state-of-the-art performance with significant reductions in Chamfer and Modified Hausdorff Distances, demonstrating robust generalization across diverse scenes.

RadarSFD refers to a conditional latent diffusion framework for reconstructing dense LiDAR-like point clouds from a single frame of mmWave radar, without the need for motion or synthetic aperture operation. It was developed to enable dense, high-resolution perception using compact, low-cost automotive or robotic radar sensors in environments where multi-frame aggregation or platform motion is unavailable or impractical. RadarSFD leverages pretrained geometric priors, direct latent-space conditioning, and a dual-space loss to achieve state-of-the-art fidelity on radar-to-point-cloud reconstruction tasks, including strong generalization across diverse scenes (Zhao et al., 22 Sep 2025).

1. Motivation and Problem Formulation

Traditional radar imaging systems—especially those based on mmWave sensors—are attractive in SWaP (size, weight, and power)-constrained platforms due to their robustness in fog, smoke, dust, and low-light conditions. However, spatial resolution is limited by the size of the physical aperture. Many previous works resort to synthetic aperture radar (SAR) or multi-frame aggregation techniques to improve resolution, which either require deliberate sensor motion or the accumulation of multiple frames. These approaches are prohibitive for application domains such as aerial inspection, wearable robotics, or any scenario with limited temporal context.

RadarSFD directly addresses this need by providing a single-frame, no-SAR solution that reconstructs geometrically accurate, dense point clouds (comparable to LiDAR outputs) from a single static mmWave radar observation. The framework replaces classical processing with a generative reconstruction pipeline that is tractable, data-efficient, and robust to the spatial and stochastic sparsity of radar returns.

2. Architectural Principles and Conditioning Mechanisms

RadarSFD leverages a conditional latent diffusion architecture composed of three principal modules:

  1. Frozen VAE (TAESD): Both the radar bird’s-eye view (BEV)—lightly preprocessed via thresholding—and the reference LiDAR BEV are independently encoded by a Variational Autoencoder (Tiny AutoEncoder for Stable Diffusion), resulting in a compressed latent representation (e.g., 8× spatial downsampling). This shared latent space enforces global geometric consistency and enables efficient training.
  2. Pretrained Diffusion Backbone: The denoising U-Net backbone is initialized from the Marigold monocular depth estimator. By incorporating these geometric priors, the network is specifically biased toward the recovery of depth and structural cues that are otherwise ambiguous in raw radar observations. The transfer learning improves data efficiency and enables generalization.
  3. Explicit Latent Conditioning: The main conditioning mechanism is channel-wise latent concatenation: radar BEV latents (c) are concatenated with the noisy LiDAR latent zₜ along the channel dimension. This preserves spatial alignment and anchors the diffusion model to the observed radar geometry. The rationale for channel-wise concatenation over alternatives such as cross-attention is that explicit fusion retains high-frequency spatial information and directly couples generative capacity with the sensor input, yielding sharper and structurally correct reconstructions (a minimal sketch of this fusion follows the list).
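The fusion step can be illustrated with a short PyTorch sketch. This is not the authors' implementation: `vae_encoder` stands in for the frozen TAESD encoder, `unet` for the Marigold-initialized denoising U-Net, and the call signatures are assumptions.

```python
import torch
import torch.nn as nn

class RadarConditionedDenoiser(nn.Module):
    """Minimal sketch of channel-wise latent conditioning (not the authors' code)."""

    def __init__(self, vae_encoder: nn.Module, unet: nn.Module):
        super().__init__()
        self.vae_encoder = vae_encoder.eval()      # frozen VAE: no gradient updates
        for p in self.vae_encoder.parameters():
            p.requires_grad_(False)
        self.unet = unet                           # predicts noise eps_theta(z_t, t, c)

    def forward(self, noisy_lidar_latent, t, radar_bev):
        # Encode the (lightly thresholded) radar BEV into the shared latent space.
        with torch.no_grad():
            radar_latent = self.vae_encoder(radar_bev)            # (B, C, H/8, W/8)
        # Explicit fusion: concatenate the radar latent with the noisy LiDAR latent
        # along the channel dimension, preserving spatial alignment.
        unet_in = torch.cat([noisy_lidar_latent, radar_latent], dim=1)
        return self.unet(unet_in, t)               # predicted noise, same shape as z_t
```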

3. Dual-Space Objective and Training Strategy

RadarSFD optimizes a dual-space loss, which explicitly regularizes both the generative diffusion process and the structural accuracy of the output:

  • Latent Space Loss:

The diffusion model is trained to predict the Gaussian noise ε added to the LiDAR latent during the forward process,

\mathcal{L}_{\mathrm{LDM}}(\theta) = \mathbb{E}_{t, z_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|^2 \right]

where $z_t$ is the noisy latent at diffusion timestep $t$, conditioned on the radar latent $c$.

  • Pixel Space Loss:

After decoding the denoised latent to a LiDAR BEV, standard reconstruction metrics are enforced,

\mathcal{L}_p = \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{\mathrm{SSIM}}\,\mathcal{L}_{\mathrm{SSIM}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}}

with

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{LDM}} + \lambda_p\,\mathcal{L}_p

This dual objective ensures that the output not only lies in the learned data distribution (via diffusion denoising) but also matches the ground truth at the pixel and perceptual level.
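The combined objective can be summarized in a short training-loss sketch, assuming PyTorch and externally supplied SSIM/LPIPS loss callables (e.g., from torchmetrics or the lpips package); the weights `lam_*` are illustrative placeholders rather than the paper's values.

```python
import torch.nn.functional as F

def dual_space_loss(eps, eps_pred, decoded_bev, target_bev,
                    ssim_loss, lpips_loss,
                    lam_l1=1.0, lam_ssim=1.0, lam_lpips=1.0, lam_p=1.0):
    """Sketch of the dual-space objective L_total = L_LDM + lam_p * L_p."""
    # Latent-space diffusion loss: MSE between true and predicted noise.
    l_ldm = F.mse_loss(eps_pred, eps)
    # Pixel-space loss on the decoded LiDAR BEV: L1 + SSIM + LPIPS terms.
    l_pixel = (lam_l1 * F.l1_loss(decoded_bev, target_bev)
               + lam_ssim * ssim_loss(decoded_bev, target_bev)
               + lam_lpips * lpips_loss(decoded_bev, target_bev))
    return l_ldm + lam_p * l_pixel
```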

4. Quantitative and Qualitative Performance

RadarSFD is benchmarked on the RadarHD dataset—a radar-to-LiDAR translation benchmark—using Chamfer Distance (CD) and Modified Hausdorff Distance (MHD):

Model                   Frames   Chamfer Distance (cm)   MHD (cm)
RadarHD baseline          1               56                45
RadarSFD                  1               35                28
RadarHD (multi-frame)    41               33                24

RadarSFD achieves a ~37.5% reduction in CD and a ~37.8% reduction in MHD over the single-frame baseline, and approaches the performance of (or modestly outperforms) multi-frame aggregation methods, despite using only one input frame.
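For reference, both evaluation metrics can be computed with a standard nearest-neighbor formulation; the sketch below is illustrative and may differ in convention (e.g., sum versus average of the two directed terms) from the benchmark's official evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_mhd(pred_pts: np.ndarray, gt_pts: np.ndarray):
    """Illustrative Chamfer Distance and Modified Hausdorff Distance
    between two (N, 2) or (N, 3) point sets."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)    # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)    # nearest prediction per GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()  # symmetric mean-of-nearest distances
    mhd = max(d_pred_to_gt.mean(), d_gt_to_pred.mean())  # Modified Hausdorff (Dubuisson-Jain)
    return chamfer, mhd
```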

Qualitatively, reconstructions exhibit recovery of fine wall boundaries and preservation of narrow gaps that are lost to blurring or sparsity in classical or single-frame baselines. Visual comparisons demonstrate that RadarSFD’s outputs align more closely with LiDAR ground truth, with fewer hallucinated or oversmoothed regions.

5. Generalization and Ablation Insights

Experiments on environments unseen during training confirm that RadarSFD’s reliance on strong geometric priors (Marigold) and explicit radar-latent conditioning supports cross-scene generalization. Key ablation results include:

  • Pretrained Initialization: Removing pretrained weights degrades performance by up to 3× in both CD and MHD, demonstrating the necessity of geometric priors.
  • Conditioning Method: Channel-wise latent concatenation outperforms cross-attention mechanisms in both sharpness and geometric fidelity.
  • Loss Design: Eliminating pixel-space supervision (latent-only loss) produces structurally inaccurate outputs; including L1, SSIM, and perceptual (LPIPS) losses is crucial for detailed reconstruction.
  • Radar Input Representation: A lightly thresholded BEV preserves the essential sidelobe/multipath structure needed for accurate geometry, whereas zero-threshold or raw I/Q input performs worse (see the preprocessing sketch after this list).
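A plausible form of the light thresholding step is sketched below; the percentile cutoff and normalization are assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def threshold_radar_bev(bev: np.ndarray, percentile: float = 90.0) -> np.ndarray:
    """Hypothetical light thresholding of a radar BEV intensity map.

    The percentile value is illustrative; the intent is to suppress the noise
    floor while keeping weak sidelobe/multipath returns that carry geometric cues.
    """
    floor = np.percentile(bev, percentile)
    out = np.where(bev >= floor, bev, 0.0)
    # Normalize to [0, 1] so the encoder sees a consistent input range.
    peak = out.max()
    return out / peak if peak > 0 else out
```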

6. Applications and Impact

RadarSFD's single-frame reconstruction paradigm enables dense scene perception for SWaP-constrained and dynamic robotic platforms where multi-frame or SAR aggregation is infeasible. This includes aerial drones, ground robots, inspection systems, and wearable robotics operating in visibility-reduced environments (fog, dust, smoke, night). The approach significantly expands the operational envelope for dense mmWave radar perception.

By leveraging pretrained monocular priors, explicit conditioning, and dual-space regularization, RadarSFD sets a precedent for radar-to-LiDAR translation in compact, single-frame pipelines. Its strong quantitative and qualitative performance, confirmed by generalization experiments, makes it a compelling solution for next-generation autonomous perception where radar is the only viable or robust environmental sensor (Zhao et al., 22 Sep 2025).
