
WRCFormer: 3D Object Detection Framework

Updated 2 January 2026
  • WRCFormer is a 3D object detection framework that integrates raw 4D radar data and RGB images for robust perception in autonomous driving.
  • It uses a wavelet-based multi-view architecture with a specialized Wavelet Attention Module and Geometry-guided Progressive Fusion to preserve key spectral-spatial features.
  • Experimental results on K-Radar benchmarks show state-of-the-art performance, with significant gains in mAP and real-time throughput under challenging conditions.

WRCFormer is a 3D object detection framework designed for robust multi-modal fusion of raw 4D radar tensors and camera images in autonomous driving and robotic perception. Capitalizing on the spectral–spatial properties of mmWave radar and the high semantic content of images, WRCFormer leverages a wavelet-based multi-view architecture with specialized attention and fusion modules to achieve information-preserving, real-time detection even under adverse weather. The framework introduces a Wavelet Attention Module (WAM) and a Geometry-guided Progressive Fusion (GPF) mechanism, establishing new state-of-the-art performance on the K-Radar benchmarks (Guan et al., 28 Dec 2025).

1. Problem Motivation and Modalities

4D mmWave radar generates dense tensors spanning range, azimuth, elevation, and Doppler dimensions, enabling accurate sensing of object velocity and position. However, direct point-cloud conversion (e.g., by CFAR and angle-of-arrival estimation) leads to significant loss of spatial and spectral detail, producing sparse, semantically poor radar representations. While raw-cube processing with 3D/4D CNNs or transformers preserves this information, the computational burden prohibits real-time deployment. Prior radar–camera fusion techniques either inherit radar’s sparsity through point clouds or incur excessive cost by operating on the unprocessed 4D radar cube, leaving a gap for efficient, information-rich fusion that maintains multi-scale structural cues (Guan et al., 28 Dec 2025).
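
To make concrete why CFAR-based point-cloud conversion discards so much of the tensor, the following minimal sketch (not taken from the paper; the guard/training cell counts and false-alarm rate are illustrative) applies cell-averaging CFAR to a synthetic 2D range–azimuth slice. Only cells exceeding an adaptive noise threshold survive, which is exactly the sparsification that raw-cube processing aims to avoid.

```python
# Minimal cell-averaging CFAR (CA-CFAR) sketch on a synthetic 2D range-azimuth
# slice. Guard/training cell counts and the false-alarm rate are illustrative,
# not values from the paper. Only cells above an adaptive noise threshold are
# kept, which is why point-cloud conversion discards most of the radar cube.
import numpy as np

def ca_cfar_2d(power_map, guard=2, train=8, pfa=1e-3):
    """Return a boolean detection mask for a 2D power map."""
    num_train = (2 * (guard + train) + 1) ** 2 - (2 * guard + 1) ** 2
    alpha = num_train * (pfa ** (-1.0 / num_train) - 1.0)   # CA-CFAR scaling factor
    pad = guard + train
    padded = np.pad(power_map, pad, mode="edge")
    detections = np.zeros_like(power_map, dtype=bool)
    for r in range(power_map.shape[0]):
        for a in range(power_map.shape[1]):
            window = padded[r:r + 2 * pad + 1, a:a + 2 * pad + 1]
            guard_region = window[train:train + 2 * guard + 1,
                                  train:train + 2 * guard + 1]
            noise = (window.sum() - guard_region.sum()) / num_train
            detections[r, a] = power_map[r, a] > alpha * noise
    return detections

# Synthetic 64x64 range-azimuth map with two strong point targets.
rng = np.random.default_rng(0)
ra_map = rng.exponential(scale=1.0, size=(64, 64))
ra_map[20, 30] += 40.0
ra_map[45, 10] += 25.0
mask = ca_cfar_2d(ra_map)
print("cells kept:", mask.sum(), "of", mask.size)   # only a handful survive
```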

2. WRCFormer Architecture Overview

WRCFormer accepts a monocular RGB image (512×512) and a pre-CFAR 4D radar tensor. Its architecture consists of:

  • Multi-view radar decomposition: The raw radar cube is split into Range–Azimuth (RA) and Elevation–Azimuth (EA) 2D projections. For each projected cell, four feature statistics are computed (maximum, median, and variance of amplitude, plus a Doppler statistic) to suppress noise and reduce dimensionality; a minimal sketch of this projection follows at the end of this section.
  • Parallel backbones:
    • Image: ResNet-101 followed by a Wavelet-Attention Feature Pyramid Network (WA-FPN)
    • Radar EA & RA: Separate ResNet-50 backbones, each followed by WA-FPN, extracting multi-scale spatial–frequency features
  • Fusion head: Geometry-guided Progressive Fusion, comprising:
    • Geometry-driven Semantic Alignment (GSA), aligning EA map features with image features using query-key cross-attention
    • Range-aware Geometric Refinement (RGR) using deformable cross-attention over both camera–EA fused and RA features, with uncertainty modeling and dynamic reference points
  • Detection head: Query-based iterative refinement module that predicts 3D object boxes and categories

This design preserves rich spectral–spatial radar information while enabling real-time inference on standard hardware (Guan et al., 28 Dec 2025).
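
The multi-view decomposition described above can be illustrated with a short sketch. This is not the released implementation: the cube layout (Doppler, Range, Azimuth, Elevation) and the exact Doppler statistic (here, the normalized Doppler bin of the strongest return per cell) are assumptions.

```python
# Sketch of the multi-view decomposition, not the released implementation.
# Assumed cube layout: (D, R, A, E) = (Doppler, Range, Azimuth, Elevation)
# holding amplitude values. The Doppler statistic here (normalised Doppler bin
# of the strongest return per cell) is an assumption.
import torch

def decompose_views(cube: torch.Tensor):
    """cube: (D, R, A, E) amplitudes -> RA map (4, R, A) and EA map (4, E, A)."""
    D, R, A, E = cube.shape

    def cell_stats(flat: torch.Tensor) -> torch.Tensor:
        # flat: (H, W, N) amplitudes collapsed over the reduced dimensions.
        return torch.stack(
            [flat.amax(-1), flat.median(-1).values, flat.var(-1)], dim=0
        )

    # Range-Azimuth view: collapse Doppler and Elevation.
    ra_amp = cube.permute(1, 2, 0, 3).reshape(R, A, D * E)
    ra_doppler = cube.amax(dim=3).argmax(dim=0).float() / max(D - 1, 1)     # (R, A)
    ra_map = torch.cat([cell_stats(ra_amp), ra_doppler[None]], dim=0)

    # Elevation-Azimuth view: collapse Doppler and Range.
    ea_amp = cube.permute(3, 2, 0, 1).reshape(E, A, D * R)
    ea_doppler = cube.amax(dim=1).argmax(dim=0).float().T / max(D - 1, 1)   # (E, A)
    ea_map = torch.cat([cell_stats(ea_amp), ea_doppler[None]], dim=0)
    return ra_map, ea_map

# Example with a small synthetic cube.
cube = torch.rand(16, 64, 48, 12)
ra_map, ea_map = decompose_views(cube)
print(ra_map.shape, ea_map.shape)   # torch.Size([4, 64, 48]) torch.Size([4, 12, 48])
```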

3. Wavelet Attention Module and Feature Pyramid Design

The Wavelet Attention Module (WAM) constitutes the core feature operator in WRCFormer. For each FPN level, given $X \in \mathbb{R}^{C \times H \times W}$, a 2D Haar Discrete Wavelet Transform produces a low-frequency approximation ($X_{LL}$) and multi-directional details ($X_{LH}$, $X_{HL}$, $X_{HH}$). These sub-bands are concatenated and processed via:

  • First-order branch: Convolution on concatenated sub-bands
  • Second-order branch: Further DWT and convolution, followed by inverse transformation (IWT)
  • Mixture-of-Experts (MoE): Fused features are distributed through a sparse MoE (using gating networks $g_i$ and experts $E_i$)
  • Residual reconstruction: Output is reconstructed by IWT and added back to the input

This hybrid spatial–frequency attention selectively amplifies return signals (low-frequency) and edge structures (high-frequency), while suppressing noise and clutter—critical for enhancing sparse radar and image features in downstream fusion (Guan et al., 28 Dec 2025).
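
A minimal sketch of the first-order WAM branch follows, assuming a hand-rolled 2D Haar DWT/IWT, a single mixing convolution over the concatenated sub-bands, and a toy top-1 sparse MoE. Channel widths, expert count, gating design, and the second-order branch are assumptions rather than the paper's configuration.

```python
# Illustrative sketch of the WAM's first-order branch: Haar DWT, a mixing
# convolution over the concatenated sub-bands, a toy top-1 sparse MoE, and
# residual reconstruction via the inverse transform. Channel widths, expert
# count, gating design, and the second-order branch are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """2D Haar DWT. x: (B, C, H, W) with even H, W -> (LL, LH, HL, HH)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_iwt(ll, lh, hl, hh):
    """Exact inverse of haar_dwt."""
    B, C, H, W = ll.shape
    out = ll.new_zeros(B, C, 2 * H, 2 * W)
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

class WaveletAttentionSketch(nn.Module):
    def __init__(self, channels: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.mix = nn.Conv2d(4 * channels, 4 * channels, 3, padding=1)
        self.experts = nn.ModuleList(
            nn.Conv2d(4 * channels, 4 * channels, 1) for _ in range(num_experts)
        )
        self.gate = nn.Linear(4 * channels, num_experts)
        self.top_k = top_k

    def forward(self, x):
        ll, lh, hl, hh = haar_dwt(x)
        sub = torch.cat([ll, lh, hl, hh], dim=1)            # (B, 4C, H/2, W/2)
        feat = self.mix(sub)                                 # first-order branch
        # Sparse MoE: route each sample to its top-k experts via a gating network.
        weights = F.softmax(self.gate(feat.mean(dim=(2, 3))), dim=-1)
        topv, topi = weights.topk(self.top_k, dim=-1)
        fused = torch.zeros_like(feat)
        for b in range(feat.shape[0]):
            for w, i in zip(topv[b], topi[b]):
                fused[b] += w * self.experts[int(i)](feat[b:b + 1])[0]
        ll2, lh2, hl2, hh2 = fused.chunk(4, dim=1)
        return x + haar_iwt(ll2, lh2, hl2, hh2)              # residual reconstruction

# Example: enhance a 64-channel feature map from one FPN level.
wam = WaveletAttentionSketch(channels=64)
print(wam(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```

Adding the reconstructed output back to the input keeps the module an enhancement operator over the original features, consistent with the residual reconstruction step listed above.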

The Wavelet-based FPN replaces traditional convolutional FPN connections with the WA-MoE sequence, reducing computational overhead and preserving information across scales.

4. Geometry-Guided Progressive Fusion

The GPF fusion head operates in two progressive stages:

  • Geometry-driven Semantic Alignment (GSA): EA features are converted to queries ($Q$) via grouped dilated convolution and positional encoding; image features form keys ($K$) and values ($V$). Coarse alignment uses sigmoid-based attention, producing fused queries $F^{GS}$ that integrate multi-modal geometry-aware semantics.
  • Range-aware Geometric Refinement (RGR): For each fused query, reference points are dynamically generated. Deformable, uncertainty-aware cross-attention operates in parallel across the camera–EA path and RA path, using query-adaptive sampling offsets and learned uncertainty weights ($u_{i,s,k}$). The resulting features are fused and processed through a feed-forward network for subsequent detection refinement.

These stages enable fine-grained cross-modal alignment and spatial reasoning, particularly effective in cluttered or weather-degraded scenes (Guan et al., 28 Dec 2025).
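
The GSA stage can be sketched as a sigmoid-gated cross-attention in which EA features supply queries and image features supply keys and values. Channel size, group count, the positional encoding, and the gate normalization below are assumptions; the RGR stage's deformable, uncertainty-aware attention is omitted for brevity.

```python
# Minimal GSA sketch: EA features become queries via a grouped dilated
# convolution plus a (toy) positional encoding; image features become keys and
# values; a sigmoid gate replaces softmax attention. Channel size, group count,
# the positional encoding, and the gate normalisation are assumptions.
import torch
import torch.nn as nn

class GSASketch(nn.Module):
    def __init__(self, dim: int = 256, groups: int = 8):
        super().__init__()
        self.to_query = nn.Conv2d(dim, dim, 3, padding=2, dilation=2, groups=groups)
        self.pos = nn.Parameter(torch.zeros(1, dim, 1, 1))   # placeholder positional encoding
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, ea_feat, img_feat):
        # ea_feat: (B, C, He, We) radar EA features; img_feat: (B, C, Hi, Wi) image features.
        q = (self.to_query(ea_feat) + self.pos).flatten(2).transpose(1, 2)   # (B, Nq, C)
        kv = img_feat.flatten(2).transpose(1, 2)                             # (B, Nk, C)
        k, v = self.to_key(kv), self.to_value(kv)
        gate = torch.sigmoid(q @ k.transpose(1, 2) * self.scale)             # sigmoid attention
        fused = q + (gate @ v) / gate.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return fused                                                         # fused queries F^GS

# Example: fuse a 16x16 EA map with a 32x32 image feature map.
gsa = GSASketch()
f_gs = gsa(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 32, 32))
print(f_gs.shape)   # torch.Size([2, 256, 256])
```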

5. Training Procedure and Losses

WRCFormer employs bipartite Hungarian matching to align $T$ predictions with $G$ ground-truth instances. The objective is a weighted sum of a classification loss ($L_{cls}$, either cross-entropy or focal) and a box regression loss ($L_{box}$, composed of $\ell_1$ and GIoU terms):

$L = \lambda_{cls} \cdot L_{cls} + \lambda_{box} \cdot L_{box}$

Training uses a fixed 512×512 image size and a batch size of 4 for 200 epochs, with the AdamW optimizer and cosine learning-rate decay, on an NVIDIA RTX 4090 (Guan et al., 28 Dec 2025).
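
A simplified sketch of this objective is given below: Hungarian matching assigns each ground-truth instance to one query, matched pairs are supervised with the weighted classification and box terms, and unmatched queries are pushed toward a "no object" class (an assumption of this sketch). It uses 2D axis-aligned boxes in place of the paper's 3D boxes, plain cross-entropy instead of focal loss, and placeholder weights $\lambda_{cls}$ and $\lambda_{box}$.

```python
# Simplified sketch of the matching-based objective: Hungarian assignment
# between T predictions and G ground-truth instances, then
# L = lambda_cls * L_cls + lambda_box * L_box with L_box = L1 + GIoU.
# Uses 2D axis-aligned boxes (the paper predicts 3D boxes), plain
# cross-entropy, a trailing "no object" class, and placeholder weights.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   lambda_cls=1.0, lambda_box=5.0):
    # pred_logits: (T, num_classes); pred_boxes: (T, 4) xyxy.
    # gt_labels: (G,); gt_boxes: (G, 4) xyxy.
    with torch.no_grad():                                     # matching costs only
        cost_cls = -pred_logits.softmax(-1)[:, gt_labels]                # (T, G)
        cost_box = (torch.cdist(pred_boxes, gt_boxes, p=1)
                    - generalized_box_iou(pred_boxes, gt_boxes))         # L1 + GIoU cost
        cost = lambda_cls * cost_cls + lambda_box * cost_box
        rows, cols = linear_sum_assignment(cost.cpu().numpy())           # Hungarian matching
        rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)

    # Classification: matched queries take their GT class, the rest "no object".
    num_classes = pred_logits.shape[1]
    targets = torch.full((pred_logits.shape[0],), num_classes - 1, dtype=torch.long)
    targets[rows] = gt_labels[cols]
    l_cls = F.cross_entropy(pred_logits, targets)

    # Box regression only on matched pairs: L1 plus GIoU loss.
    mp, mg = pred_boxes[rows], gt_boxes[cols]
    l_box = F.l1_loss(mp, mg) + (1 - torch.diag(generalized_box_iou(mp, mg))).mean()
    return lambda_cls * l_cls + lambda_box * l_box

# Example: 5 queries, 2 ground-truth objects, 3 object classes + "no object".
logits = torch.randn(5, 4)
pred_boxes = torch.rand(5, 2)
pred_boxes = torch.cat([pred_boxes, pred_boxes + 1.0], dim=-1)   # valid xyxy boxes
gt_boxes = torch.tensor([[0.2, 0.2, 1.2, 1.2], [0.8, 0.8, 1.9, 1.9]])
gt_labels = torch.tensor([0, 2])
print(float(detection_loss(logits, pred_boxes, gt_labels, gt_boxes)))
```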

6. Experimental Results and Comparative Analysis

WRCFormer achieves state-of-the-art results on K-Radar v1.0 and v2.0:

Model       Sensors   Sedan mAP   Bus/Truck mAP   FPS    VRAM (GB)
DPFT        C+R       50.5        33.6            10.8   2.6
ASF         C+R       52.7        36.2            —      —
WRCFormer   C+R       56.4        38.7            14.8   2.4

Notably, WRCFormer surpasses the best prior models by 2.4% total mAP_3D and 3.1% in sleet conditions, with a 37% increase in real-time throughput. Ablation studies confirm that the WA-MoE FPN provides higher accuracy (58.7 vs. 55.9 mAP_3D for DeformableConv v4) and lower computational cost (2.7 GFLOPs for WA-MoE vs. 60.4 for DeformableConv v4). The full GPF module yields the highest mAP_BEV and requires fewer parameters than comparable fusion heads (Guan et al., 28 Dec 2025).

7. Limitations and Future Directions

While WRCFormer preserves radar spectral–spatial structure and achieves real-time efficiency, it continues to rely on pre-CFAR feature selection on radar cubes, leaving the problem of end-to-end raw ADC-based learning unaddressed. The fixed Haar wavelet basis may not optimally capture nuanced frequency behaviors in varied environmental conditions. Anticipated avenues include temporal (multi-frame) radar fusion for dynamic scene modeling, extension to joint camera–LiDAR–radar wavelet-based fusion, and scenario-adaptive, learned wavelet bases for enhanced noise suppression under challenging weather (Guan et al., 28 Dec 2025).
