REOcc: Radar-Camera Fusion for 3D Occupancy
- The paper introduces an innovative dual-stream architecture integrating a Radar Densifier and Amplifier to overcome radar sparsity and noise.
- It achieves a 3.31 pp improvement in overall mIoU and a 4.46 pp boost on dynamic object prediction on the Occ3D-nuScenes benchmark.
- It employs multi-modal deformable cross-attention to fuse enriched radar and camera features, enabling robust 3D occupancy mapping without LiDAR.
REOcc refers to “Camera-Radar Fusion with Radar Feature Enrichment for 3D Occupancy Prediction,” a framework designed to leverage radar and camera data to produce robust, LiDAR-free three-dimensional semantic occupancy maps in challenging driving environments. The methodology addresses the central limitation of radar—its sparsity and noise—by enriching radar features using two novel modules, enabling effective fusion with multi-view camera representations. REOcc empirically demonstrates significant gains in occupancy prediction, especially for dynamic object classes, on the Occ3D-nuScenes benchmark (Song et al., 10 Nov 2025).
1. Architectural Principles and Data Streams
REOcc structures the 3D occupancy prediction pipeline into dual input streams and dedicated enrichment modules for radar data. The system consumes synchronized multi-view camera images and raw radar point clouds at each timestep. Camera features are extracted through a 2D backbone (Swin-Transformer or ResNet-50) and aggregated across views, then transformed into a 3D volumetric tensor representing spatial and semantic evidence.
Radar points are voxelized into PointPillar-style pillars and encoded into a sparse 2D BEV feature map $R$, where most grid cells are empty due to the inherent sparsity of automotive radar. This initial radar representation undergoes a two-stage enrichment: the Radar Densifier propagates spatial information and mitigates sparsity, while the Radar Amplifier adaptively weights radar features to suppress noise.
The image and radar feature maps are then collapsed to BEV (Bird's Eye View) and fused using a multi-modal deformable cross-attention block; the fused features are re-projected into 3D volumetric space for per-voxel semantic occupancy prediction.
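The overall data flow can be summarized in a short PyTorch-style skeleton. Everything below — module names, constructor arguments, and tensor shapes — is an illustrative assumption about the pipeline just described, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class REOccPipelineSketch(nn.Module):
    """High-level composition of the REOcc data flow (illustrative only)."""

    def __init__(self, cam_encoder, pillar_encoder, densifier, amplifier,
                 bev_fusion, occ_head):
        super().__init__()
        self.cam_encoder = cam_encoder        # Swin-T / ResNet-50 + view aggregation -> 3D volume
        self.pillar_encoder = pillar_encoder  # PointPillars-style radar BEV encoder
        self.densifier = densifier            # Radar Densifier (Section 2)
        self.amplifier = amplifier            # Radar Amplifier (Section 3)
        self.bev_fusion = bev_fusion          # multi-modal deformable cross-attention (Section 4)
        self.occ_head = occ_head              # lightweight 3D convolutional head

    def forward(self, multi_view_images, radar_points):
        cam_vol = self.cam_encoder(multi_view_images)    # (B, C, Z, H, W) volumetric features
        radar_bev = self.pillar_encoder(radar_points)    # (B, C_r, H, W), mostly empty cells
        radar_bev = self.amplifier(self.densifier(radar_bev))  # enriched radar features

        B, C, Z, H, W = cam_vol.shape
        cam_bev = cam_vol.reshape(B, C * Z, H, W)        # collapse height into channels
        fused_bev = self.bev_fusion(cam_bev, radar_bev)  # fuse in BEV

        # Redistribute channels over height (assumes the fused channel count is a multiple of Z).
        fused_vol = fused_bev.reshape(B, -1, Z, H, W)
        fused_vol = torch.cat([fused_vol, cam_vol], dim=1)  # keep the original 3D image context
        return self.occ_head(fused_vol)                  # per-voxel class logits
```

The only structural commitments here are the ones stated above: radar enrichment happens before fusion, fusion happens in BEV, and the fused features are re-expanded over height and combined with the original image volume before the occupancy head.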
2. Radar Densifier: Spatial Propagation and Densification
The Radar Densifier module addresses radar's sparse spatial sampling and return jitter. For each grid cell (pillar $i$), densification proceeds by accumulating features from neighboring nonempty pillars within a fixed spatial window. Contributions are weighted via a Gaussian kernel whose bandwidth for neighbor $j$ is a learned function of its radar cross section (RCS):

$$R_{D,i} \;=\; \sum_{j \in \mathcal{N}(i)} \exp\!\left(-\,\frac{\lVert (x_i, y_i) - (x_j, y_j) \rVert^2}{2\,\sigma_j^2}\right) R_j$$

Here, $(x_i, y_i)$ and $(x_j, y_j)$ are BEV coordinates of pillars $i$ and $j$, and $\sigma_j$ adapts to RCS, allowing large or highly reflective returns to be propagated over broader spatial support, reflecting the physical properties of radar backscatter. This procedure yields a spatially continuous, context-sensitive radar feature map $R_D$.
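A minimal sketch of this step is given below, assuming a dense-tensor implementation, a fixed 5×5 window, a hypothetical RCS-to-bandwidth mapping, and an unnormalized weighted sum; the paper's learned bandwidth function and exact accumulation rule may differ.

```python
import torch
import torch.nn.functional as F

def densify_radar_bev(feats, mask, rcs, window=5, cell_size=0.8):
    """RCS-adaptive Gaussian densification over a BEV grid (illustrative sketch).

    feats: (B, C, H, W) pillar features, zero where empty
    mask:  (B, 1, H, W) 1.0 for nonempty pillars, 0.0 otherwise
    rcs:   (B, 1, H, W) radar cross section per pillar (0 where empty)
    """
    B, C, H, W = feats.shape
    k, pad = window, window // 2

    # Hypothetical RCS-to-bandwidth mapping; in the paper this is learned.
    sigma = 1.0 + 2.0 * torch.sigmoid(rcs)                        # (B, 1, H, W)

    # Squared BEV distance from each cell to every offset in its window.
    offs = torch.arange(k, dtype=feats.dtype, device=feats.device) - pad
    dy, dx = torch.meshgrid(offs, offs, indexing="ij")
    dist2 = ((dy ** 2 + dx ** 2) * cell_size ** 2).reshape(1, k * k, 1)

    # Gather neighbour features, occupancy, and bandwidths for every target cell.
    cols_f = F.unfold(feats * mask, k, padding=pad)               # (B, C*k*k, H*W)
    cols_m = F.unfold(mask, k, padding=pad)                       # (B, k*k, H*W)
    cols_s = F.unfold(sigma * mask, k, padding=pad)               # (B, k*k, H*W)

    # Gaussian weight per neighbour, with the bandwidth taken from that neighbour's RCS.
    w = torch.exp(-dist2 / (2 * cols_s.clamp(min=1e-3) ** 2)) * cols_m

    # Accumulate Gaussian-weighted contributions from nonempty neighbours.
    cols_f = cols_f.reshape(B, C, k * k, H * W)
    dense = (cols_f * w.unsqueeze(1)).sum(dim=2)                  # (B, C, H*W)
    return dense.reshape(B, C, H, W)
```

Normalizing by the accumulated kernel weights instead of taking the raw sum would be an equally plausible variant; the key point illustrated is that high-RCS neighbours contribute over a wider spatial support.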
3. Radar Amplifier: Channel-wise Confidence and Noise Suppression
After densification, radar features remain affected by false or weak returns. The Radar Amplifier refines $R_D$ by learning channel-wise probabilities that indicate how informative each feature channel is. This is realized via a small MLP (with softmax activation) that computes per-channel confidence scores $p$:

$$p = \mathrm{softmax}\big(\mathrm{MLP}(R_D)\big), \qquad R_A = \big[\, R_D \;\Vert\; p \odot R_D \,\big]$$

The operator $\Vert$ denotes concatenation, while $\odot$ denotes elementwise scaling. Informative channels (high $p$) are amplified, and noisy features are suppressed. The output $R_A$, with twice the channel dimension of $R_D$, thus fuses raw and confidence-weighted radar features for subsequent fusion.
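A minimal PyTorch sketch of this channel-wise gating is shown below; the MLP width and the spatial mean pooling used to form the MLP input are assumptions not specified in the text above.

```python
import torch
import torch.nn as nn

class RadarAmplifierSketch(nn.Module):
    """Channel-wise confidence weighting of densified radar features (illustrative)."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Small MLP producing one confidence logit per feature channel.
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, r_dense: torch.Tensor) -> torch.Tensor:
        # r_dense: (B, C, H, W) densified radar BEV features R_D
        B, C, _, _ = r_dense.shape
        # Spatial mean pooling to summarize each channel (an assumed design choice).
        ctx = r_dense.mean(dim=(2, 3))                     # (B, C)
        p = torch.softmax(self.mlp(ctx), dim=1)            # per-channel confidence p
        weighted = r_dense * p.view(B, C, 1, 1)            # elementwise scaling p ⊙ R_D
        # Concatenate raw and confidence-weighted features -> 2C channels (R_A).
        return torch.cat([r_dense, weighted], dim=1)
```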
4. Multi-modal Fusion and Occupancy Prediction Head
For fusion, the image volume is flattened along the height and channel dimensions to produce image BEV features. These are concatenated with the enriched radar features $R_A$ and fed through a multi-modal deformable cross-attention block, where keys and values are drawn from both modalities; deformable attention allows flexible cross-modal alignment.
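The sketch below illustrates the fusion interface under simplifying assumptions: standard multi-head attention stands in for the multi-modal deformable cross-attention block, and the projection layers and embedding size are invented for the example. What it preserves is the structure described above — queries from the concatenated BEV features, keys and values drawn from both modalities.

```python
import torch
import torch.nn as nn

class BEVFusionSketch(nn.Module):
    """Cross-modal BEV fusion (illustrative stand-in for deformable cross-attention)."""

    def __init__(self, img_dim: int, radar_dim: int, embed_dim: int = 256, heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(img_dim + radar_dim, embed_dim)
        self.kv_img = nn.Linear(img_dim, embed_dim)
        self.kv_rad = nn.Linear(radar_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, img_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # img_bev: (B, C_img, H, W) flattened image volume; radar_bev: (B, C_rad, H, W) = R_A
        B, _, H, W = img_bev.shape
        img_tok = img_bev.flatten(2).transpose(1, 2)       # (B, H*W, C_img)
        rad_tok = radar_bev.flatten(2).transpose(1, 2)     # (B, H*W, C_rad)

        # Queries from the concatenated multi-modal BEV map.
        q = self.q_proj(torch.cat([img_tok, rad_tok], dim=-1))               # (B, H*W, D)
        # Keys and values drawn from both modalities.
        kv = torch.cat([self.kv_img(img_tok), self.kv_rad(rad_tok)], dim=1)  # (B, 2*H*W, D)

        fused, _ = self.attn(q, kv, kv)                    # (B, H*W, D)
        return fused.transpose(1, 2).reshape(B, -1, H, W)  # back to a BEV map
```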
The output of the fusion block is re-projected into 3D by distributing channels over the height dimension and concatenated with the original 3D image tensor, ensuring that the fused representation preserves full spatial context. A lightweight 3D convolutional occupancy head then predicts per-voxel class probabilities. REOcc is trained using a standard per-voxel cross-entropy loss:

$$\mathcal{L} \;=\; -\sum_{v}\sum_{c} y_{v,c}\,\log \hat{p}_{v,c} \;+\; \mathcal{R}$$

where $y_{v,c}$ encodes the ground-truth label of voxel $v$, $\hat{p}_{v,c}$ the predicted class probabilities, and $\mathcal{R}$ is auxiliary regularization (e.g. weight decay). No LiDAR, distillation, or geometric priors are used.
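Read literally, the objective reduces to a cross-entropy averaged over voxels. A minimal sketch follows, with tensor shapes assumed for illustration; the auxiliary regularization term is left to the optimizer (e.g. weight decay) rather than the loss function.

```python
import torch
import torch.nn.functional as F

def occupancy_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-voxel cross-entropy for semantic occupancy prediction (illustrative).

    logits: (B, K, X, Y, Z) raw class scores from the occupancy head,
            K = 17 semantic classes + free space
    target: (B, X, Y, Z) integer ground-truth labels per voxel
    """
    B, K = logits.shape[:2]
    logits_flat = logits.reshape(B, K, -1)     # (B, K, N_voxels)
    target_flat = target.reshape(B, -1)        # (B, N_voxels)
    # Mean cross-entropy over every voxel in the grid.
    return F.cross_entropy(logits_flat, target_flat)
```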
5. Experimental Results and Benchmarking
On the Occ3D-nuScenes benchmark, REOcc demonstrates the following performance improvements (validation set):
- Overall mIoU (17 semantic classes + free space):
  - Camera-only BEVDet4D baseline (Swin-T backbone): 42.02%
  - REOcc (with Densifier and Amplifier): 45.33% (+3.31 pp absolute)
- Dynamic object subset (8 classes: bicycle, bus, car, construction vehicle, motorcycle, pedestrian, trailer, truck):
  - Baseline: 30.89% mIoU
  - REOcc: 35.35% (+4.46 pp absolute, +14.43% relative)
- Ablation (ResNet-50 backbone, overall / dynamic mIoU):
  - Fusion with neither module: 39.43 / 32.20
  - Densifier only: 40.68 / 34.19
  - Amplifier only: 40.70 / 33.92
  - Both modules: 41.80 / 35.35
Qualitative analysis shows REOcc recovering objects missed by camera-only networks (traffic cones, curbs). Under adverse conditions (night, rain), REOcc segments motorcycles and pedestrians not visible in the images and reconstructs overhanging signage. Feature-map visualizations show spatially continuous radar responses and channel activations concentrated on real objects.
6. Mitigation of Radar Sparsity and Noise
REOcc’s enrichment modules—distance-weighted densification and probability-driven amplification—directly address radar’s limitations:
- Densification redistributes radar evidence without hallucinating new returns, weighting contributions with a physically motivated, RCS-adaptive Gaussian kernel; position jitter and empty cells are compensated with evidence borrowed from neighboring pillars.
- Amplification further suppresses noisy channels (false/weak returns) via channel-wise confidence scores learned from context.
- The fusion stage effectively combines enriched radar spatial/velocity cues with high-resolution camera semantics using deformable cross-attention and re-projection.
Because REOcc relies on neither LiDAR nor external supervision, it remains independent of expensive sensors and robust to sensor dropouts.
7. Context and Future Implications
REOcc establishes a methodology for fully unlocking radar’s complementary role in semantic 3D occupancy prediction. Its radar feature enrichment pipeline yields robust improvements in adverse scenarios and for dynamic objects. The empirical results suggest that learned, sensor-specific enrichment is necessary for high-performing camera-radar fusion, particularly as automotive radar is poised for further densification and evolution.
A plausible implication is that as radar manufacturers increase point density and channel count, fusion methods modeled after REOcc will become central to all-weather 3D perception stacks. Extensions could include instance-level occupancy, temporal aggregation, and adaptation to ultra-long-range scenarios. The approach also avoids incremental complexity commonly found in LiDAR distillation and geometric supervision, potentially facilitating on-vehicle, real-time deployment.