- The paper presents REFNet++ which fuses camera and radar data using a learnable variational encoder-decoder to enable end-to-end detection and segmentation in a BEV polar view.
- It achieves state-of-the-art performance with F1 scores over 93% in detection and mean IoU around 88% in segmentation, while significantly reducing computational overhead.
- The study validates that the sensor fusion approach outperforms single-modality methods and provides a scalable framework for multi-task perception in autonomous driving.
Multi-Task Efficient Fusion of Camera and Radar in BEV Polar Domain: A Review of REFNet++
Introduction
REFNet++ (2605.11824) addresses the problem of multi-modal sensor fusion for perception in autonomous driving, with a focus on fusing camera and raw radar data for both object detection and free space segmentation. The work introduces an architecture that aligns modalities in the bird's-eye polar view (BEV Polar), enabling end-to-end learning of both detection and segmentation in a resource-efficient manner. The approach is motivated by the complementary properties of cameras (high fidelity, vulnerable to adverse conditions) and millimeter-wave automotive radars (low resolution, robust to adverse weather). It overcomes the challenge of disparate sensor data geometries with a variational encoder-decoder pipeline.
Methodology
REFNet++ augments the prior REFNet architecture by extending to multitask operation and introducing a completely learnable camera-to-BEV transformation. The critical technical advancement is the implicit learning of the mapping from perspective camera images to BEV in the polar domain using a variational encoder-decoder. This replaces prior manual or pre-processed transformation pipelines, reducing computational overhead and annotation dependency.
Radar Stream
The radar branch consumes the complex range-Doppler (RD) spectrum, leveraging the rich low-level information (azimuth, range, Doppler) from multiple receiving antennas common in automotive radar. The network structure applies a multi-input multi-output (MIMO) pre-encoder and residual blocks inspired by FFTRadNet, extracting RA-aligned features without requiring 3D RAD cubes or expensive preprocessing.
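To make the idea of a complex RD input concrete, the sketch below shows one common way to hand a complex spectrum to a real-valued network: stack the real and imaginary parts of every receive antenna as separate channels. Whether REFNet++ prepares its input exactly this way is an assumption here; the function name `complex_to_channels` and the toy tensor sizes are hypothetical.

```python
def complex_to_channels(rd_spectrum):
    """Split a complex RD spectrum into real-valued channels.

    rd_spectrum[a][r][d] is the complex value for antenna a, range bin r,
    Doppler bin d. Returns 2 * A real planes: the real parts of all
    antennas first, then the imaginary parts (ordering is an assumption).
    """
    real_planes = [[[v.real for v in row] for row in plane] for plane in rd_spectrum]
    imag_planes = [[[v.imag for v in row] for row in plane] for plane in rd_spectrum]
    return real_planes + imag_planes


# Toy spectrum: 2 antennas, 2 range bins, 3 Doppler bins (hypothetical sizes).
toy = [[[complex(a + r, d) for d in range(3)] for r in range(2)] for a in range(2)]
channels = complex_to_channels(toy)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

The range and Doppler axes are left untouched, so any spatial processing that follows still sees the original RD layout.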
Camera Stream
The camera branch also employs a feature pyramid network followed by a variational encoder-decoder. A key feature is the use of the reparameterization trick to stochastically sample latents during training, promoting generalization and regularization. Feature alignment between the radar and camera branches is achieved through channel permutation and skip connections, allowing for concatenation in the fused BEV polar space.
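The reparameterization trick mentioned above can be illustrated with a minimal, framework-free sketch. It follows the standard variational autoencoder formulation, z = mu + sigma * eps with eps drawn from a unit Gaussian, and is not taken from the REFNet++ code. Using the mean at inference time and the closed-form KL regularizer are assumptions based on common VAE practice; the review does not state how the paper handles either.

```python
import math
import random


def reparameterize(mu, log_var, rng, training=True):
    """Sample z = mu + sigma * eps, with eps ~ N(0, 1).

    The randomness lives in eps, so z stays a differentiable function of
    mu and log_var. At inference the mean is returned (an assumption).
    """
    if not training:
        return list(mu)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]           # hypothetical encoder outputs
log_var = [0.0, -2.0, 1.0]
z_train = reparameterize(mu, log_var, rng)
z_eval = reparameterize(mu, log_var, rng, training=False)
kl = kl_to_standard_normal(mu, log_var)
```

In a real network `mu` and `log_var` would be feature maps produced by the encoder, and the same element-wise formula would be applied with the framework's tensor operations.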
Fusion and Heads
Both streams’ features are concatenated along channels after dimensional alignment. Detection and segmentation are handled by separate heads, each designed for its specific output:
- Detection: A classification branch outputs per-pixel vehicle likelihood, and a regression branch outputs object range and azimuth; the two branches are optimized with focal loss and Smooth L1 loss, respectively.
- Segmentation: Outputs a BEV free space mask, optimized via binary cross-entropy on the truncated road range.
The multi-task loss combines the detection and segmentation losses, enabling either single-task or joint multi-task operation.
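The loss terms named above can be sketched in plain Python as follows. The focal loss, Smooth L1, and binary cross-entropy formulas are the standard definitions; the hyperparameters (`alpha`, `gamma`, `beta`) and the task weights `w_reg` and `w_seg` are illustrative assumptions, since the review does not report the values or the weighting scheme used in the paper.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def bce(p, y):
    """Binary cross-entropy for one free-space probability p and label y."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(cls_pairs, reg_pairs, seg_pairs, w_reg=1.0, w_seg=1.0):
    """Weighted sum of mean detection and segmentation losses (weights assumed)."""
    det_cls = sum(focal_loss(p, y) for p, y in cls_pairs) / len(cls_pairs)
    det_reg = sum(smooth_l1(p, t) for p, t in reg_pairs) / len(reg_pairs)
    seg = sum(bce(p, y) for p, y in seg_pairs) / len(seg_pairs)
    return det_cls + w_reg * det_reg + w_seg * seg


# Toy (prediction, target) pairs; all values are made up for illustration.
total = multitask_loss(
    cls_pairs=[(0.9, 1), (0.2, 0)],
    reg_pairs=[(10.5, 10.0), (0.1, 0.4)],       # e.g. range and azimuth residuals
    seg_pairs=[(0.8, 1), (0.3, 0)],
)
```

Setting `w_seg=0.0`, or dropping the two detection terms, reduces the same function to a single-task objective, which is one simple way to support both modes described above.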
Experimental Evaluation
Experiments are performed on the RADIal dataset, leveraging its high-fidelity and synchronized camera, radar, and LiDAR annotations. Extensive comparisons are made against state-of-the-art (SOTA) radar-only, camera-radar, and multitask models, including EchoFusion, ROFusion, CMS, REFNet, and several radar-only baselines (FFTRadNet, TFFTRadNet, ADCNet, TransRadar, SparseRadNet, Occugrid).
Key Quantitative Results
- Detection: REFNet++ achieves an F1 of 93.70% (single-task) and 92.67% (multi-task), with the best angle error (AE) and second-best AP/AR versus all fusion competitors. The best F1 is marginally above EchoFusion and surpasses REFNet and all radar-only baselines.
- Segmentation: REFNet++ sets a new state-of-the-art with mean IoU at 88.13% (single-task) and 87.58% (multi-task), significantly outperforming prior radar and fusion methods.
- Computational Efficiency: REFNet++ matches or outperforms all SOTA models in computational footprint, with training approximately twice as fast as REFNet due to the fully implicit learnable transformation. It records high throughput (up to 7.26 FPS multitask), low model size, and efficient GPU utilization.
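For readers less familiar with the reported metrics, the following sketch shows how F1 and IoU are conventionally defined. It is a generic illustration, not the RADIal evaluation script: how precision and recall are averaged over thresholds, and how IoU is averaged over frames, are assumptions here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0   # two empty masks count as a match


def mean_iou(pred_masks, true_masks):
    """Average per-frame IoU (frame-wise averaging is an assumption)."""
    scores = [binary_iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return sum(scores) / len(scores)


# Toy free-space masks for two frames; values are made up for illustration.
preds = [[[1, 1], [0, 0]], [[1, 0], [1, 0]]]
truth = [[[1, 0], [0, 0]], [[1, 0], [1, 0]]]
miou = mean_iou(preds, truth)
```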
Ablation and Visualization
Ablation studies confirm:
- Camera-only operation leads to marked drops in accuracy, validating the necessity of sensor fusion.
- Removing the variational reparameterization increases training time threefold with no accuracy benefit, demonstrating its relevance for efficient convergence.
Qualitative inspection indicates that the proposed fusion model provides robust detection and segmentation, outperforming both radar-only and camera-only branches in challenging cases with overlapping or obscured objects.
Implications and Theoretical Analysis
REFNet++ demonstrates that learning the geometric sensor alignment as part of the fusion architecture can obviate manual calibration and expensive preprocessing, reducing both annotation and computational requirements. The variational encoder-decoder ensures that the feature space is expressive enough to encode perspective-to-BEV transformations end-to-end, making the architecture adaptable to new sensor setups or deployment domains.
From a theoretical standpoint, this paradigm advances the field toward practical, deployable multi-modal learning systems for ADAS and autonomous vehicles by addressing the trade-off between accuracy and efficiency. It also establishes the feasibility of multitask learning for perception with minimal architectural and computational overhead.
The implicit fusion of feature spaces and learnable camera-to-BEV projection could be further extended to other tasks or modalities (e.g., LiDAR, event cameras), provided data and computational constraints are respected. The architecture's flexibility in backbone choice allows adaptation across a spectrum of embedded and high-performance platforms.
Future Prospects
Potential extensions include:
- Incorporation of additional modalities such as LiDAR for richer BEV representations, especially in environments with severe weather or occlusions.
- Adaptation to open-vocabulary detection tasks or integration with unified scene understanding pipelines.
- Transfer learning to new domains through domain-adaptive variational modules.
- Use with larger and more diverse datasets to further improve generalization and robust deployment under varying real-world conditions.
Conclusion
REFNet++ (2605.11824) delivers a resource-efficient, multi-task fusion system for camera and radar in the BEV polar view, offering high accuracy for detection and segmentation with substantially reduced computational overhead. Its design paradigm—implicit, learnable geometric alignment across modalities—removes the need for hand-designed transformations and manual feature alignment, achieving state-of-the-art results in multitask perception while maintaining operational efficiency suitable for automotive deployment. The method points the way toward scalable, robust multi-sensor fusion in future intelligent vehicle systems.