BroadBEV: Collaborative Sensor Fusion

Updated 10 June 2026

BroadBEV is a sensor fusion framework that enhances BEV perception by combining reliable LiDAR depth cues with camera-based feature extraction.
The Point-scattering module aligns modalities by transferring LiDAR occupancy information to refine monocular camera depth estimates.
The ColFusion module leverages cross-attention between camera and LiDAR features to generate robust semantic maps, even under adverse conditions.

BroadBEV is a collaborative sensor fusion framework designed to overcome critical limitations in autonomous vehicle perception from Bird’s Eye View (BEV) representations. Specifically, BroadBEV introduces two principal modules, Point-scattering and ColFusion, to address depth estimation errors in monocular camera branches and the inherent sparsity of LiDAR data at long ranges. The framework spatially synchronizes camera and LiDAR information, enabling broad-sighted semantic map construction with significant gains in performance over prior methods (Kim et al., 2023).

1. Motivation and Context

Traditional BEV sensor fusion approaches, such as Lift-Splat and BEVFusion, exhibit two primary weaknesses. First, monocular depth estimation from cameras introduces notable errors, particularly for objects at distances greater than 30–40 meters, often leading to spatially misaligned or warped BEV outputs. Second, while LiDAR sensors provide accurate range information, their point density falls off rapidly with distance, resulting in insufficient context for far-field structures (e.g., distant crosswalks or dividers). BroadBEV directly targets these deficiencies by (a) transferring LiDAR's reliable BEV depth knowledge to the camera pathway using Point-scattering, and (b) employing collaborative, attention-based fusion (ColFusion) to harness the complementary properties of camera and LiDAR modalities (Kim et al., 2023).

2. Architectural Overview

BroadBEV operates in three interconnected stages:

Feature Extraction:
- The camera branch processes $N$ input images with a Swin-T backbone, yielding 2D feature maps $Z_I \in \mathbb{R}^{N \times H \times W \times C_I}$.
- The LiDAR branch uses VoxelNet to encode raw point clouds as BEV feature maps $Z^{BEV}_P \in \mathbb{R}^{H' \times W' \times C_P}$.
View Transformation & Point-scattering:
- LiDAR BEV features are projected into a per-cell BEV depth probability map $D^{BEV}_P$.
- Camera image-based depth and context are lifted and splatted into the BEV frame via BEV pooling operations.
- The LiDAR-derived $D^{BEV}_P$ is fused with the camera BEV depth estimate to refine and spatially reweight the camera’s BEV features.
Collaborative BEV Fusion (ColFusion):
- Self-attention weights are computed separately for the camera and LiDAR BEV features.
- These attention weights are then exchanged between modalities (cross-attention), so that each modality’s features are also filtered by the other’s attention pattern.
- The four resulting attention-filtered BEV features are fused via a BEV-oriented Feature Pyramid Decoder (BEV-FPD) to yield the final semantic map.

3. Point-scattering Module

The Point-scattering module enables spatial alignment between camera and LiDAR features by transferring LiDAR occupancy cues to correct monocular camera BEV mislocalizations. The procedure includes:

LiDAR BEV Depth Prediction:
- $D^{BEV}_P = \sigma(h_\nu(Z^{BEV}_P))$
- $h_\nu$ is a 1×1 convolution and $\sigma$ is the sigmoid function, so $D^{BEV}_P(x, y)$ is an occupancy likelihood in $(0, 1)$ for each BEV cell.
Camera BEV Pooling:
- Context features and depth predictions from the camera image branch are lifted and splatted into BEV coordinates via:
$\tilde{Z}_I^{BEV}(x, y) = \sum_{u, v, d} Z_I(u, v) D_I(u, v, d) \mathbf{1}\{(x, y) \leftrightarrow (u, v, d)\}$

A similar pooling is applied to the camera depth estimates, giving the pooled camera BEV depth map $\tilde{D}_I^{BEV}$.
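The pooling sum above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `bev_pool` and the precomputed `cell_index` lookup (which stands in for the indicator $\mathbf{1}\{(x, y) \leftrightarrow (u, v, d)\}$ and would normally come from camera calibration) are assumptions made for the example.

```python
import numpy as np

def bev_pool(feat, depth_prob, cell_index, num_cells):
    """Sum depth-weighted pixel features into flattened BEV cells.

    feat:       (P, C) image features, one row per pixel (u, v)
    depth_prob: (P, D) per-pixel depth distribution D_I(u, v, d)
    cell_index: (P, D) id of the BEV cell hit by each (pixel, depth bin),
                or -1 when the frustum point falls outside the grid
    """
    # Each frustum point carries Z_I(u, v) * D_I(u, v, d).
    weighted = depth_prob[:, :, None] * feat[:, None, :]   # (P, D, C)
    valid = cell_index >= 0
    bev = np.zeros((num_cells, feat.shape[1]))
    # np.add.at accumulates correctly when several points share a cell.
    np.add.at(bev, cell_index[valid], weighted[valid])
    return bev

# Toy case: 2 pixels, 2 depth bins, 3 BEV cells (hypothetical values).
feat = np.array([[1.0, 2.0], [10.0, 20.0]])
depth_prob = np.array([[0.25, 0.75], [0.5, 0.5]])
cell_index = np.array([[0, 1], [1, -1]])
bev = bev_pool(feat, depth_prob, cell_index, num_cells=3)
```

A plain `bev[idx] += values` would silently drop contributions when indices repeat, which is why the sketch uses `np.add.at`.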

Fusion ("Scattering") of Depth Distributions:
- The LiDAR and camera BEV depth maps are concatenated and passed through a 1×1 convolution and sigmoid:
$D_I^{BEV}(x, y) = \sigma\left(h_\psi(\tilde{D}_I^{BEV}(x, y), D_P^{BEV}(x, y))\right)$

This operation provides camera features with spatial information about LiDAR-informed free space and occupancy.

Re-weighted Camera BEV Features:
- The refined depth confidence $D_I^{BEV}$ reweights the pooled camera features elementwise:
$\hat{Z}_I^{BEV}(x, y) = D_I^{BEV}(x, y) \, \tilde{Z}_I^{BEV}(x, y)$

Together with a small convolution and an FPN, this yields the final camera BEV representation $Z_I^{BEV}$.
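As a rough sketch of the scattering steps described in this section (LiDAR occupancy prediction, depth fusion, and reweighting), the NumPy code below treats each 1×1 convolution as a per-cell linear layer. The function names, the `params` dictionary, and the random weights are illustrative assumptions; the subsequent small convolution and FPN are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight, bias):
    """1x1 convolution on an (H, W, C_in) map, i.e. a per-cell linear layer."""
    return x @ weight + bias   # weight: (C_in, C_out), bias: (C_out,)

def point_scatter(z_lidar, d_cam_pooled, z_cam_pooled, params):
    # LiDAR BEV occupancy likelihood: sigmoid(h_nu(Z_P)).
    d_lidar = sigmoid(conv1x1(z_lidar, params["w_nu"], params["b_nu"]))    # (H, W, 1)
    # Concatenate camera and LiDAR depth maps, then 1x1 conv + sigmoid.
    stacked = np.concatenate([d_cam_pooled, d_lidar], axis=-1)            # (H, W, 2)
    d_cam = sigmoid(conv1x1(stacked, params["w_psi"], params["b_psi"]))   # (H, W, 1)
    # Broadcast the refined confidence over channels to reweight features.
    return d_cam * z_cam_pooled, d_cam, d_lidar

# Hypothetical sizes and random weights, only to exercise the shapes.
rng = np.random.default_rng(0)
H, W, C_P, C_I = 4, 4, 8, 6
params = {
    "w_nu": rng.normal(size=(C_P, 1)), "b_nu": np.zeros(1),
    "w_psi": rng.normal(size=(2, 1)), "b_psi": np.zeros(1),
}
z_lidar = rng.normal(size=(H, W, C_P))
d_cam_pooled = rng.random((H, W, 1))
z_cam_pooled = rng.normal(size=(H, W, C_I))
z_cam, d_cam, d_lidar = point_scatter(z_lidar, d_cam_pooled, z_cam_pooled, params)
```

Because the refined confidence lies in (0, 1), cells that LiDAR marks as unlikely to be occupied can have their camera features attenuated rather than removed outright.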

4. Collaborative Fusion (ColFusion)

ColFusion is a modality-interaction scheme that leverages cross-attention to facilitate reciprocal enhancement of LiDAR and camera BEV representations:

Self-attention Mechanism:
- For each modality $m \in \{I, P\}$, compute queries, keys, and values from its BEV features $Z_m^{BEV}$ with learned projections:
$Q_m = Z_m^{BEV} W_m^Q, \quad K_m = Z_m^{BEV} W_m^K, \quad V_m = Z_m^{BEV} W_m^V$

Self-attention weights:

$A_m = \mathrm{softmax}\left(Q_m K_m^\top / \sqrt{d}\right)$

where $d$ is the key dimension.

Cross-modal Application:
- Cross-attending each modality’s features with the attention weights of the other:
$Z_{m \leftarrow n} = A_n V_m, \quad m, n \in \{I, P\}$

This yields four intermediate BEV representations, corresponding to within-sensor ($m = n$) and cross-sensor ($m \neq n$) attention.

Final Fusion via BEV-FPD:
- A learned weighted sum of the four representations is input to a BEV Feature Pyramid Decoder:
$Z^{BEV} = \mathrm{FPD}\left(\sum_{m, n \in \{I, P\}} w_{mn} Z_{m \leftarrow n}\right)$

The final BEV feature map is decoded into semantic map predictions.
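The exchange of attention weights can be illustrated with a small NumPy sketch. It assumes standard scaled dot-product attention over flattened BEV cells and equal token counts for both modalities; the function names, projection matrices, and mixing weights are hypothetical, and the BEV-FPD decoder is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_parts(z, w_q, w_k, w_v):
    """Return (attention weights, values) for flattened BEV tokens z: (T, C)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, T), rows sum to 1
    return attn, v

def colfusion(z_cam, z_lidar, proj_cam, proj_lidar, mix):
    a_cam, v_cam = attention_parts(z_cam, *proj_cam)
    a_lid, v_lid = attention_parts(z_lidar, *proj_lidar)
    branches = [
        a_cam @ v_cam,   # camera values, camera attention (within-sensor)
        a_lid @ v_cam,   # camera values, LiDAR attention (cross-sensor)
        a_lid @ v_lid,   # LiDAR values, LiDAR attention (within-sensor)
        a_cam @ v_lid,   # LiDAR values, camera attention (cross-sensor)
    ]
    # Weighted sum of the four branches, which would feed the decoder.
    fused = sum(w * b for w, b in zip(mix, branches))
    return fused, branches

rng = np.random.default_rng(0)
T, C, d = 16, 8, 4                         # tokens, channels, head width
z_cam = rng.normal(size=(T, C))
z_lidar = rng.normal(size=(T, C))
proj_cam = tuple(rng.normal(size=(C, d)) for _ in range(3))
proj_lidar = tuple(rng.normal(size=(C, d)) for _ in range(3))
mix = [0.25, 0.25, 0.25, 0.25]             # placeholder for learned weights
fused, branches = colfusion(z_cam, z_lidar, proj_cam, proj_lidar, mix)
```

Dense attention over every pair of BEV cells grows quadratically with grid size, which is consistent with the latency cost discussed in Section 6; the sketch makes no attempt to reduce it.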

5. Training Protocol and Empirical Results

BroadBEV is trained end-to-end with three loss components:

Camera Depth Loss ($\mathcal{L}_{depth}$): Cross-entropy between the predicted camera depth distribution and LiDAR-derived ground-truth depth bins at each pixel.
LiDAR BEV Depth Loss ($\mathcal{L}_{occ}$): Binary cross-entropy between predicted LiDAR BEV occupancy and a one-hot LiDAR BEV map.
Map Segmentation Loss ($\mathcal{L}_{seg}$): Multi-class cross-entropy on the semantic map grid.

Total objective:

$\mathcal{L} = \lambda_{depth} \mathcal{L}_{depth} + \lambda_{occ} \mathcal{L}_{occ} + \lambda_{seg} \mathcal{L}_{seg}$

with scalar weights $\lambda_{depth}, \lambda_{occ}, \lambda_{seg}$ determined on the validation set (Kim et al., 2023).
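A minimal NumPy sketch of such a weighted three-term objective is given below. The helper names, the toy probabilities, and the unit weights are assumptions for illustration; the weights actually used are not stated in this text.

```python
import numpy as np

def cross_entropy(prob, target):
    """Mean multi-class CE. prob: (N, K) rows summing to 1, target: (N,) ids."""
    picked = prob[np.arange(len(target)), target]
    return float(-np.log(np.clip(picked, 1e-12, 1.0)).mean())

def binary_cross_entropy(prob, target):
    """Mean BCE. prob and target share a shape; target holds 0/1."""
    p = np.clip(prob, 1e-12, 1.0 - 1e-12)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def total_loss(cam_depth, lidar_occ, seg, weights):
    l_depth = cross_entropy(*cam_depth)         # camera depth bins
    l_occ = binary_cross_entropy(*lidar_occ)    # LiDAR BEV occupancy
    l_seg = cross_entropy(*seg)                 # semantic map classes
    w_depth, w_occ, w_seg = weights
    return w_depth * l_depth + w_occ * l_occ + w_seg * l_seg

# Toy predictions and targets (hypothetical values).
cam_depth = (np.array([[0.5, 0.5], [0.25, 0.75]]), np.array([0, 1]))
lidar_occ = (np.array([0.5, 0.5]), np.array([1.0, 0.0]))
seg = (np.array([[0.5, 0.5]]), np.array([0]))
loss = total_loss(cam_depth, lidar_occ, seg, weights=(1.0, 1.0, 1.0))
```

Keeping the three terms separate before weighting makes it straightforward to log each one and to sweep the weights on a validation split.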

Experimental Setting:

Dataset: nuScenes (700 train, 150 val, 150 test scenes).
Sensors: 6 cameras and 1 LiDAR, with the point cloud voxelized on a regular grid.
Tasks:
- Map segmentation on a BEV grid around the ego vehicle, evaluated with mIoU.
- HD map construction on a BEV grid around the ego vehicle, evaluated with mIoU.
Baselines: OFT, LSS, CVT, M²BEV, BEVFusion, PointPillars, CenterPoint, PointPainting, MVP, X-Align, UniM²AE, HDMapNet, LiDAR2Map, BEVFormer, BEVSegFormer, BEVerse, UniFusion.
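For reference, the mIoU metric used in both tasks can be computed on label grids roughly as follows. This is a generic sketch that assumes one integer class label per BEV cell; the benchmark's exact protocol (for example, per-class binary masks or thresholds) is not given in this text, and `mean_iou` is a name chosen for the example.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for integer label grids."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue   # class absent from both grids: leave it out of the mean
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 BEV grids with two classes (hypothetical values).
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 scores 1/2 and class 1 scores 2/3, so one misplaced cell already costs a large share of the metric on a small grid.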

Results:

Map segmentation (nuScenes val): BroadBEV 70.1% mIoU; prior best (UniM²AE) 67.8%.
Under rain: BroadBEV 63.7% vs. X-Align 57.8%.
Under night: BroadBEV 50.8% vs. X-Align 46.1%.
HD map construction: BroadBEV 64.0% mIoU vs. LiDAR2Map 58.1%.
These improvements extend across all semantic classes.

Ablation:

Configuration	mIoU gain on nuScenes val (percentage points)
Point-scattering only	+1.3%
ColFusion only	+6.5%
Both (full BroadBEV)	+7.5% (62.6→70.1%)

At far distances (60–80 m), BroadBEV preserves 5–10% higher mIoU relative to typical fusion baselines.

6. Analysis and Limitations

BroadBEV’s qualitative assessment shows robust recovery of distant features (e.g., crosswalks at 70 m) that are absent in camera-only or LiDAR-only BEV reconstructions. In adverse conditions (rain/night), Point-scattering stabilizes depth estimation; fused maps remain continuous, while baselines exhibit fragmentation or ghosting. Attention visualizations indicate that ColFusion enables each modality to focus on its domain strengths: cameras attend to texture, LiDAR to geometric structure.

However, ColFusion’s attention mechanism introduces approximately 70 ms additional latency, raising total inference time to 158 ms on an A100 GPU. This may be prohibitive for strict real-time applications. Further, effective Point-scattering depends on accurate extrinsic calibration between sensors; mis-calibration can degrade cross-modal alignment. Depth supervision in the camera branch is derived from LiDAR, restricting the approach to settings where dense LiDAR data is available for training. A plausible implication is that moving towards fully unsupervised or self-supervised depth estimation would expand applicability.
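The latency figures above translate into throughput with simple arithmetic. The sketch below assumes frames are processed one at a time with no batching or pipelining, which is an assumption of the example rather than a statement from the source.

```python
TOTAL_MS = 158.0      # reported total inference time on an A100
COLFUSION_MS = 70.0   # reported approximate ColFusion overhead

def frames_per_second(latency_ms):
    """Throughput when frames are processed strictly one after another."""
    return 1000.0 / latency_ms

remaining_ms = TOTAL_MS - COLFUSION_MS            # rest of the pipeline: 88 ms
full_fps = frames_per_second(TOTAL_MS)            # roughly 6.3 frames per second
reduced_fps = frames_per_second(remaining_ms)     # roughly 11.4 frames per second
```

Under this assumption the attention stage accounts for a little under half of the per-frame budget, which is why lighter interaction schemes are an obvious target for optimization.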

7. Prospects for Future Research

Future directions involve addressing computational overhead via lighter cross-modal interaction schemes (e.g., MetaFormer or PoolFormer versions), extending beyond map segmentation to 3D detection and future prediction tasks, and refining temporal fusion for frame-to-frame motion consistency. Improving cross-modal calibration robustness and moving towards unsupervised or self-supervised depth learning are also recognized avenues for development. These efforts aim to further advance broad-sighted scene understanding for safe and reliable autonomous systems (Kim et al., 2023).


References

Kim et al. (2023). BroadBEV: Collaborative LiDAR-camera Fusion for Broad-sighted Bird's Eye View Map Construction.
