CoBEVFusion DWCA: Cooperative Perception
- The paper introduces a dual-window cross-attention module that fuses LiDAR and camera BEV features for cooperative perception in autonomous vehicles.
- It leverages a five-module architecture combining local sensor processing, V2V feature sharing, and 3D CNN aggregation to enhance inference robustness.
- Experimental results on the OPV2V benchmark show significant improvements in mIoU and AP metrics over traditional single-vehicle methods.
CoBEVFusion is a cooperative perception framework for Connected Autonomous Vehicles (CAVs) that employs multimodal fusion of LiDAR and camera data, utilizing a Dual Window–Based Cross-Attention (DWCA) module to create a unified Bird’s-Eye View (BEV) representation. The architecture processes and fuses multimodal sensor data locally within each CAV, shares fused BEV features among vehicles, and aggregates the received information for downstream tasks such as BEV semantic segmentation and 3D object detection. CoBEVFusion demonstrates state-of-the-art performance for both single-vehicle and cooperative perception on the OPV2V benchmark, setting a precedent for future research in cross-modal, BEV-based cooperative perception architectures (Qiao et al., 2023).
1. System Architecture
CoBEVFusion comprises five primary modules:
- LiDAR Stream Processing: The input is a raw point cloud. A pillar feature network (PointPillars) partitions the cloud into vertical pillars, processes each pillar with a small PointNet, and scatters the results into a 2D pseudo-image. A Feature Pyramid Network (FPN) then refines this pseudo-image, and a final 2D convolution reduces the feature channels to the shared BEV channel dimension $C$. The output is the LiDAR BEV feature map $F_L \in \mathbb{R}^{H \times W \times C}$.
- Camera Stream Processing: The camera subsystem receives monocular images with their associated intrinsic and extrinsic calibration. A ResNet/EfficientNet backbone extracts multi-scale features, followed by a Cross-View Transformer (CVT), which projects them onto a unified BEV plane using learned queries and positional encodings. Additional 2D CNN and bilinear upsampling layers yield the camera BEV feature map $F_C \in \mathbb{R}^{H \times W \times C}$.
- BEV Representation Alignment: Both the LiDAR features $F_L$ and the camera features $F_C$ are spatially aligned on a shared $H \times W$ grid with equal channel dimensionality $C$, facilitating attention-based fusion.
- LiDAR-Camera BEV Fusion (DWCA): The DWCA module aligns and fuses LiDAR and camera features through dual windowed cross-attention procedures, further detailed in Section 2.
- Cooperative Feature Fusion and Perception Head: Locally fused BEV representations are spatially broadcast to peers via Vehicle-to-Vehicle (V2V) communication. A 3D CNN aggregates local and received features, feeding into the perception head for segmentation or 3D detection.
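The pillarization step of the LiDAR stream can be illustrated with a minimal NumPy sketch. It is not the PointPillars implementation: the function name `pillarize`, the grid ranges, and the hand-crafted per-pillar features (point count and maximum height) are assumptions that stand in for the learned PointNet encoding.

```python
import numpy as np

def pillarize(points, x_range, y_range, cell):
    """Scatter an (N, 3) point cloud into a 2D BEV pseudo-image.

    Each BEV cell is one vertical pillar. The per-pillar feature here is
    [point count, max height], a hypothetical stand-in for a learned encoding.
    """
    W = int(round((x_range[1] - x_range[0]) / cell))
    H = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    # Drop points that fall outside the BEV grid.
    keep = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    ix, iy, z = ix[keep], iy[keep], points[keep, 2]

    count = np.zeros((H, W))
    max_z = np.full((H, W), -np.inf)
    np.add.at(count, (iy, ix), 1.0)      # unbuffered accumulation per pillar
    np.maximum.at(max_z, (iy, ix), z)    # per-pillar max pooling over height
    max_z[count == 0] = 0.0              # empty pillars get a zero feature
    return np.stack([count, max_z], axis=-1)  # shape (H, W, 2)

points = np.array([
    [0.5, 0.5, 1.0],
    [0.6, 0.4, 2.0],
    [3.5, 1.5, -1.0],
    [9.0, 9.0, 0.0],   # outside the grid, discarded
])
bev = pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell=1.0)
print(bev.shape)  # (2, 4, 2)
```

The max over points in a pillar mirrors the pooling a PointNet applies, which is why the result is independent of point order.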
2. Dual Window–Based Cross-Attention (DWCA) Module
The DWCA module is pivotal to CoBEVFusion, facilitating robust multimodal fusion:
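- Windowed Cross-Attention (WCA): The $H \times W$ BEV grid is partitioned into non-overlapping windows of size $M \times M$. Patch-embedded feature maps are reshaped into windows $X_w \in \mathbb{R}^{M^2 \times C}$ with $M^2$ tokens per window. Within each window $w$, features are linearly projected:
- $Q_w = X_w^{q} W_Q$
- $K_w = X_w^{kv} W_K$
- $V_w = X_w^{kv} W_V$
- where $X_w^{q}$ are query tokens from one modality, $X_w^{kv}$ are tokens from the other, and $W_Q, W_K, W_V$ are learnable.
The windowed attention computation is:
$$\mathrm{Attn}(Q_w, K_w, V_w) = \mathrm{softmax}\!\left(\frac{Q_w K_w^{\top}}{\sqrt{d}}\right) V_w$$
where $d$ is the projected feature dimension. A final linear layer and residual connection yield the window’s output.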
- Dual Windows: While the paper does not explicitly distinguish “inner” and “outer” windows, a plausible extension treats inner windows as the non-overlapping $M \times M$ partition and outer windows as overlapping windows that straddle the inner window boundaries, with WCA applied to both and the outputs concatenated or added.
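- DWCA Architecture: Two WCAs operate in parallel: one uses the LiDAR features $F_L$ as query and the camera features $F_C$ as key/value (the WCA(LC) configuration); the other reverses the roles (WCA(CL)). Their outputs are concatenated along the channel axis to form a tensor of shape $H \times W \times 2C$, followed by a multi-head self-attention (MHSA) aggregator projecting back to $C$ channels:
$$F_{\text{fused}} = \mathrm{MHSA}\big(\mathrm{Concat}\big[\mathrm{WCA}(F_L, F_C),\ \mathrm{WCA}(F_C, F_L)\big]\big)$$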
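The window partitioning and the two cross-attention directions described above can be sketched in NumPy. This is a simplified illustration under several assumptions: it uses a single attention head, omits the final linear layer inside each WCA, uses random projection matrices, and replaces the MHSA aggregator with a plain linear projection (`W_out`). All function and variable names are hypothetical.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C) using non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_merge(w, H, W, M):
    """Inverse of window_partition: (num_windows, M*M, C) -> (H, W, C)."""
    C = w.shape[-1]
    x = w.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def wca(x_q, x_kv, Wq, Wk, Wv, M):
    """Single-head windowed cross-attention with a residual connection."""
    H, W, _ = x_q.shape
    q = window_partition(x_q, M) @ Wq    # queries from one modality
    k = window_partition(x_kv, M) @ Wk   # keys from the other modality
    v = window_partition(x_kv, M) @ Wv   # values from the other modality
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return x_q + window_merge(attn @ v, H, W, M)

rng = np.random.default_rng(0)
H, W, C, M = 8, 8, 4, 4
F_L = rng.normal(size=(H, W, C))   # LiDAR BEV features
F_C = rng.normal(size=(H, W, C))   # camera BEV features

def proj():
    return [rng.normal(size=(C, C)) * 0.1 for _ in range(3)]

lc = wca(F_L, F_C, *proj(), M)     # LiDAR query, camera key/value
cl = wca(F_C, F_L, *proj(), M)     # camera query, LiDAR key/value
dual = np.concatenate([lc, cl], axis=-1)       # (H, W, 2C)
W_out = rng.normal(size=(2 * C, C)) * 0.1      # stand-in for the MHSA aggregator
fused = dual @ W_out                           # back to (H, W, C)
print(dual.shape, fused.shape)  # (8, 8, 8) (8, 8, 4)
```

Because attention is computed independently inside each window, each score matrix is only $M^2 \times M^2$, regardless of the size of the BEV grid.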
3. Cooperative Feature Sharing and Aggregation
- BEV Feature Sharing: Each CAV computes its fused BEV feature map $F_{\text{fused}}$ in its own coordinate frame and broadcasts it to peers using V2V communication (intermediate fusion).
- Coordinate Alignment: Upon reception, each vehicle reprojects its peers’ $F_{\text{fused}}$ into its ego BEV grid using the relative pose obtained from GPS/IMU, employing bilinear feature sampling for spatial alignment.
- 3D CNN Aggregator: A stack of three Conv3D layers (stride 1, each followed by batch normalization and ReLU) aggregates BEV features along the agent (temporal) dimension. Aggregation is performed by element-wise summation across agents, followed by Conv3D.
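The coordinate alignment and summation steps can be sketched as below. This is a minimal 2D illustration, not the paper's code: the function `warp_to_ego`, the planar pose parameterization (yaw plus translation), the grid convention (origin at the grid centre, x along columns, y along rows), and the cell size are all assumptions, and the Conv3D stack is omitted.

```python
import numpy as np

def warp_to_ego(feat, yaw, t, cell=1.0):
    """Resample a peer BEV map (H, W, C) onto the ego grid by bilinear sampling.

    (yaw, t) is the peer's pose in the ego frame: p_ego = R(yaw) @ p_peer + t.
    Ego cells that fall outside the peer's map are filled with zeros.
    """
    H, W, _ = feat.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Ego cell centres in metres, shifted by the peer's translation.
    x = (jj - (W - 1) / 2) * cell - t[0]
    y = (ii - (H - 1) / 2) * cell - t[1]
    c, s = np.cos(yaw), np.sin(yaw)
    # Inverse rotation: p_peer = R(yaw)^T @ (p_ego - t).
    xp, yp = c * x + s * y, -s * x + c * y
    u = xp / cell + (W - 1) / 2   # fractional column in the peer map
    v = yp / cell + (H - 1) / 2   # fractional row in the peer map
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[..., None], (v - v0)[..., None]

    def tap(r, col):
        # Read one neighbour, returning zeros for out-of-range indices.
        ok = (r >= 0) & (r < H) & (col >= 0) & (col < W)
        out = feat[np.clip(r, 0, H - 1), np.clip(col, 0, W - 1)]
        return out * ok[..., None]

    top = (1 - du) * tap(v0, u0) + du * tap(v0, u0 + 1)
    bottom = (1 - du) * tap(v0 + 1, u0) + du * tap(v0 + 1, u0 + 1)
    return (1 - dv) * top + dv * bottom

rng = np.random.default_rng(0)
ego = rng.normal(size=(4, 4, 2))
peer = rng.normal(size=(4, 4, 2))
# Peer sits one cell ahead of the ego along x, with the same heading.
aligned = warp_to_ego(peer, yaw=0.0, t=(1.0, 0.0))
aggregated = ego + aligned   # element-wise summation across agents
print(aggregated.shape)  # (4, 4, 2)
```

With a pure one-cell translation the warp reduces to a column shift, which makes it easy to sanity-check the sampling convention before adding rotation.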
4. Training Protocol and Supervision
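- BEV Semantic Segmentation Loss: The output is a four-class per-pixel softmax (background, vehicle, drivable area, lane) trained with weighted cross-entropy,
$$\mathcal{L}_{\text{seg}} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \log p_{i, y_i}$$
where $y_i$ is the ground truth class per pixel, $p_{i, y_i}$ is the predicted probability of that class, $w_{y_i}$ is its class weight, and $N$ is the number of pixels.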
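- 3D Object Detection Loss: An SSD-style head predicts class confidences $p$ and 3D box regressions $\hat{b}$. Classification uses focal loss,
$$\mathcal{L}_{\text{cls}} = -\alpha_t (1 - p_t)^{\gamma} \log p_t$$
and regression uses smooth $L_1$ loss,
$$\mathcal{L}_{\text{reg}} = \sum_{j} \mathrm{smooth}_{L_1}\big(\hat{b}_j - b_j\big)$$
with the combined detection loss,
$$\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \lambda\, \mathcal{L}_{\text{reg}}$$
where $p_t$ is the predicted probability of the true class, $b_j$ are the ground-truth box parameters, and $\lambda$ balances the two terms.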
- Optimization: AdamW optimizer with weight decay, a multi-step learning-rate scheduler that decays the rate at fixed epochs, and early stopping on the validation set.
- Ground Truths: BEV segmentations are rasterized from annotated 3D bounding boxes and drivable area/lane labels; 3D detection targets are defined from projected bounding boxes onto BEV anchors.
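The three loss terms named above can be written out in a few lines of NumPy. This is a sketch under assumptions: the focal-loss defaults `alpha=0.25` and `gamma=2.0` are the commonly used values rather than ones confirmed for this paper, all losses are averaged over their inputs, and the function names are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """probs: (N, K) softmax outputs; labels: (N,) integer class ids."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean(class_weights[labels] * -np.log(p_true)))

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; p is the predicted probability of the positive class."""
    p = np.clip(p, 1e-7, 1 - 1e-7)             # avoid log(0)
    p_t = np.where(target == 1, p, 1 - p)      # probability of the true class
    a_t = np.where(target == 1, alpha, 1 - alpha)
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))

def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)))

# Uniform predictions over 4 classes with unit weights give log(4).
probs = np.full((3, 4), 0.25)
labels = np.array([0, 1, 3])
seg_loss = weighted_cross_entropy(probs, labels, np.ones(4))

# The focal term down-weights confident, correct predictions.
easy = focal_loss(np.array([0.9]), np.array([1]))
hard = focal_loss(np.array([0.6]), np.array([1]))

reg_loss = smooth_l1(np.array([0.5, 3.0]), np.zeros(2))
print(seg_loss, easy < hard, reg_loss)
```

Setting `gamma=0` removes the modulating factor and recovers an alpha-weighted cross-entropy, which is a quick way to check a focal-loss implementation.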
5. Experimental Evaluation
CoBEVFusion was evaluated on OPV2V’s Default CARLA Towns split (6765 train / 1980 val / 2170 test), and its performance metrics included per-class and mean intersection-over-union (mIoU) for BEV segmentation, and average precision (AP) at IoU thresholds 0.5 and 0.7 for 3D object detection.
| Model / Task | mIoU Vehicle | mIoU Drivable Area | mIoU Lane | AP@0.5 | AP@0.7 |
|---|---|---|---|---|---|
| DWCA (LC), Single-vehicle | 40.4 | 61.4 | 47.6 | 83.4 | 64.7 |
| CoBEVFusion (LC), Coop. | 59.5 | 61.7 | 50.9 | 92.5 | 86.5 |
DWCA outperforms LiDAR-only (PointPillars) and BEVFusion approaches for both segmentation and detection (Qiao et al., 2023). Under cooperative perception, CoBEVFusion reaches 59.5% vehicle mIoU and 92.5% AP@0.5, surpassing LiDAR-only CAV strategies. On domain-adapted Culver City data, CoBEVFusion achieves 79.4% AP. Ablation studies show that:
- WCA(LC) (LiDAR query, Camera key/value) yields improved vehicle AP.
- WCA(CL) (Camera query, LiDAR key/value) improves drivable area/lane.
- DWCA (combining both) surpasses either variant alone in all reported metrics.
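For reference, the intersection-over-union metric used for the segmentation columns of the table can be computed per class as in the following sketch. The function name and the toy label maps are illustrative assumptions, not data from the paper.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU per class for integer label maps of identical shape."""
    ious = []
    for k in range(num_classes):
        inter = np.sum((pred == k) & (target == k))
        union = np.sum((pred == k) | (target == k))
        # A class absent from both maps has no defined IoU.
        ious.append(inter / union if union > 0 else float("nan"))
    return np.array(ious)

pred = np.array([[0, 1],
                 [1, 2]])
target = np.array([[0, 1],
                   [2, 2]])
ious = per_class_iou(pred, target, num_classes=3)
miou = np.nanmean(ious)   # mean over classes that actually occur
print(ious)  # [1.  0.5 0.5]
```

In this toy case one mislabelled pixel halves the IoU of both the predicted class and the true class, because it enters both unions.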
6. Computational and Communication Characteristics
- Inference Runtime: On an A100 GPU, inference achieves approximately 10 fps.
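- Computational Complexity: For a window size of $M \times M$, the attention cost of each WCA module scales as $O(HW \cdot M^2 \cdot C)$, which is linear in the number of BEV cells; the MHSA aggregator contributes its own attention cost on the concatenated features.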
- Communication Overhead: Fused BEV features of shape $H \times W \times C$ stored as float32 occupy $4 \cdot H \cdot W \cdot C$ bytes, which results in ~10 MB per vehicle per frame. This is lower than a raw point cloud (~50 MB), but significantly above late-fusion prediction-level protocols (~hundreds of kB).
- Robustness Assumptions: The architecture presumes lossless, low-latency peer-to-peer broadcast within CAV ranges. No experimental data is provided regarding adverse impact from packet loss, temporal asynchrony, or pose misalignment. This suggests further study of robustness to communication impairments is warranted.
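The per-frame payload quoted above follows from simple arithmetic on the feature-map shape. The sketch below uses hypothetical dimensions (`H = W = 256`, `C = 40`) chosen only because they land near the ~10 MB figure; they are not taken from the paper, and the 10 Hz broadcast rate is likewise an assumption.

```python
def bev_payload_bytes(H, W, C, bytes_per_value=4):
    """Size of one dense BEV feature map; 4 bytes per value for float32."""
    return H * W * C * bytes_per_value

H, W, C = 256, 256, 40            # hypothetical feature-map shape
payload = bev_payload_bytes(H, W, C)
payload_fp16 = bev_payload_bytes(H, W, C, bytes_per_value=2)

frames_per_second = 10            # assumed broadcast rate
link_mbit_per_s = payload * frames_per_second * 8 / 1e6

print(payload / 2**20)            # 10.0 (MiB per frame)
print(payload_fp16 / 2**20)       # 5.0 (MiB per frame at half precision)
print(link_mbit_per_s)            # sustained rate per vehicle in Mbit/s
```

Under these assumptions a single vehicle would need a sustained link of several hundred Mbit/s, which illustrates why feature compression or channel reduction matters for intermediate fusion.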
7. Limitations and Prospective Research
The study omits ablations on window size, attention head number, bandwidth constraints, and message synchronization, which constitute salient directions for future investigation. The communication model assumes ideal conditions, and resource-aware or fault-tolerant extensions are currently unaddressed. Systematic exploration of attention window partitioning and evaluation under network constraints may provide insights into scaling multimodal, BEV-based cooperative frameworks.
The DWCA module in CoBEVFusion successfully reconciles LiDAR’s high geometric fidelity with the textural reach of camera imagery in BEV space. The resulting performance sets a new bar for multimodal, cooperative vehicle perception, while suggesting further inquiry into adaptive windowing, efficient fusion under volume constraints, and robust aggregation under less-ideal network conditions (Qiao et al., 2023).