CoBEVFusion DWCA: Cooperative Perception

Updated 10 June 2026
  • The paper introduces a dual-window cross-attention module that fuses LiDAR and camera BEV features for cooperative perception in autonomous vehicles.
  • It leverages a five-module architecture combining local sensor processing, V2V feature sharing, and 3D CNN aggregation to enhance inference robustness.
  • Experimental results on the OPV2V benchmark show significant improvements in mIoU and AP metrics over traditional single-vehicle methods.

CoBEVFusion is a cooperative perception framework for Connected Autonomous Vehicles (CAVs) that employs multimodal fusion of LiDAR and camera data, utilizing a Dual Window–Based Cross-Attention (DWCA) module to create a unified Bird’s-Eye View (BEV) representation. The architecture processes and fuses multimodal sensor data locally within each CAV, shares fused BEV features among vehicles, and aggregates the received information for downstream tasks such as BEV semantic segmentation and 3D object detection. CoBEVFusion demonstrates state-of-the-art performance for both single-vehicle and cooperative perception on the OPV2V benchmark, setting a precedent for future research in cross-modal, BEV-based cooperative perception architectures (Qiao et al., 2023).

1. System Architecture

CoBEVFusion comprises five primary modules:

  1. LiDAR Stream Processing: The input is an $n \times 4$ point cloud $(x, y, z, \text{intensity})$. A pillar feature network (PointPillars) partitions the cloud into vertical pillars, processes each pillar with a small PointNet, and reconstructs a 2D pseudo-image of size $H \times W \times C$. Subsequent refinement is performed by a Feature Pyramid Network (FPN), and a final 2D convolution reduces the feature channels to $C$. The output is $F_\text{lidar} \in \mathbb{R}^{H \times W \times C}$.
  2. Camera Stream Processing: The camera subsystem receives $K$ monocular images $I_k \in \mathbb{R}^{h \times w \times 3}$ with associated intrinsics $K_k$ and extrinsics $(R_k, t_k)$. A ResNet/EfficientNet backbone extracts multi-scale features, followed by a Cross-View Transformer (CVT) which projects these into a unified BEV plane using learned queries and positional encodings. Additional 2D CNN and bilinear upsampling layers yield $F_\text{camera} \in \mathbb{R}^{H \times W \times C}$.
  3. BEV Representation Alignment: Both $F_\text{lidar}$ and $F_\text{camera}$ are spatially aligned on a shared $H \times W$ grid with equal channel dimensionality $C$, facilitating attention-based fusion.
  4. LiDAR-Camera BEV Fusion (DWCA): The DWCA module aligns and fuses LiDAR and camera features through dual windowed cross-attention procedures, further detailed in Section 2.
  5. Cooperative Feature Fusion and Perception Head: Locally fused BEV representations are spatially broadcast to peers via Vehicle-to-Vehicle (V2V) communication. A 3D CNN aggregates local and received features, feeding into the perception head for segmentation or 3D detection.
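
The pillar step of the LiDAR stream can be illustrated with a minimal NumPy sketch. It scatters an $n \times 4$ point cloud into an $H \times W$ grid, but it replaces the learned per-pillar PointNet with four hand-crafted statistics, so it is only a stand-in for the real pillar feature network. The function name `pillarize`, the metric ranges, and the grid size are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), H=8, W=8):
    """Scatter an (n, 4) cloud of (x, y, z, intensity) into an (H, W, 4) BEV image.

    Channels are hand-crafted (count, mean z, max z, mean intensity); a real
    pillar feature network would learn C channels with a small PointNet.
    """
    x, y, z, inten = points.T
    # Keep only points that fall inside the BEV extent.
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, inten = x[keep], y[keep], z[keep], inten[keep]

    # Map metric coordinates to integer pillar indices (clipped for float safety).
    col = ((x - x_range[0]) / (x_range[1] - x_range[0]) * W).astype(int)
    row = ((y - y_range[0]) / (y_range[1] - y_range[0]) * H).astype(int)
    col, row = np.clip(col, 0, W - 1), np.clip(row, 0, H - 1)
    flat = row * W + col

    # Per-pillar reductions.
    count = np.bincount(flat, minlength=H * W).astype(float)
    sum_z = np.bincount(flat, weights=z, minlength=H * W)
    sum_i = np.bincount(flat, weights=inten, minlength=H * W)
    max_z = np.full(H * W, -np.inf)
    np.maximum.at(max_z, flat, z)
    max_z[count == 0] = 0.0  # empty pillars carry zeros

    safe = np.maximum(count, 1.0)  # avoid division by zero in empty pillars
    feats = np.stack([count, sum_z / safe, max_z, sum_i / safe], axis=-1)
    return feats.reshape(H, W, 4)
```

Because the camera branch also ends on an $H \times W$ grid, the two feature maps can be compared cell by cell, which is the property the alignment step relies on.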

2. Dual Window–Based Cross-Attention (DWCA) Module

The DWCA module is pivotal to CoBEVFusion, facilitating robust multimodal fusion:

  • Windowed Cross-Attention (WCA): The $H \times W$ BEV grid is partitioned into non-overlapping windows of size $M \times M$. Patch-embedded feature maps are reshaped into windows $X \in \mathbb{R}^{N_w \times M^2 \times C}$ (with $N_w$ the number of windows) with $M^2$ tokens per window. Within each window $w$, features are linearly projected:
    • $Q = X_q W_Q$
    • $K = X_{kv} W_K$
    • $V = X_{kv} W_V$
    • where $X_q$ are query tokens from one modality, $X_{kv}$ from the other, and $W_Q, W_K, W_V$ are learnable.

The windowed attention computation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

where $d$ is the key dimension.

A final linear layer and residual connection yield the window’s output.

  • Dual Windows: While the paper does not explicitly distinguish “inner” and “outer” windows, a plausible extension treats inner windows as the non-overlapping windows used by WCA and outer windows as overlapping windows that straddle patch boundaries, with WCA applied to both and the outputs concatenated or added.
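  • DWCA Architecture: Two WCAs operate in parallel: the left WCA uses $F_\text{lidar}$ as query and $F_\text{camera}$ as key/value; the right WCA reverses the roles. Their outputs are concatenated along the channel axis to form a tensor of shape $H \times W \times 2C$, followed by a multi-head self-attention (MHSA) aggregator projecting back to $C$ channels:

$$F_\text{fused} = \text{MHSA}\big(\text{Concat}\big[\text{WCA}(F_\text{lidar}, F_\text{camera}),\ \text{WCA}(F_\text{camera}, F_\text{lidar})\big]\big)$$

where the first argument of WCA supplies the queries and the second supplies the keys and values.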
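
The data flow of this section can be sketched in NumPy under simplifying assumptions: a single attention head, no positional bias, and a plain linear projection standing in for the MHSA aggregator. The names `window_partition`, `window_reverse`, `wca`, `dwca`, and the weight arguments are hypothetical and only illustrate the shapes involved; this is not the authors' implementation.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C) using non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_reverse(w, M, H, W):
    """Inverse of window_partition: (num_windows, M*M, C) -> (H, W, C)."""
    C = w.shape[-1]
    x = w.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def wca(f_q, f_kv, Wq, Wk, Wv, Wo, M):
    """Single-head windowed cross-attention with a residual on the query stream."""
    H, W, C = f_q.shape
    xq, xkv = window_partition(f_q, M), window_partition(f_kv, M)
    Q, K, V = xq @ Wq, xkv @ Wk, xkv @ Wv
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1]))  # (nW, M*M, M*M)
    out = xq + (attn @ V) @ Wo  # final linear layer plus residual connection
    return window_reverse(out, M, H, W)

def dwca(f_lidar, f_camera, params_lc, params_cl, W_mix, M):
    """Two WCAs with swapped roles, channel concat, then a projection to C channels."""
    left = wca(f_lidar, f_camera, *params_lc, M)   # LiDAR queries, camera keys/values
    right = wca(f_camera, f_lidar, *params_cl, M)  # camera queries, LiDAR keys/values
    cat = np.concatenate([left, right], axis=-1)   # (H, W, 2C)
    return cat @ W_mix  # linear stand-in for the MHSA aggregator (assumption)
```

Each window attends only to tokens of the other modality in the same window, so changing camera features inside one window leaves the output of every other window untouched.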

3. Cooperative Feature Sharing and Aggregation

  • BEV Feature Sharing: Each CAV computes its fused BEV feature map $F_\text{fused}$ in its own coordinate frame and broadcasts it to peers using V2V communication (intermediate fusion).
  • Coordinate Alignment: Upon reception, each peer reprojects others’ $F_\text{fused}$ into its ego BEV grid using relative pose obtained from GPS/IMU, employing bilinear feature sampling for spatial alignment.
  • 3D CNN Aggregator: A stack of three Conv3D layers (stride 1, each followed by batch normalization and ReLU) aggregates BEV features along the agent (temporal) dimension. Aggregation is performed by element-wise summation across agents, followed by Conv3D.
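
The alignment and summation steps above can be sketched as follows. The sketch assumes a planar relative pose `(yaw, tx, ty)` expressed in grid cells that maps ego cell coordinates to peer cell coordinates about the grid centre; that convention, and the function names `warp_to_ego` and `aggregate`, are assumptions for illustration. The Conv3D stack that follows the summation is omitted.

```python
import numpy as np

def warp_to_ego(feat, yaw, tx, ty):
    """Bilinearly resample a peer's (H, W, C) BEV map onto the ego grid.

    For each ego cell, the source location in the peer map is
    R(yaw) @ p_ego + (tx, ty), with p_ego measured from the grid centre.
    Cells that fall outside the peer map receive zeros.
    """
    H, W, C = feat.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ex, ey = cols - cx, rows - cy
    c, s = np.cos(yaw), np.sin(yaw)
    px = c * ex - s * ey + tx + cx  # source column in the peer map
    py = s * ex + c * ey + ty + cy  # source row in the peer map

    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    wx, wy = px - x0, py - y0
    out = np.zeros((H, W, C))
    corners = [(0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
               (1, 0, wy * (1 - wx)), (1, 1, wy * wx)]
    for dy, dx, w in corners:
        yy, xx = y0 + dy, x0 + dx
        valid = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        out[valid] += w[valid][:, None] * feat[yy[valid], xx[valid]]
    return out

def aggregate(ego_feat, peer_feats, peer_poses):
    """Element-wise sum of the ego map and all peers warped into the ego frame."""
    total = ego_feat.astype(float)  # astype returns a copy, so the input is untouched
    for f, (yaw, tx, ty) in zip(peer_feats, peer_poses):
        total += warp_to_ego(f, yaw, tx, ty)
    return total
```

In a deep learning framework the same resampling would usually be done with a differentiable sampling operator so that gradients flow through the alignment.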

4. Training Protocol and Supervision

  • BEV Semantic Segmentation Loss: The output is four-class per-pixel softmax (background, vehicle, drivable area, lane) with weighted cross-entropy,

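$$\mathcal{L}_\text{seg} = -\sum_{i} w_{y_i} \log p_{i, y_i}$$

where $y_i$ is the ground truth class per pixel, $p_{i,c}$ is the predicted softmax probability of class $c$ at pixel $i$, and $w_c$ is the weight of class $c$.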

  • 3D Object Detection Loss: An SSD-style head predicts class confidences $\hat{p}$ and 3D box regressions $\hat{b}$. Classification uses focal loss,

$$\mathcal{L}_\text{cls} = -\alpha (1 - p_t)^{\gamma} \log p_t$$

and regression uses smooth $L_1$ loss,

$$\mathcal{L}_\text{reg} = \sum_{j} \text{smooth}_{L_1}(\hat{b}_j - b_j)$$

with the combined detection loss,

$$\mathcal{L}_\text{det} = \mathcal{L}_\text{cls} + \lambda \mathcal{L}_\text{reg}$$

Here $p_t$ is the predicted probability of the true class, $b_j$ are the ground truth box parameters, and $\lambda$ balances the two terms.

  • Optimization: AdamW optimizer with weight decay, a multi-step learning-rate scheduler, and early stopping on the validation set.
  • Ground Truths: BEV segmentations are rasterized from annotated 3D bounding boxes and drivable area/lane labels; 3D detection targets are defined from projected bounding boxes onto BEV anchors.
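
The three loss terms named in this section (weighted cross-entropy, focal loss, smooth $L_1$) can be written compactly in NumPy. This is a minimal sketch: the mean reduction, the binary form of the focal loss, and the defaults `alpha=0.25`, `gamma=2.0`, `beta=1.0` are common conventions assumed here, not hyperparameters taken from the paper.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """logits: (H, W, K) raw scores; labels: (H, W) ints; class_weights: (K,)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_p, labels[..., None], axis=-1)[..., 0]
    return float(-(class_weights[labels] * picked).mean())

def focal_loss(p, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; p holds predicted probabilities, targets are 0 or 1."""
    p_t = np.where(targets == 1, p, 1.0 - p)          # probability of the true class
    a_t = np.where(targets == 1, alpha, 1.0 - alpha)  # class-balancing factor
    log_pt = np.log(np.clip(p_t, 1e-12, 1.0))
    return float((-a_t * (1.0 - p_t) ** gamma * log_pt).mean())

def smooth_l1(pred, target, beta=1.0):
    """Quadratic below beta, linear above it, averaged over all elements."""
    d = np.abs(pred - target)
    return float(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean())
```

The focal term down-weights well-classified anchors, which is why a confident correct prediction contributes far less than an uncertain one.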

5. Experimental Evaluation

CoBEVFusion was evaluated on OPV2V’s Default CARLA Towns split (6765 train / 1980 val / 2170 test), and its performance metrics included per-class and mean intersection-over-union (mIoU) for BEV segmentation, and average precision (AP) at IoU thresholds 0.5 and 0.7 for 3D object detection.

| Model / Task | mIoU Vehicle (%) | mIoU Drivable Area (%) | mIoU Lane (%) | AP@0.5 (%) | AP@0.7 (%) |
|---|---|---|---|---|---|
| DWCA (LC), Single-vehicle | 40.4 | 61.4 | 47.6 | 83.4 | 64.7 |
| CoBEVFusion (LC), Coop. | 59.5 | 61.7 | 50.9 | 92.5 | 86.5 |

DWCA outperforms LiDAR-only (PointPillars) and BEVFusion approaches for both segmentation and detection (Qiao et al., 2023). Under cooperative perception, CoBEVFusion reaches 59.5% vehicle mIoU and 92.5% AP@0.5, surpassing LiDAR-only CAV strategies. On domain-adapted Culver City data, CoBEVFusion achieves 79.4% AP. Ablation studies show that:

  • WCA(LC) (LiDAR query, Camera key/value) yields improved vehicle AP.
  • WCA(CL) (Camera query, LiDAR key/value) improves drivable area/lane.
  • DWCA (combining both) surpasses either variant alone in all reported metrics.
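
For reference, the segmentation metric used in this section can be computed as in the sketch below. It follows the standard definition of per-class intersection-over-union on label maps; the helper names are hypothetical, and classes absent from both prediction and ground truth are skipped in the mean, which is one common convention and may differ from the benchmark's evaluation script.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """IoU for each class on integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A class missing from both maps has no defined IoU.
        ious.append(inter / union if union > 0 else float("nan"))
    return np.array(ious)

def mean_iou(pred, gt, num_classes):
    """Mean of the defined per-class IoUs."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))

pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
print(per_class_iou(pred, gt, 2))  # class 0: 1/2, class 1: 2/3
```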

6. Computational and Communication Characteristics

  • Inference Runtime: On an A100 GPU, inference achieves approximately 10 fps.
  • Computational Complexity: Each WCA module incurs $O(H W M^2 C)$ operations for window size $M$; a global MHSA over all $HW$ tokens contributes $O((HW)^2 C)$.
  • Communication Overhead: Fused BEV features ($H \times W \times C$ values at 4 bytes each in float32) result in ~10 MB per vehicle per frame. This is lower than a raw point cloud (~50 MB), but significantly above late-fusion prediction-level protocols (~hundreds of kB).
  • Robustness Assumptions: The architecture presumes lossless, low-latency peer-to-peer broadcast within CAV ranges. No experimental data is provided regarding adverse impact from packet loss, temporal asynchrony, or pose misalignment. This suggests further study of robustness to communication impairments is warranted.
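
The payload and attention-cost figures above follow from simple arithmetic, sketched below. The dimensions `H = W = 200`, `C = 64`, and window size `M = 8` are hypothetical values picked so that the payload lands near the quoted ~10 MB; they are not the paper's configuration. The operation counts cover only the two attention matrix products and ignore the linear projections.

```python
def bev_payload_bytes(H, W, C, bytes_per_value=4):
    """Size of one dense (H, W, C) feature map, float32 by default."""
    return H * W * C * bytes_per_value

def window_attention_ops(H, W, C, M):
    """Multiply-accumulates for QK^T and attn @ V with M x M windows."""
    num_windows = (H // M) * (W // M)
    tokens = M * M
    return 2 * num_windows * tokens * tokens * C

def global_attention_ops(H, W, C):
    """Same count when every one of the H*W tokens attends to every other."""
    tokens = H * W
    return 2 * tokens * tokens * C

H, W, C, M = 200, 200, 64, 8  # hypothetical dimensions for illustration
print(bev_payload_bytes(H, W, C) / 1e6)  # payload in MB
print(global_attention_ops(H, W, C) // window_attention_ops(H, W, C, M))  # cost ratio
```

With these assumed dimensions the payload is 10.24 MB, and windowing reduces the attention cost by a factor of $HW / M^2 = 625$ relative to global attention.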

7. Limitations and Prospective Research

The study omits ablations on window size, attention head number, bandwidth constraints, and message synchronization, which constitute salient directions for future investigation. The communication model assumes ideal conditions, and resource-aware or fault-tolerant extensions are currently unaddressed. Systematic exploration of attention window partitioning and evaluation under network constraints may provide insights into scaling multimodal, BEV-based cooperative frameworks.

The DWCA module in CoBEVFusion successfully reconciles LiDAR’s high geometric fidelity with the textural reach of camera imagery in BEV space. The resulting performance sets a new bar for multimodal, cooperative vehicle perception, while suggesting further inquiry into adaptive windowing, efficient fusion under volume constraints, and robust aggregation under less-ideal network conditions (Qiao et al., 2023).

References (1)
