
MBFNet: Motion-Guided BEV Fusion Network

Updated 17 November 2025
  • The paper introduces MBFNet, a neural architecture that applies motion-guided deformable attention to align BEV features across consecutive frames.
  • It employs motion feature extraction and heatmap-based consistency loss to effectively handle dynamic scene changes and object motion.
  • Integrating MBFNet within OnlineBEV achieves state-of-the-art performance on nuScenes, improving NDS and mAP in camera-only 3D detection.

Motion-Guided BEV Fusion Network (MBFNet) is a neural architecture designed to enable recurrent, temporally consistent fusion of bird's eye view (BEV) feature representations from multi-frame sequences in camera-based 3D perception for autonomous driving. By extracting motion features and leveraging deformable alignment between consecutive BEV frames, MBFNet directly addresses the challenge of dynamic scene changes and object motion, enabling efficient temporal aggregation while maintaining spatial alignment. MBFNet forms the core of the OnlineBEV methodology, achieving state-of-the-art results on nuScenes for camera-only 3D object detection and robust performance across a range of scenarios (Koh et al., 11 Jul 2025). Related works such as MotionBEV apply analogous motion-guided BEV fusion strategies in LiDAR-based moving object segmentation domains (Zhou et al., 2023).

1. Foundational Concepts and Motivation

Multi-camera 3D perception systems rely on projecting perspective image features into a BEV domain for robust spatial reasoning. Temporal fusion of BEV features across frames improves detection by leveraging multi-frame context, but naive aggregation faces critical obstacles. Dynamic environments induce feature misalignment due to unmodeled object motion, even after standard ego-motion compensation. MBFNet addresses this by:

  • Explicitly extracting motion cues—quantifying local changes in BEV features between successive frames.
  • Dynamically aligning historical BEV features to the present using motion-guided deformable attention.
  • Supervising temporal alignment via heatmap-based consistency loss, directly penalizing misalignments in detection outputs.

This design contrasts with prior camera-based BEV fusion methods, which either fuse shallow BEV features or aggregate multiple frames without adaptive alignment, resulting in plateaued performance gains or degraded robustness as the temporal window increases (Koh et al., 11 Jul 2025).

2. Architecture and Internal Dataflow

MBFNet is instantiated at each transformer layer within OnlineBEV's recurrent fusion block. Inputs to layer $l$ at frame $t$ are two BEV queries: the current $q_t^{(l)} \in \mathbb{R}^{C \times H \times W}$ and the historical $q_{t-1}^{(l)}$ carried forward from $t-1$.

Architectural Modules:

  • Motion Feature Extractor (MFE):

    • Computes the local difference $\Delta q^{(l)} = q_t^{(l)} - q_{t-1}^{(l)}$, passing it through a two-layer spatially-shared MLP and a Channel-Wise Attention (CWA) block:

    $M_t^{(l)} = \mathrm{CWA}(\mathrm{FC}(\Delta q^{(l)}))$

    This produces motion context $M_t^{(l)} \in \mathbb{R}^{C \times H \times W}$ to guide alignment.

  • Motion-Guided BEV Warping Attention (MGWA):

    • Utilizes $M_t^{(l)}$ to parameterize a deformable cross-attention (DeformAttn) over $q_{t-1}^{(l)}$ at each BEV spatial location $p$:

    $Z_{t-1}^{(l)}(p) = \mathrm{DeformAttn}\big(Q = M_t^{(l)}(p),\ \mathrm{ref} = p,\ V = q_{t-1}^{(l)}\big)$

    The attention learns offsets and weights for spatial resampling, adaptively warping historical features to align with current ones.

  • Residual Updates:

    • Aligned features pass through transformer block layers (residual, layer norm, FFNs) and produce next-layer queries:

    $\hat{q}_{t-1}^{(l)} = \mathrm{LN}\big(q_{t-1}^{(l)} + \mathrm{Dropout}(Z_{t-1}^{(l)})\big)$
    $q_{t-1}^{(l+1)} = \mathrm{FFN}_1(\hat{q}_{t-1}^{(l)})$
    $q_t^{(l+1)} = \mathrm{FFN}_2(q_t^{(l)} \oplus \hat{q}_{t-1}^{(l)})$

    After $L$ layers, the recurrent outputs are $H_t = q_t^{(L)}$ and $\hat{H}_t = q_{t-1}^{(L)}$.

The architectural design accommodates variable input resolution and backbone depth: ResNet-50 with $H = W = 128$ and $C = 256$, or larger backbones and inputs with $H = W = 256$, while maintaining the same channel dimension.
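The following PyTorch sketch illustrates one such layer based on the description above. It is a hypothetical re-implementation, not the authors' code: the channel-wise attention is assumed to be a squeeze-and-excitation style gate, and the motion-guided deformable attention is approximated by predicting a single sampling offset per BEV cell and warping the historical features with `F.grid_sample`. Module names (`MotionFeatureExtractor`, `MBFNetLayer`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionFeatureExtractor(nn.Module):
    """MFE: BEV feature difference -> spatially shared MLP -> channel-wise attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Two-layer spatially shared MLP, implemented as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )
        # Channel-wise attention, here a squeeze-and-excitation style gate (assumed form).
        self.cwa = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, q_t, q_prev):
        delta = q_t - q_prev              # local BEV difference between frames
        feat = self.mlp(delta)
        return feat * self.cwa(feat)      # motion context M_t


class MBFNetLayer(nn.Module):
    """One fusion layer: MFE + simplified motion-guided warping + residual update."""

    def __init__(self, channels: int):
        super().__init__()
        self.mfe = MotionFeatureExtractor(channels)
        self.offset_head = nn.Conv2d(channels, 2, 3, padding=1)   # per-cell (dx, dy) offset
        self.norm = nn.LayerNorm(channels)
        self.drop = nn.Dropout(0.1)
        self.ffn_hist = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )
        self.ffn_cur = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, q_t, q_prev):
        B, _, H, W = q_t.shape
        motion = self.mfe(q_t, q_prev)
        # Motion-guided warping: resample q_prev at offset locations (stand-in for DeformAttn).
        offsets = self.offset_head(motion).permute(0, 2, 3, 1)            # (B, H, W, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=q_t.device),
            torch.linspace(-1, 1, W, device=q_t.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        warped = F.grid_sample(q_prev, base + offsets, align_corners=True)
        # Residual update, layer norm over the channel dimension, then FFNs.
        hist = (q_prev + self.drop(warped)).permute(0, 2, 3, 1)
        hist = self.norm(hist).permute(0, 3, 1, 2)                        # aligned history
        return self.ffn_cur(torch.cat((q_t, hist), dim=1)), self.ffn_hist(hist)
```

Stacking $L$ such layers and taking the outputs of the last one yields $H_t$ and $\hat{H}_t$ as defined above.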

3. Temporal Feature Alignment and Supervision

The dynamic temporal alignment mechanism uses MFE-produced motion context to guide MGWA, enabling explicit warping rather than static location matching. This compensates for feature misalignments generated by object motion and dynamic scene changes.

Temporal consistency is supervised with a heatmap-based loss. Two detection heads with shared weights map $H_t$ and $\hat{H}_t$ into heatmaps $Q_t$ and $\hat{Q}_t$:

$\mathcal{L}_{\mathrm{cons}} = \| Q_t - \hat{Q}_t \|_2^2$

Gradient flow is blocked on $Q_t$, ensuring that only $\hat{H}_t$ is adjusted, which directly penalizes historical misalignment. This makes the warping process end-to-end trainable and robust to inter-frame inconsistencies.
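A minimal sketch of this loss, assuming a shared detection head `det_head` that produces class heatmaps from BEV features (the name and shapes are illustrative):

```python
import torch.nn.functional as F

def consistency_loss(det_head, H_t, H_hat_t):
    """Heatmap consistency loss with stop-gradient on the current-frame heatmap."""
    Q_t = det_head(H_t).detach()        # block gradients: current heatmap serves as target
    Q_hat_t = det_head(H_hat_t)         # aligned-history heatmap receives the gradient
    return F.mse_loss(Q_hat_t, Q_t, reduction="sum")   # = ||Q_t - Q_hat_t||_2^2
```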

4. OnlineBEV Integration and Inference Pipeline

MBFNet is deployed within OnlineBEV as the core recurrent fusion mechanism. The inference cycle at each timestep $t$:

  • Multi-camera perspective features are projected into BEV using a shared backbone and BEV generator (e.g., Lift-Splat-Shoot).
  • Transformer inputs are initialized:

$q_t^{(0)} = F_t, \quad q_{t-1}^{(0)} = H_{t-1}$, where $F_t$ is the BEV feature map produced for frame $t$ and $H_{t-1}$ is the stored recurrent state.

  • MBFNet blocks are applied layerwise, updating both current and historical BEV queries.
  • Final BEV feature $H_t$ and aligned history $\hat{H}_t$ are produced.
  • 3D detection heads predict outputs from $H_t$; the consistency loss applies during training.
  • Only $H_t$ is stored for the next step, keeping memory requirements minimal.

This recurrent state update is computationally efficient, obviating the need to cache and process multiple historical frames.
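A minimal sketch of this recurrent state update, reusing the hypothetical `MBFNetLayer` from the earlier sketch and assuming placeholder `bev_generator` and `det_head` modules (all names here are illustrative):

```python
import torch.nn as nn

class OnlineBEVFusion(nn.Module):
    """Recurrent BEV fusion: carries a single BEV state H_t between timesteps."""

    def __init__(self, channels: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(MBFNetLayer(channels) for _ in range(num_layers))

    def forward(self, F_t, H_prev=None):
        # Initialize queries: current frame from the BEV generator, history from the state.
        q_t = F_t
        q_prev = H_prev if H_prev is not None else F_t   # first frame: no history yet
        for layer in self.layers:
            q_t, q_prev = layer(q_t, q_prev)
        return q_t, q_prev                               # H_t and aligned history Ĥ_t

# Streaming usage: only H_t is carried forward, so memory cost is a single BEV map.
# H_t = None
# for images in camera_stream:
#     F_t = bev_generator(images)        # e.g., a Lift-Splat-Shoot style projection
#     H_t, H_hat_t = fusion(F_t, H_t)
#     detections = det_head(H_t)
```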

5. Performance, Ablation Results, and Comparative Analysis

On the nuScenes benchmark, OnlineBEV with MBFNet achieves an NDS of 63.9% (V2-99 backbone), exceeding SparseBEV by 0.3 NDS and SOLOFusion by 2.0 NDS. Ablation studies demonstrate:

  • Recurrent fusion without motion guidance lifts NDS from 41.9% (single-frame) to 50.4%.
  • Incorporating MBFNet's MFE+MGWA boosts NDS to 51.4% (+1.0) and mAP by approximately 1.3%.
  • Temporal consistency loss adds a further +0.5% NDS (total 51.9% in ablation).
  • MFE variants show that difference-based inputs contribute 0.2 points, and CWA further adds 0.3 points.

MBFNet’s motion-guided deformable attention yields pronounced advantages in dynamic settings and for longer fusion windows, enhancing robustness to motion blur and occlusion.

6. Related Approach: MotionBEV for LiDAR Moving Object Segmentation

MotionBEV (Zhou et al., 2023) applies motion-guided BEV fusion to LiDAR moving object segmentation, demonstrating analogous principles:

  • LiDAR scans are projected into a polar BEV grid, and height-difference motion channels are calculated between sliding windows of frames (a minimal sketch follows this list).
  • Appearance features are learned with a simplified PointNet, while the dual-branch network fuses appearance and motion features using a stage-wise co-attention module (AMCM).
  • This network achieves 69.7% IoU on SemanticKITTI-MOS (full window) and 90.8% IoU on SipailouCampus (solid-state LiDAR), with 23 ms inference time on RTX 3090 for a batch size of 8.
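A minimal sketch of the height-difference motion channel computation, under assumed grid bounds and resolution (the binning, ranges, and function names are illustrative, not the paper's exact settings), and assuming scans are already ego-motion compensated into a common frame:

```python
import numpy as np

def polar_bev_max_height(points, r_bins=480, a_bins=360, r_max=50.0):
    """points: (N, 3) array of x, y, z. Returns per-cell max height in a polar BEV grid."""
    r = np.hypot(points[:, 0], points[:, 1])
    a = np.arctan2(points[:, 1], points[:, 0])            # azimuth in [-pi, pi)
    ri = np.clip((r / r_max * r_bins).astype(int), 0, r_bins - 1)
    ai = np.clip(((a + np.pi) / (2 * np.pi) * a_bins).astype(int), 0, a_bins - 1)
    grid = np.full((r_bins, a_bins), -np.inf)
    np.maximum.at(grid, (ri, ai), points[:, 2])           # per-cell maximum z
    return grid

def motion_channel(scan_t, scan_past):
    """Height-difference motion channel between two ego-compensated scans."""
    h_t, h_p = polar_bev_max_height(scan_t), polar_bev_max_height(scan_past)
    valid = np.isfinite(h_t) & np.isfinite(h_p)           # cells observed in both scans
    return np.where(valid, np.abs(h_t - h_p), 0.0)
```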

A plausible implication is that motion-guided BEV fusion is broadly applicable, extending beyond camera-based 3D detection to segmentation and tracking tasks across sensor modalities.

7. Technical Implications and Limitations

MBFNet’s approach allows temporally coherent recurrent aggregation with low memory overhead, but it is contingent on accurate motion extraction and warping quality. Failure modes may arise from ambiguous motion cues or scenes with persistent occlusion. Computational demand scales with transformer depth and BEV spatial resolution. The fundamental assumption is that motion-induced feature differences at the BEV level accurately reflect object displacement.

Subsequent research may seek more generalized warping mechanisms, alternative temporal consistency objectives, or adaptation to broader sensor fusion contexts. The demonstrated gains of MBFNet in OnlineBEV and of related motion-guided fusion in MotionBEV suggest a central role for explicit motion modeling in future BEV-based perception systems.
