MBFNet: Motion-Guided BEV Fusion Network
- The paper introduces MBFNet, a neural architecture that applies motion-guided deformable attention to align BEV features across consecutive frames.
- It employs motion feature extraction and heatmap-based consistency loss to effectively handle dynamic scene changes and object motion.
- Integrating MBFNet within OnlineBEV achieves state-of-the-art performance on nuScenes, improving NDS and mAP in camera-only 3D detection.
Motion-Guided BEV Fusion Network (MBFNet) is a neural architecture designed to enable recurrent, temporally consistent fusion of bird's eye view (BEV) feature representations from multi-frame sequences in camera-based 3D perception for autonomous driving. By extracting motion features and leveraging deformable alignment between consecutive BEV frames, MBFNet directly addresses the challenge of dynamic scene changes and object motion, enabling efficient temporal aggregation while maintaining spatial alignment. MBFNet forms the core of the OnlineBEV methodology, achieving state-of-the-art results on nuScenes for camera-only 3D object detection and robust performance across a range of scenarios (Koh et al., 11 Jul 2025). Related works such as MotionBEV apply analogous motion-guided BEV fusion strategies in LiDAR-based moving object segmentation domains (Zhou et al., 2023).
1. Foundational Concepts and Motivation
Multi-camera 3D perception systems rely on projecting perspective image features into a BEV domain for robust spatial reasoning. Temporal fusion of BEV features across frames improves detection by leveraging multi-frame context, but naive aggregation faces critical obstacles. Dynamic environments induce feature misalignment due to unmodeled object motion, even after standard ego-motion compensation. MBFNet addresses this by:
- Explicitly extracting motion cues—quantifying local changes in BEV features between successive frames.
- Dynamically aligning historical BEV features to the present using motion-guided deformable attention.
- Supervising temporal alignment via heatmap-based consistency loss, directly penalizing misalignments in detection outputs.
This design contrasts with prior camera-based BEV fusion methods, which either fuse shallow BEV features or aggregate multiple frames without adaptive alignment, resulting in plateaued performance gains or degraded robustness as the temporal window grows (Koh et al., 11 Jul 2025).
2. Architecture and Internal Dataflow
MBFNet is instantiated at each transformer layer within OnlineBEV's recurrent fusion block. Inputs to layer $l$ at frame $t$ are two BEV queries: the current query $Q_t^{(l)}$ and the historical query $\tilde{Q}_{t-1}^{(l)}$ carried forward from frame $t-1$.
Architectural Modules:
- Motion Feature Extractor (MFE):
- Computes the local difference $\Delta_t^{(l)} = Q_t^{(l)} - \tilde{Q}_{t-1}^{(l)}$ and passes it through a two-layer spatially-shared MLP followed by a Channel-Wise Attention (CWA) block: $M_t^{(l)} = \mathrm{CWA}\big(\mathrm{MLP}(\Delta_t^{(l)})\big)$.
- This produces the motion context $M_t^{(l)}$ that guides alignment.
- Motion-Guided BEV Warping Attention (MGWA):
- Utilizes $M_t^{(l)}$ to parameterize a deformable cross-attention (DeformAttn) over $\tilde{Q}_{t-1}^{(l)}$ at each BEV spatial location $p$: $\hat{Q}_{t-1}^{(l)}(p) = \mathrm{DeformAttn}\big(M_t^{(l)}(p),\, p,\, \tilde{Q}_{t-1}^{(l)}\big)$.
- Learns offsets and weights for spatial resampling, adaptively warping historical features to align with current ones.
- Residual Updates:
- Aligned features pass through the remaining transformer block operations (residual connections, layer normalization, FFNs) to produce the next-layer queries $Q_t^{(l+1)}$ and $\tilde{Q}_{t-1}^{(l+1)}$.
- After $L$ layers, the recurrent outputs are the fused current feature $B_t = Q_t^{(L)}$ and the aligned history $\tilde{B}_{t-1} = \tilde{Q}_{t-1}^{(L)}$ (a minimal sketch of these modules follows this list).
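The following PyTorch sketch illustrates the MFE and a simplified MGWA under stated assumptions: the CWA block is approximated with squeeze-and-excitation-style gating, and the deformable attention is reduced to a single sampling point per location, which turns it into offset-based bilinear resampling via `grid_sample`. Module and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionFeatureExtractor(nn.Module):
    """Sketch of the MFE: BEV feature difference -> shared MLP -> channel-wise attention."""
    def __init__(self, channels: int):
        super().__init__()
        # Two-layer spatially-shared MLP implemented as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        # Channel-wise attention approximated as squeeze-and-excitation gating (assumption).
        self.cwa = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, q_cur, q_hist):
        delta = q_cur - q_hist            # local BEV feature difference
        motion = self.mlp(delta)
        return motion * self.cwa(motion)  # motion context M_t


class MotionGuidedWarp(nn.Module):
    """Simplified MGWA: the motion context predicts per-location offsets used to
    resample the historical BEV features (single sampling point, i.e. bilinear warping)."""
    def __init__(self, channels: int):
        super().__init__()
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, motion, q_hist):
        b, _, h, w = q_hist.shape
        offsets = self.offset_head(motion)              # (B, 2, H, W), normalized units
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=q_hist.device),
            torch.linspace(-1, 1, w, device=q_hist.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = base + offsets.permute(0, 2, 3, 1)       # per-location sampling grid
        # Warp historical features toward the current frame.
        return F.grid_sample(q_hist, grid, align_corners=True)
```

In the full model, multiple sampling points and attention weights per location would be learned, and the warped features would feed the standard transformer residual and FFN updates described above.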
The architectural design accommodates variable input resolution and backbone depth, from ResNet-50 with lower-resolution inputs to larger backbones (e.g., V2-99) with higher-resolution inputs, while maintaining a fixed BEV channel dimension.
3. Temporal Feature Alignment and Supervision
The dynamic temporal alignment mechanism uses MFE-produced motion context to guide MGWA, enabling explicit warping rather than static location matching. This compensates for feature misalignments generated by object motion and dynamic scene changes.
Temporal consistency is supervised with a heatmap-based loss. Two detection heads with shared weights map $B_t$ and $\tilde{B}_{t-1}$ into class heatmaps $H_t$ and $\tilde{H}_{t-1}$, and the loss penalizes their discrepancy:

$$\mathcal{L}_{\mathrm{consist}} = \ell\big(\operatorname{sg}(H_t),\, \tilde{H}_{t-1}\big),$$

where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation and $\ell$ is the heatmap matching loss.
Gradient flow is blocked on $H_t$, ensuring that only $\tilde{H}_{t-1}$ (and hence the warping of the historical features) is adjusted, directly penalizing historical misalignment. This makes the warping process end-to-end trainable and robust to inter-frame inconsistencies.
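A minimal sketch of this supervision, assuming a shared-weight detection head `det_head` and an L2 matching penalty (the paper's exact loss form is not reproduced here); the stop-gradient is implemented with `detach()`:

```python
import torch.nn.functional as F

def heatmap_consistency_loss(det_head, b_cur, b_hist_aligned):
    """Hypothetical helper: a shared-weight head maps both BEV features to class
    heatmaps; the current heatmap is detached so only the aligned historical
    branch (and hence the warping) receives gradients."""
    h_cur = det_head(b_cur).sigmoid().detach()       # H_t, gradient blocked
    h_hist = det_head(b_hist_aligned).sigmoid()      # aligned historical heatmap
    # The exact matching loss is an assumption; an L2 penalty is used here.
    return F.mse_loss(h_hist, h_cur)
```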
4. OnlineBEV Integration and Inference Pipeline
MBFNet is deployed within OnlineBEV as the core recurrent fusion mechanism. The inference cycle at each timestep $t$ proceeds as follows:
- Multi-camera perspective features are projected into a BEV representation $F_t$ using a shared backbone and BEV generator (e.g., Lift-Splat-Shoot).
- Transformer inputs are initialized as $Q_t^{(0)} = F_t$ and $\tilde{Q}_{t-1}^{(0)} = B_{t-1}$, the fused BEV feature stored from the previous step.
- MBFNet blocks are applied layerwise, updating both current and historical BEV queries.
- The final fused BEV feature $B_t$ and the aligned history $\tilde{B}_{t-1}$ are produced.
- 3D detection heads predict outputs from $B_t$; the consistency loss is applied during training.
- Only the fused feature $B_t$ is stored for the next step, keeping memory requirements minimal (see the step sketch below).
This recurrent state update is computationally efficient, obviating the need to cache and process multiple historical frames.
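The recurrence can be summarized in a short sketch; all module names are placeholders for the corresponding OnlineBEV components, and bootstrapping the history with the current feature on the first frame is an assumption:

```python
import torch

@torch.no_grad()
def onlinebev_step(images, bev_generator, mbfnet_layers, det_head, prev_state=None):
    """Illustrative single inference step; bev_generator, mbfnet_layers and
    det_head are hypothetical callables standing in for the real modules."""
    f_t = bev_generator(images)                              # current BEV feature F_t
    q_cur = f_t
    q_hist = prev_state if prev_state is not None else f_t   # bootstrap on the first frame
    for layer in mbfnet_layers:                              # recurrent fusion block
        q_cur, q_hist = layer(q_cur, q_hist)                 # MFE + MGWA + transformer updates
    detections = det_head(q_cur)                             # predict from fused feature B_t
    return detections, q_cur                                 # q_cur (B_t) is the only stored state
```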
5. Performance, Ablation Results, and Comparative Analysis
On the nuScenes benchmark, OnlineBEV with MBFNet achieves an NDS of 63.9% (V2-99 backbone), exceeding SparseBEV by 0.3 NDS and SOLOFusion by 2.0 NDS. Ablation studies demonstrate:
- Recurrent fusion without motion guidance lifts NDS from 41.9% (single-frame) to 50.4%.
- Incorporating MBFNet's MFE+MGWA boosts NDS to 51.4% (+1.0) and mAP by approximately 1.3%.
- Temporal consistency loss adds a further +0.5% NDS (total 51.9% in ablation).
- MFE variants show that difference-based inputs contribute 0.2 points, and CWA further adds 0.3 points.
MBFNet’s motion-guided deformable attention yields pronounced advantages in dynamic settings and for longer fusion windows, enhancing robustness to motion blur and occlusion.
6. Connection to Related Work: MotionBEV and BEV Fusion Methodologies
MotionBEV (Zhou et al., 2023) applies motion-guided BEV fusion in the domain of LiDAR moving object segmentation, demonstrating analogous principles:
- LiDAR scans are projected into a polar BEV grid, and height-difference motion channels are computed between sliding windows of frames (a toy sketch follows this list).
- Appearance features are learned with a simplified PointNet, while the dual-branch network fuses appearance and motion features using a stage-wise co-attention module (AMCM).
- This network achieves 69.7% IoU on SemanticKITTI-MOS (full window) and 90.8% IoU on SipailouCampus (solid-state LiDAR), with 23 ms inference time on RTX 3090 for a batch size of 8.
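A toy sketch of the polar-BEV height-difference idea, assuming max-height rasterization and omitting ego-motion alignment and MotionBEV's exact binning; all names and bin counts are illustrative:

```python
import numpy as np

def polar_bev_height(points, r_bins=480, a_bins=360, r_max=50.0):
    """Max-height polar BEV grid from one LiDAR scan (bin counts are illustrative)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    radius = np.sqrt(x ** 2 + y ** 2)
    angle = np.arctan2(y, x)
    ri = np.clip((radius / r_max * r_bins).astype(int), 0, r_bins - 1)
    ai = np.clip(((angle + np.pi) / (2 * np.pi) * a_bins).astype(int), 0, a_bins - 1)
    grid = np.full((r_bins, a_bins), -np.inf)
    np.maximum.at(grid, (ri, ai), z)        # per-cell maximum height
    grid[np.isneginf(grid)] = 0.0           # empty cells default to zero
    return grid

def motion_channels(scans):
    """Height-difference motion channels between the latest scan and each
    earlier scan in the sliding window (ego-motion alignment omitted)."""
    grids = [polar_bev_height(s) for s in scans]   # scans: list of (N, 3+) point arrays
    return np.stack([grids[-1] - g for g in grids[:-1]], axis=0)
```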
A plausible implication is broad applicability of motion-guided BEV fusion, not only for camera-based 3D detection but also for segmentation and tracking tasks across sensor modalities.
7. Technical Implications and Limitations
MBFNet’s approach allows for temporally coherent recurrent aggregation with low memory overhead, but it is contingent on accurate motion extraction and warping quality. Failure modes may arise from ambiguous motion cues or scenes with persistent occlusion. Computational demand grows with transformer depth and BEV spatial resolution. The fundamental assumption is that motion-induced feature differences accurately represent object displacement at the BEV level.
Subsequent research may seek more generalized warping mechanisms, alternative temporal consistency objectives, or adaptation to broader sensor fusion contexts. The demonstrated gains of MBFNet in OnlineBEV and of related motion-guided fusion in MotionBEV suggest a central role for explicit motion modeling in future BEV-based perception systems.