BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

Published 3 Apr 2026 in cs.CV | (2604.02930v1)

Abstract: A robust awareness of how dynamic scenes evolve is essential for Autonomous Driving systems, as they must accurately detect, track, and predict the behaviour of surrounding obstacles. Traditional perception pipelines that rely on modular architectures tend to suffer from cumulative errors and latency. Instance Prediction models provide a unified solution, performing Bird's-Eye-View segmentation and motion estimation across current and future frames using information directly obtained from different sensors. However, a key challenge in these models lies in the effective processing of the dense spatial and temporal information inherent in dynamic driving environments. This level of complexity demands architectures capable of capturing fine-grained motion patterns and long-range dependencies without compromising real-time performance. We introduce BEVPredFormer, a novel camera-only architecture for BEV instance prediction that uses attention-based temporal processing to improve temporal and spatial comprehension of the scene and relies on an attention-based 3D projection of the camera information. BEVPredFormer employs a recurrent-free design that incorporates gated transformer layers, divided spatio-temporal attention mechanisms, and multi-scale head tasks. Additionally, we incorporate a difference-guided feature extraction module that enhances temporal representations. Extensive ablation studies validate the effectiveness of each architectural component. When evaluated on the nuScenes dataset, BEVPredFormer was on par or surpassed State-Of-The-Art methods, highlighting its potential for robust and efficient Autonomous Driving perception.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a camera-only BEV instance prediction model that leverages spatio-temporal attention and gated transformer layers for improved motion forecasting.
It employs a difference-guided module and attention-based BEV projection to enhance feature extraction and achieve robust segmentation and flow estimation.
Empirical results on nuScenes show that the model yields superior VPQ and IoU scores, especially in long-range and complex urban scenarios.

BEVPredFormer: Spatio-Temporal Attention for BEV Instance Prediction in Autonomous Driving

Overview of BEVPredFormer Architecture

BEVPredFormer is a camera-only BEV instance prediction model designed for robust perception in autonomous driving. The architecture integrates attention-based temporal processing modules to effectively capture spatio-temporal dependencies. It operates without recurrences in its temporal modeling, leveraging gated transformer layers for both spatial and temporal attention, as well as a multi-scale prediction head for future vehicle segmentation and motion prediction in BEV space.

The input consists of multi-camera sequences and vehicle localization information. Feature extraction is achieved via EfficientViT, maximizing attention efficiency with cascaded group attention. Image features are projected into BEV using BEVFormer, an attention-based cross-view method that mitigates the need for explicit depth estimation. Temporal consistency and fine-grained motion patterning are enhanced by a difference-guided feature extraction module, and spatio-temporal attention blocks inspired by PredFormer.

Figure 1: BEVPredFormer architectural flow: efficient feature extraction, BEV projection, temporal attention, and multi-scale predictive decoding.

Temporal Modeling and Difference-Guided Enhancement

The temporal module is a core innovation, comprising two main stages: the difference module and transformer-based spatio-temporal blocks. The difference module performs frame-wise subtraction between consecutive BEV feature maps, assigning additional weights to regions with high temporal variation, thereby amplifying motion-specific features for subsequent prediction tasks.

Figure 2: Difference module architecture captures temporal variances in BEV features for enhanced flow representation.

PredFormer-based blocks receive difference features and employ patch-wise embedding, gated transformer encoding, and divided spatial-temporal attention. Absolute positional encoding is incorporated to retain spatial coherence. Multiple attention configuration patterns are evaluated (Binary-TS, Triplet-TST, Quadruplet-TSST), with experiments confirming the advantage of temporal-first attention ordering for improved accuracy in forecasting.

Figure 3: PredFormer temporal attention—diagram of spatio-temporal processing and differentiated attention block patterns.

Instance Prediction: Flow and Segmentation

BEVPredFormer produces two principal outputs: backward optical flow maps in BEV (for pixel-wise displacement tracking) and BEV semantic segmentation maps (for occupancy and classification). A multi-scale prediction head encodes temporally enriched BEV features, decodes future states, and fuses outputs for instance ID propagation.

Crucially, backward flow formulation mitigates propagation errors typically induced by fast-moving objects and occlusions. Auxiliary supervision is applied to centerness and offset maps, refining BEV instance segmentation.

Performance Evaluation and Ablation

Evaluation on nuScenes adheres to standardized metrics: Intersection over Union (IoU) for per-frame segmentation, and Video Panoptic Quality (VPQ) for temporal instance tracking consistency. The model achieves superior segmentation accuracy and competitive VPQ scores, especially notable in long-range prediction scenarios.

Key numerical observations:

Long range: VPQ = 33.3, IoU = 40.9
Short range: VPQ = 54.9, IoU = 63.9

Comparative ablation shows that adoption of temporal attention and difference modules consistently improves both VPQ and IoU. Two temporal blocks (Triplet-TST) optimize the tradeoff between computation and accuracy. Image resolution is tightly coupled with performance, with increases yielding greater spatial detail but incurring additional computational latency.

Figure 4: Short-range results—consistent object correspondence between camera view and BEV prediction.

Figure 5: Long-range results—accurate vehicle predictions across varying distances and complex scene layouts.

Qualitative visualizations demonstrate robust detection and forecasting of vehicles in diverse scenarios, including urban intersections and adverse weather. While future motion accuracy is high, some degradation is observed in extended temporal horizons, particularly for objects distal from the ego vehicle.

Implications, Limitations, and Future Research Directions

Practically, BEVPredFormer's attention-driven, recurrent-free architecture offers improved computational efficiency and parallelization, supporting real-time operation. Its reliance on attention-based cross-view BEV projection obviates the need for depth estimation, allowing streamlined sensor fusion and compensation for ego-motion.

Theoretically, the model exemplifies the application of transformer-based—especially spatio-temporal—architectures to dense, structured instance forecasting from raw sensor data. Unlike multi-stage pipelines with sequential tracking, instance prediction enables end-to-end representation learning for joint detection and motion modeling.

Further development should explore:

Integration of non-camera sensors (LiDAR, RADAR) for multi-modal fusion
Extension of input temporal horizon for improved long-term motion prediction
Adaptive attention mechanisms to handle sparse BEV occupancy
Continual learning in dynamic environments with online updates

Conclusion

BEVPredFormer introduces a robust, efficient spatio-temporal attention framework for camera-based BEV instance prediction in autonomous driving. Its unified, attention-centric design outperforms depth-based models in segmentation and maintains comparable temporal instance quality. The architecture's adaptability signals promising directions for scalable, dense perception in future autonomous systems, setting a foundation for temporally consistent, scene-aware, semantics-guided BEV modeling.

Markdown Report Issue