BEVTraj: Bird's-Eye View Trajectory Prediction
- BEVTraj is a paradigm that converts scene data into a metric top-down view to predict future motion with spatial consistency.
- Architectural families range from image-to-image regression and graph-based prediction to transformer-based decoders using BEV representations.
- Empirical results show BEVTraj systems achieve lower prediction errors and faster inference, with ongoing research in robust multi-modal fusion.
Bird’s-Eye View Trajectory Prediction (BEVTraj) denotes a family of trajectory-forecasting formulations in which future motion is predicted in a top-down bird’s-eye-view coordinate frame rather than directly in perspective image space. In the cited literature, this paradigm appears in several forms: image-to-image regression over BEV occupancy maps for highway forecasting (Izquierdo et al., 2020, Izquierdo et al., 2022), graph-based ego-vehicle prediction from BEV object layouts (Sharma et al., 2023), camera-only holistic planning through BEV occupancy grid maps (Loukkal et al., 2020), joint surround-view BEV segmentation and ego-trajectory prediction (Sharma et al., 2023), dense BEV motion-flow prediction under self-supervision (Fang et al., 2024), direct attention over internal BEV features for behavior prediction (Gu et al., 2024), and map-free end-to-end multimodal prediction with deformable attention and sparse goal proposals (Kong et al., 12 Sep 2025). Across these works, the central motivation is consistent: BEV offers metric consistency for spatial reasoning, makes interactions easier to encode, and can align prediction outputs with downstream planning interfaces.
1. BEV formulation and scene representation
A defining property of BEVTraj methods is the explicit conversion of scene state into a metric top-down representation. In highway raster pipelines, a fixed BEV image of size $W \times H$ pixels covers a real-world rectangle $[X_{\min}, X_{\max}] \times [Y_{\min}, Y_{\max}]$, with pixel coordinates given by
$u = \mathrm{PPM}_x\,(X - X_{\min}), \qquad v = \mathrm{PPM}_y\,(Y - Y_{\min}),$
where $\mathrm{PPM}_x$ and $\mathrm{PPM}_y$ are the scale factors (pixels per meter) along each axis. The same line of work also gives a homogeneous mapping
$\begin{bmatrix}u\\v\\1\end{bmatrix} \sim \begin{bmatrix} \mathrm{PPM}_x & 0 & -X_{\min}\,\mathrm{PPM}_x\\ 0 & \mathrm{PPM}_y & -Y_{\min}\,\mathrm{PPM}_y\\ 0&0&1 \end{bmatrix} \begin{bmatrix}X\\Y\\1\end{bmatrix},$
and renders vehicles either as constant-intensity rectangles or as 2D Gaussians centered at the vehicle position (Izquierdo et al., 2022). In the HighD-based formulation, each vehicle is rendered by a bi-dimensional Gaussian and overlapping occupancies are fused by a per-pixel maximum, producing a common BEV heat-map over the entire traffic scene (Izquierdo et al., 2020).
The Gaussian encoding is not merely a visualization device. In the PREVENTION-based highway system, each pixel value is given by a 2D Gaussian centered on the vehicle, and the method reports that Gaussian rendering is the best-performing configuration (Izquierdo et al., 2022). In the earlier HighD study, this choice is likewise connected to the claim that prediction errors remain in the order of the representation resolution up to three seconds ahead (Izquierdo et al., 2020).
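The rasterization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the grid size, and the Gaussian spreads `sigma_u` and `sigma_v` (in pixels) are assumptions chosen for the example, and the Gaussian is taken to be axis-aligned and unnormalized.

```python
import numpy as np

def world_to_pixel(X, Y, x_min, y_min, ppm_x, ppm_y):
    """Map metric coordinates (X, Y) to BEV pixel coordinates (u, v)."""
    u = ppm_x * (X - x_min)
    v = ppm_y * (Y - y_min)
    return u, v

def render_gaussian_bev(centers_xy, width, height, x_min, y_min,
                        ppm_x, ppm_y, sigma_u=4.0, sigma_v=2.0):
    """Render each vehicle as a 2D Gaussian and fuse overlaps by per-pixel max."""
    uu, vv = np.meshgrid(np.arange(width), np.arange(height))  # each (H, W)
    bev = np.zeros((height, width), dtype=np.float64)
    for X, Y in centers_xy:
        uc, vc = world_to_pixel(X, Y, x_min, y_min, ppm_x, ppm_y)
        # Assumed form: axis-aligned Gaussian with peak value 1 at the center.
        g = np.exp(-0.5 * (((uu - uc) / sigma_u) ** 2 + ((vv - vc) / sigma_v) ** 2))
        bev = np.maximum(bev, g)  # per-pixel maximum, not a sum
    return bev

# Two hypothetical vehicles on a 50 m x 10 m patch at 4 pixels per meter.
bev = render_gaussian_bev([(10.0, 2.0), (30.0, 6.0)], width=200, height=40,
                          x_min=0.0, y_min=0.0, ppm_x=4.0, ppm_y=4.0)
```

Fusing by maximum rather than by sum keeps every peak at the same height, which is convenient when vehicle centers are later recovered by peak extraction.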
Other BEVTraj variants replace rasterized occupancy by object-centric structures. In the graph-enhanced ego-trajectory model, every detected object in the scene plus the ego vehicle is a graph node, each node carries a visual feature vector extracted from a cropped BEV mask, and the node state concatenates visual features, absolute coordinates, and sinusoidal positional encoding. Edges are created by K-nearest neighbors in the BEV plane, with weights given by inverse Euclidean distance (Sharma et al., 2023).
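As a rough illustration of the edge construction, the sketch below builds a directed K-nearest-neighbor adjacency matrix with inverse-distance weights from 2D BEV positions. The function name `knn_adjacency` and the small `eps` guard against division by zero are assumptions for the example; the cited work may handle ties, self-loops, and symmetry differently.

```python
import numpy as np

def knn_adjacency(positions, k, eps=1e-6):
    """Directed KNN graph: adj[i, j] = 1 / distance(i, j) if j is among i's k nearest."""
    positions = np.asarray(positions, dtype=np.float64)
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbor
    adj = np.zeros((n, n))
    k = min(k, n - 1)
    for i in range(n):
        nbrs = np.argsort(dist[i])[:k]
        adj[i, nbrs] = 1.0 / (dist[i, nbrs] + eps)
    return adj

# Ego at the origin plus two hypothetical objects along the x axis.
adj = knn_adjacency([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], k=1)
```

Closer objects receive larger weights, so the nearest vehicle dominates message passing into each node.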
Camera-based systems often construct BEV implicitly through geometry. The monocular “Flatmobiles” pipeline predicts drivable-area segmentation and vehicle-footprint segmentation in camera view, then warps them into BEV occupancy grid maps by a homography. The critical constraint is that only the vehicle footprint is segmented for warping, so that the flat-world hypothesis implied by the homography is respected (Loukkal et al., 2020). Surround-view systems generalize this idea by fusing multiple calibrated cameras into a BEV feature or mask representation (Sharma et al., 2023).
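A homography maps ground-plane pixels to BEV through a 3x3 matrix applied in homogeneous coordinates. The sketch below applies such a matrix to a set of footprint points; the matrix values are made up for illustration and do not come from any calibration in the cited work.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pts, dtype=np.float64)
    homo = np.hstack([pts, np.ones((len(pts), 1))])  # lift to homogeneous coords
    mapped = homo @ np.asarray(H, dtype=np.float64).T
    return mapped[:, :2] / mapped[:, 2:3]            # divide by the last coordinate

# Hypothetical homography: scale by 2, shift u by 1, no perspective term.
H_example = np.array([[2.0, 0.0, 1.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 1.0]])
warped = warp_points(H_example, [(1.0, 1.0), (0.0, 3.0)])
```

The mapping is only valid for points that actually lie on the ground plane, which is why the text restricts warping to vehicle footprints rather than full vehicle silhouettes.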
2. Architectural families
The BEVTraj literature contains several distinct architectural lineages. A useful way to organize them is by what is predicted directly from BEV and how temporal structure is handled.
| Family | Core BEV representation | Representative works |
|---|---|---|
| Image-to-image regression | Stacked past and future BEV heat-maps | (Izquierdo et al., 2020, Izquierdo et al., 2022) |
| Graph-based ego prediction | BEV object graph with KNN edges | (Sharma et al., 2023) |
| Camera-to-BEV planning | BEV occupancy grid maps from homography-warped masks | (Loukkal et al., 2020) |
| Joint perception-prediction | Surround-view BEV vehicle segmentation plus trajectory head | (Sharma et al., 2023) |
| Direct BEV feature attention | Internal BEV features queried by agent locations | (Gu et al., 2024) |
| Map-free end-to-end multimodal decoding | Dense BEV features with deformable attention and sparse goal proposals | (Kong et al., 12 Sep 2025) |
In raster U-Net systems, the problem is cast as sequence-to-sequence image regression. The input is a tensor of stacked past BEV maps, the output is a stack of future BEV maps, and a U-Net encoder-decoder with skip connections performs the mapping (Izquierdo et al., 2020, Izquierdo et al., 2022). In the PREVENTION study, the best configuration is a U-Net with 6 depth levels and a linear final layer; lane markings were found to produce no improvement in prediction performance (Izquierdo et al., 2022).
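Graph-based methods replace dense convolutions over raster stacks with explicit interaction structure. In the graph enhancement approach, standard Graph Convolutional Network layers propagate information over KNN-defined scene graphs:
$H^{(l+1)} = \sigma\!\left(A\,H^{(l)}\,W^{(l)}\right),$
or, with symmetric normalization,
$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),$
where $A$ is the weighted adjacency matrix, $\tilde{A} = A + I$ adds self-loops, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $H^{(l)}$ stacks the node states at layer $l$, $W^{(l)}$ is a learned weight matrix, and $\sigma$ is a nonlinearity.
The ego-node embedding after the final GCN layer is then decoded by an LSTM over the future horizon (Sharma et al., 2023).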
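A single graph-convolution step with symmetric normalization can be written directly in NumPy. This is a sketch of the standard GCN layer, assuming a symmetric non-negative adjacency matrix and a ReLU nonlinearity; the cited model's exact layer sizes and activation are not specified here, and a directed KNN graph would first need to be symmetrized (for example with an elementwise maximum of `A` and `A.T`).

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A = np.asarray(A, dtype=np.float64)
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    deg = A_tilde.sum(axis=1)               # >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)   # ReLU is an assumed choice

# Two connected nodes with one-hot features and an identity weight matrix.
out = gcn_layer(np.array([[0.0, 1.0], [1.0, 0.0]]), np.eye(2), np.eye(2))
```

After this step each node holds an equal mix of its own and its neighbor's features, which is the averaging behavior the normalization is meant to produce.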
Transformer-based BEV prediction has two major variants in the cited corpus. One uses direct cross-attention from agent queries into internal BEV features, bypassing map decoding and re-encoding. In that setting, multi-head attention is applied from agent-location-derived queries to BEV patch embeddings, and the resulting context is fused with agent-interaction features for trajectory forecasting (Gu et al., 2024). The other, represented by the map-free BEVTraj model, uses deformable attention to sample sparse, context-relevant locations from dense BEV tensors and combines this with a Sparse Goal Candidate Proposal (SGCP) module so that full end-to-end prediction does not require post-processing (Kong et al., 12 Sep 2025).
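The core operation behind deformable attention is sampling a dense feature map at a few continuous locations near a reference point and mixing the samples with learned weights. The sketch below shows only that sampling and mixing step for a single query and a single head. In a real model the offsets and weight logits would be predicted from the query by learned layers; here they are passed in as plain arrays, and all names are illustrative.

```python
import numpy as np

def bilinear_sample(bev, u, v):
    """Bilinearly sample an (H, W, C) BEV feature map at continuous pixel (u, v)."""
    H, W, _ = bev.shape
    u = float(np.clip(u, 0, W - 1))
    v = float(np.clip(v, 0, H - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * bev[v0, u0] + du * bev[v0, u1]
    bot = (1 - du) * bev[v1, u0] + du * bev[v1, u1]
    return (1 - dv) * top + dv * bot

def deformable_sample(bev, ref_uv, offsets, logits):
    """Mix features sampled at ref_uv + offsets using softmax(logits) as weights."""
    logits = np.asarray(logits, dtype=np.float64)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    samples = np.stack([bilinear_sample(bev, ref_uv[0] + du, ref_uv[1] + dv)
                        for du, dv in offsets])
    return w @ samples

# Toy feature map whose two channels store the pixel's own (u, v) coordinates.
vv, uu = np.mgrid[0:4, 0:6]
toy_bev = np.stack([uu, vv], axis=-1).astype(np.float64)
ctx = deformable_sample(toy_bev, (2.5, 1.25), [(0.0, 0.0), (1.0, 0.0)], [0.0, 0.0])
```

Because only a handful of locations are read per query, the cost does not grow with the size of the BEV grid, which is the practical appeal over dense attention.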
A neighboring but distinct line predicts future BEV instance segmentation and flow rather than explicit coordinate trajectories. The efficient transformer-based instance prediction model uses dual heads for segmentation and backward flow; trajectory parameterization is implicit via per-pixel instance IDs and flow propagation (Antunes-García et al., 2024). This suggests that some BEVTraj systems treat forecasting as future occupancy or motion-field prediction, while others decode explicit waypoint sequences.
3. Temporal decoding, multimodality, and goal conditioning
Temporal modeling in BEVTraj is not uniform. In U-Net raster predictors, temporal structure is encoded by stacking several past frames as input channels and predicting several future frames in a single shot (Izquierdo et al., 2020, Izquierdo et al., 2022). This is a sequence-to-sequence formulation without an explicit recurrent decoder. Precise trajectories are then recovered after prediction by peak extraction and matching.
In graph-based ego prediction, temporal evolution is separated from spatial interaction encoding. After graph propagation, the final ego embedding is passed into a multi-layer LSTM unrolled over the 5 future timesteps, and each LSTM hidden state is mapped to a 2D waypoint. The authors explicitly identify this separation of graph interaction modeling and LSTM temporal decoding as a source of stable multi-step forecasts (Sharma et al., 2023).
Modern multimodal systems usually operate through goal or mode hypotheses. In MTR-VP, the future is modeled as a conditional distribution over trajectories given the observed context, and the decoder outputs a fixed set of trajectory hypotheses with per-mode logits (Keskar et al., 27 Nov 2025). In the map-free BEVTraj model, the SGCP module produces a small set of 2D goal coordinates from learnable seeds fused with the target agent state, ranks them with goal scores, and retains the top modes for downstream decoding (Kong et al., 12 Sep 2025). That decoder first predicts initial trajectories and then performs iterative trajectory refinement by repeatedly attending to BEV at predicted waypoint positions and applying learned offsets.
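The ranking step in a goal-proposal module reduces to keeping the highest-scoring candidates. The snippet below is a minimal sketch of that selection only; the function name, the candidate coordinates, and the scores are hypothetical, and the learned parts of SGCP (seed embeddings, fusion with the agent state, score prediction) are not modeled.

```python
import numpy as np

def select_top_goals(goals, scores, k):
    """Keep the k goal candidates with the highest scores, best first."""
    goals = np.asarray(goals, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)[:k]
    return goals[order], scores[order]

# Three hypothetical 2D goal candidates with scores.
top_goals, top_scores = select_top_goals(
    [[0.0, 0.0], [12.0, 1.5], [20.0, -3.0]], [0.1, 0.7, 0.2], k=2)
```

Each retained goal would then seed one trajectory mode in the downstream decoder.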
Goal-conditioned decoding also appears in non-driving BEV trajectory prediction. BiTraP estimates trajectory goals in BEV and uses a bi-directional decoder, with a forward GRU running from current time to horizon and a backward GRU initialized from the predicted goal (Yao et al., 2020). Although the task there is pedestrian forecasting, it illustrates a more general BEVTraj principle: endpoint or mode structure can be more stable than direct autoregressive coordinate prediction over long horizons.
4. Supervision, losses, and decoding from BEV outputs
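The simplest BEVTraj objectives are regression losses on raster or coordinate outputs. Highway U-Net systems minimize mean squared error over all pixels and future BEV channels,
$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N_f\,H\,W}\sum_{k=1}^{N_f}\sum_{u,v}\left(\hat{I}_k(u,v) - I_k(u,v)\right)^2,$
where $\hat{I}_k$ and $I_k$ are the predicted and ground-truth BEV maps for future frame $k$, $N_f$ is the number of future frames, and $H \times W$ is the image size. Equivalently, this is a map-level MSE over occupancy probabilities (Izquierdo et al., 2022, Izquierdo et al., 2020). The graph-enhanced ego model uses a single training objective, the mean squared error between ground-truth and predicted future waypoints over the 5-step horizon (Sharma et al., 2023).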
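Raster methods require an explicit decoding stage to recover object centers. The PREVENTION formulation uses greedy peak-picking above a threshold $\tau$, sub-pixel refinement by a local weighted centroid,
$(\hat{u}, \hat{v}) = \frac{\sum_{(u,v)\in\mathcal{N}} I(u,v)\,(u,v)}{\sum_{(u,v)\in\mathcal{N}} I(u,v)},$
where $\mathcal{N}$ is a window around the detected peak and $I(u,v)$ is the predicted map intensity,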
and then Hungarian matching to associate extracted future centers to known vehicles (Izquierdo et al., 2022). The HighD formulation uses the same general pattern: local maxima search, weighted centroid refinement, window suppression, and Hungarian association (Izquierdo et al., 2020).
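The decoding steps named above (peak search, weighted-centroid refinement, window suppression, association) can be sketched as follows. The threshold, window size, and function names are assumptions for illustration. For the association step, the sketch uses a brute-force search over assignments as a stand-in for the Hungarian algorithm; it gives the same minimum-cost matching but is only practical for a handful of vehicles.

```python
import numpy as np
from itertools import permutations

def extract_centers(heatmap, threshold=0.5, window=2):
    """Greedy peak-picking with weighted-centroid refinement and window suppression."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    work = heatmap.astype(np.float64).copy()
    H, W = work.shape
    centers = []
    while True:
        v, u = np.unravel_index(np.argmax(work), work.shape)
        if work[v, u] < threshold:
            break
        v0, v1 = max(v - window, 0), min(v + window + 1, H)
        u0, u1 = max(u - window, 0), min(u + window + 1, W)
        patch = np.clip(heatmap[v0:v1, u0:u1], 0.0, None)  # ignore negative outputs
        vv, uu = np.mgrid[v0:v1, u0:u1]
        total = patch.sum()
        centers.append((float((patch * uu).sum() / total),
                        float((patch * vv).sum() / total)))
        work[v0:v1, u0:u1] = -np.inf  # suppress the window around the accepted peak
    return centers

def associate(known, detected):
    """Minimum-total-distance assignment of known vehicles to detected centers.

    Brute force over permutations; assumes len(detected) >= len(known).
    Returns a tuple where entry i is the detected index matched to known[i].
    """
    known = np.asarray(known, dtype=np.float64)
    detected = np.asarray(detected, dtype=np.float64)
    cost = np.linalg.norm(known[:, None, :] - detected[None, :, :], axis=-1)
    best, best_cost = None, np.inf
    for perm in permutations(range(len(detected)), len(known)):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return best
```

In practice a polynomial-time assignment solver would replace `associate` once the number of vehicles grows beyond a few.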
Multimodal trajectory predictors typically use likelihood-based objectives. MTR-VP defines a per-mode likelihood for each predicted mode, optimizes a multi-modal NLL loss, and adds a cross-entropy loss on the closest mode to form the total loss (Keskar et al., 27 Nov 2025). The map-free BEVTraj model jointly trains goal proposal, displacement, dense future prediction, and multimodal trajectory decoding through a single combined objective (Kong et al., 12 Sep 2025).
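To make the structure of such an objective concrete, the sketch below combines a negative log-likelihood on the closest mode with a cross-entropy term on the mode logits. It is an illustration under strong assumptions, not the loss of any cited paper: the per-waypoint distribution is taken to be an isotropic 2D Gaussian with fixed `sigma`, the closest mode is chosen by average displacement, and the two terms are summed with equal weight.

```python
import numpy as np

def multimodal_loss(trajs, logits, gt, sigma=1.0):
    """Closest-mode Gaussian NLL plus cross-entropy on the mode logits.

    trajs: (K, T, 2) predicted modes, logits: (K,), gt: (T, 2).
    Returns (loss, index of the closest mode).
    """
    trajs = np.asarray(trajs, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Closest mode by average displacement to the ground truth.
    ade = np.linalg.norm(trajs - gt[None], axis=-1).mean(axis=1)
    k_star = int(np.argmin(ade))
    # NLL of an isotropic 2D Gaussian per waypoint (assumed form), averaged over time.
    sq = ((trajs[k_star] - gt) ** 2).sum(axis=-1)
    nll = (sq / (2.0 * sigma ** 2) + np.log(2.0 * np.pi * sigma ** 2)).mean()
    # Cross-entropy pushing probability mass onto the closest mode.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce = -log_probs[k_star]
    return float(nll + ce), k_star

gt_example = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
loss_value, best_mode = multimodal_loss(
    np.stack([gt_example, gt_example + 5.0]), [0.0, 0.0], gt_example)
```

Only the closest mode receives a regression gradient, which lets the remaining modes stay diverse instead of collapsing onto the mean.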
Self-supervised BEV motion prediction introduces a different training logic. The cross-modality framework supervises BEV motion fields with masked Chamfer distance, piecewise rigidity, and temporal consistency losses, using optical-flow-derived static/dynamic masks and rigid pieces during training, while requiring only LiDAR at inference (Fang et al., 2024). This is not coordinate-sequence forecasting, but it is part of the broader BEV prediction landscape because each BEV cell carries a future 2D displacement.
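The Chamfer term compares a point set displaced by the predicted motion with the point set observed at the next time step. The sketch below implements the plain symmetric Chamfer distance in 2D; the masking by static/dynamic labels used in the cited framework is deliberately left out, and the function name is illustrative.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (N, 2) and Q (M, 2)."""
    P = np.asarray(P, dtype=np.float64)
    Q = np.asarray(Q, dtype=np.float64)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    # Mean nearest-neighbor distance in each direction.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Points at time t moved by a hypothetical predicted flow, compared with time t+1.
points_t = np.array([[0.0, 0.0], [1.0, 0.0]])
flow = np.array([[0.0, 0.0], [0.0, 1.0]])
points_next = np.array([[0.0, 0.0], [1.0, 1.0]])
residual = chamfer_distance(points_t + flow, points_next)
```

No point-to-point correspondence is required, which is what makes the loss usable without motion labels.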
5. Benchmarks, metrics, and empirical findings
The reported empirical landscape is heterogeneous. The cited works use synthetic CARLA data (Sharma et al., 2023), PREVENTION (Izquierdo et al., 2022), HighD (Izquierdo et al., 2020, Sormoli et al., 2023), nuScenes (Fang et al., 2024, Gu et al., 2024, Sharma et al., 2023, Kong et al., 12 Sep 2025), Waymo End-to-End Driving (Keskar et al., 27 Nov 2025), ATG4D (Wang et al., 2020, Fadadu et al., 2020), Argoverse 2 Sensor (Kong et al., 12 Sep 2025), and EgoTraj-Bench (Liu et al., 1 Oct 2025). This suggests that cross-paper numbers should not be compared directly without attention to dataset, horizon, and metric.
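Most of the works above report displacement-based metrics. As a reference point, the sketch below computes minimum average and final displacement errors over a set of predicted modes. It follows one common convention in which the minimum is taken independently for each metric; some benchmarks instead pick a single best mode for both, so the convention should be checked before comparing numbers.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """minADE and minFDE for pred of shape (K, T, 2) against gt of shape (T, 2)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    err = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step displacement
    return float(err.mean(axis=1).min()), float(err[:, -1].min())

gt_traj = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
modes = np.stack([gt_traj + [0.0, 1.0], gt_traj + [0.0, 2.0]])
min_ade, min_fde = min_ade_fde(modes, gt_traj)
```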
Within highway raster prediction, the PREVENTION-based U-Net with 6 depth levels, linear terminal layer, Gaussian vehicle representation, and no lanes reports average and final prediction errors separately for the longitudinal and lateral coordinates at a fixed prediction horizon, with errors lower than those of the baseline method (Izquierdo et al., 2022). The earlier HighD study reports mean absolute longitudinal and lateral errors at two prediction horizons (Izquierdo et al., 2020).
The graph-enhanced ego predictor reports particularly large gains in its CARLA setting. With a ViT backbone, GNN-LSTM obtains a lower MSE than the DNN-LSTM baseline while using fewer parameters. Across all backbones, GNN-LSTM yields MSE that is orders of magnitude smaller with fewer parameters, and positional encoding and KNN edge weights are both found essential (Sharma et al., 2023).
The map-free BEVTraj model is evaluated directly against HD-map methods such as MTR, on both nuScenes and Argoverse 2. The authors state that performance is comparable to state-of-the-art HD map-based models while eliminating dependency on pre-built maps (Kong et al., 12 Sep 2025).
Direct BEV feature attention focuses on speed-accuracy trade-offs. On nuScenes, the HiVT + MapTR combination improves on both reported displacement errors, alongside a change in per-sample inference time. The paper summarizes this as faster inference speeds and more accurate predictions (Gu et al., 2024).
The most explicit cautionary result appears in MTR-VP. On the Waymo End-to-End Driving Dataset, MTR-VP's ADE at the two reported horizons is worse in top-1 ADE than UniPlan and DiffusionLTF on the cited test split. However, top-1 ADE drops substantially as the number of modes increases, and the blank-image ablation indicates that ADE and RFS are nearly identical when images are removed, leading the authors to state that current cross-attention fusion fails to exploit visual cues (Keskar et al., 27 Nov 2025).
6. Limitations, misconceptions, and research directions
A common misconception is that BEV alone resolves perception-prediction coupling. The literature is more specific. BEV removes front-view perspective distortion and simplifies spatial reasoning (Sharma et al., 2023), but it can also lose fine appearance detail from perspective images (Sharma et al., 2023). In the MTR-VP experiments, transformer-based fusion of visual features with past kinematic features is reported as not effective at combining both modes to produce useful scene context embeddings, and blank-image ablation confirms that “vision” is not yet used (Keskar et al., 27 Nov 2025). BEV therefore improves the geometry of the representation, but does not by itself guarantee effective sensor fusion.
Another misconception is that static context channels are always helpful once rasterized into BEV. In the PREVENTION highway study, the use of lane markings was found to produce no improvement in prediction performance (Izquierdo et al., 2022). By contrast, map-free BEVTraj argues that dense BEV features can encode raw semantics such as road paint, curbs, and barriers, and that deformable attention can selectively retrieve the relevant context without relying on vectorized maps (Kong et al., 12 Sep 2025). This suggests that the utility of static context depends on how it is encoded and queried, not only on whether it is present.
Self-supervised and robust-observation variants further expose the limits of current formulations. Cross-modality self-supervision for BEV motion prediction depends on optical-flow-derived supervision signals, and pseudo static/dynamic masks degrade at night or in rain (Fang et al., 2024). EgoTraj-Bench, which grounds noisy first-person visual histories in clean BEV futures, reports that all BEV-trained models degrade in ADE when fed ego-view-noisy history versus clean history, while BiFlow recovers part of the loss (Liu et al., 1 Oct 2025). A plausible implication is that robustness to observation corruption is becoming a central issue for BEVTraj systems, especially when training assumptions differ from deployment-time perception.
The forward directions named in the corpus are correspondingly diverse: probabilistic heads such as Mixture Density Networks, learned BEV reconstruction via encoder-decoder transformers, temporal graph layers in place of explicit LSTMs (Sharma et al., 2023), improved visual-kinematic fusion and spatial grounding losses (Keskar et al., 27 Nov 2025), multi-modal forecasting of occupancy and semantics (Fang et al., 2024), adaptive or multi-scale BEV grids (Kong et al., 12 Sep 2025), and extensions toward unified detection-prediction pipelines (Antunes-García et al., 2024, Kong et al., 12 Sep 2025). Taken together, these directions indicate an active shift from BEV as a convenient rasterization trick toward BEV as a learned, queryable, and increasingly end-to-end predictive state space for autonomous driving and robotic navigation.