Pedestrian Intention Prediction
- Pedestrian intention prediction is a computational task that uses video, sensor, and contextual data to infer imminent crossing behavior, a capability crucial for safer autonomous driving.
- It integrates computer vision, sequential modeling, and sensor fusion, combining visual, kinematic, and scene cues to produce accurate binary crossing/not-crossing classifications.
- Recent advances employ deep architectures such as CNN+RNN, transformers, and graph neural networks to enhance robustness against occlusions, adverse weather, and varied urban scenarios.
Pedestrian intention prediction is the computational task of inferring, from video, sensor, or contextual data, whether a pedestrian is likely to cross the street imminently in front of a vehicle, most commonly within modern Advanced Driver Assistance Systems (ADAS) and autonomous vehicle (AV) pipelines. This field represents a convergence of computer vision, sequential modeling, contextual scene analysis, and real-time embedded inference. Accurately predicting pedestrian intent is central to proactive collision avoidance, adaptive vehicle control, and safe human–machine interaction in urban environments.
1. Problem Formulation and Core Objectives
Pedestrian intention prediction is usually posed as a binary classification problem: given an observation window (a sequence of video frames, kinematic tracks, and multimodal features) up to time $t$, predict whether a target pedestrian will attempt to cross the road within a specified prediction horizon (e.g., 1–2 seconds or a set number of frames) (Varytimidis et al., 2018, Li et al., 25 Nov 2025, Azarmi et al., 2024, Azarmi et al., 2023). Let the input $X = \{x_{t-\tau+1}, \dots, x_t\}$ denote the fused feature sequence, ranging from low-level pixel data to compact, structured attributes:
- Visual features: cropped pedestrian frames, optical flow, body-pose keypoints (Azarmi et al., 2023, Piccoli et al., 2020, Elkammar et al., 5 Jan 2026).
- Kinematic state: bounding-box positions/velocities, centers, trajectory segments (Liu et al., 2 Nov 2025, Bouhsain et al., 2020).
- Scene context: traffic signal state, crosswalk presence, lane count, semantic segmentation (Nia et al., 16 Jul 2025, Li et al., 25 Nov 2025).
- Vehicle context: ego-vehicle speed/acceleration, proximity or time-to-collision (Li et al., 25 Nov 2025, Azarmi et al., 2024).
Mathematically, the aim is to train a model $f_\theta$ such that

$$\hat{y} = f_\theta(X), \qquad \hat{y} \in \{0, 1\},$$

where $\hat{y} = 1$ indicates the pedestrian will cross within the prediction horizon. Temporal variants and forecasting frameworks extend the prediction to multi-step or trajectory-conditioned settings (Bouhsain et al., 2020, Munir et al., 2024).
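A minimal sketch of this formulation, assuming PyTorch; the fused per-frame feature dimension, hidden size, and observation length below are illustrative choices, not values taken from any cited model:

```python
# Minimal sketch: binary crossing-intent classifier f_theta over a fused feature sequence.
# Assumes PyTorch; feature size, hidden size, and window length are illustrative choices.
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # temporal encoder
        self.head = nn.Linear(hidden_dim, 1)                           # crossing logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T_obs, feat_dim) fused feature sequence up to time t
        _, h_last = self.encoder(x)                # h_last: (1, batch, hidden_dim)
        return self.head(h_last[-1]).squeeze(-1)   # logit for P(cross within horizon)

if __name__ == "__main__":
    model = IntentClassifier()
    x = torch.randn(8, 16, 64)                 # 8 pedestrians, 16 observed frames
    p_cross = torch.sigmoid(model(x))          # probabilities in [0, 1]
    y_hat = (p_cross > 0.5).long()             # hard decision: 1 = will cross
    print(p_cross.shape, y_hat.shape)
```

Training such a model typically minimizes binary cross-entropy between the predicted logit and the ground-truth crossing label defined over the prediction horizon.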
2. Sensing Modalities and Feature Engineering
Features for intention prediction span a spectrum from engineered, interpretable descriptors to automatically learned spatio-temporal embeddings. Critical input elements are listed below; a minimal feature-assembly sketch follows the list:
- Bounding boxes and geometric cues: Location and size of pedestrian detection boxes are repeatedly shown to have the largest feature importance (Azarmi et al., 2024), especially at closer distances or in intersection contexts.
- Motion cues: Optical flow, velocity, and higher-order derivatives underpin both early CRF-based (Neogi et al., 2019) and deep learning (Li et al., 25 Nov 2025, Bouhsain et al., 2020) paradigms.
- Body pose/skeletonization: Structures from pose estimators (OpenPose, AlphaPose) enable reasoning about gait phase, direction, and pre-crossing postural cues (Piccoli et al., 2020, Riaz et al., 2024, Elkammar et al., 5 Jan 2026).
- Ego-vehicle dynamics: Speed and its variation, time-to-collision, and proximity rate are critical predictors but may also bias models to vehicle-side cues (Azarmi et al., 2024, Azarmi et al., 5 Jul 2025).
- Scene semantics and environmental context: Crosswalks, traffic lights, road topology, and dynamic signal states fundamentally alter the likelihood of crossing (Nia et al., 16 Jul 2025, Li et al., 25 Nov 2025).
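To make these categories concrete, the following sketch assembles one fused per-frame feature vector; the field names, dimensions, and traffic-light vocabulary are hypothetical, and real pipelines normalize and fuse these cues differently:

```python
# Minimal sketch of assembling one fused per-frame feature vector from the cue
# categories listed above. All field names and dimensions are hypothetical.
import numpy as np

def assemble_frame_features(bbox, bbox_velocity, pose_keypoints,
                            ego_speed, ego_accel, traffic_light_state):
    """Concatenate interpretable cues into a single feature vector.

    bbox:                 (x, y, w, h) of the pedestrian detection
    bbox_velocity:        (dx, dy) change of the box center between frames
    pose_keypoints:       (K, 2) array of 2D body-pose keypoints (e.g., from OpenPose)
    ego_speed, ego_accel: scalars describing ego-vehicle dynamics
    traffic_light_state:  one of {"red", "yellow", "green", "none"}
    """
    light_vocab = ["red", "yellow", "green", "none"]
    light_onehot = np.eye(len(light_vocab))[light_vocab.index(traffic_light_state)]
    return np.concatenate([
        np.asarray(bbox, dtype=np.float32),
        np.asarray(bbox_velocity, dtype=np.float32),
        np.asarray(pose_keypoints, dtype=np.float32).ravel(),
        np.array([ego_speed, ego_accel], dtype=np.float32),
        light_onehot.astype(np.float32),
    ])

# Example: 17 COCO-style keypoints -> 4 + 2 + 34 + 2 + 4 = 46-dimensional vector
feat = assemble_frame_features((120, 80, 40, 110), (1.5, -0.2),
                               np.zeros((17, 2)), 8.3, -0.4, "green")
print(feat.shape)  # (46,)
```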
Synthetic data generation, as enabled by frameworks such as ARCANE and PedSynth, increases the diversity and compositional breadth of input contexts, augmenting real-world datasets for more robust learning (Riaz et al., 2024).
3. Model Architectures and Algorithmic Paradigms
The field has evolved from sequential feature-based probabilistic models to end-to-end deep learning with progressive context integration:
- Sequential probabilistic models: Factored Latent-Dynamic Conditional Random Fields (FLDCRF) jointly model motion, scene and vehicle context, outperforming LSTM baselines in early prediction and parameter efficiency (Neogi et al., 2019).
- Convolutional and recurrent neural networks: CNN+RNN, CNN+GRU, and 3D convnets encode visual and kinematic modalities over time (Liu et al., 2 Nov 2025, Azarmi et al., 2024, Bouhsain et al., 2020, Azarmi et al., 2023).
- Skeleton and pose fusion: Early-fusion of body pose in the image stream substantially improves lead-time and precision over late or combined fusion approaches (Piccoli et al., 2020, Riaz et al., 2024).
- Transformer models: Spatio-temporal transformers (e.g., ViViT, MFT) fuse multiple context streams—including behavior, localization, environment, and vehicle motion—progressively, achieving state-of-the-art on standard datasets with reduced parameter count (Elkammar et al., 5 Jan 2026, Li et al., 25 Nov 2025).
- Graph neural networks (GNNs): Spatio-temporal graph convolutional networks link pedestrians, traffic lights, and vehicles, explicitly integrating discrete signal state and spatial adjacency (Nia et al., 16 Jul 2025). PedGNN exemplifies a lightweight, skeleton-inference graph approach (Riaz et al., 2024); a generic skeleton-graph sketch follows this list.
- Diffusion models: Recent advances exploit diffusion-based generative processes, conditioning on recognized intentions (lateral, longitudinal, or endpoint tokens) to produce multimodal, uncertainty-aware distributions over future paths (Liu et al., 6 Aug 2025, Liu et al., 10 Aug 2025, Liu et al., 2 Nov 2025).
- Vision-language models: Systems leveraging large-scale vision-language foundation models (VLFMs), prompted with hierarchical templates that incorporate vehicle dynamics and posture cues, achieve strong performance and context generalization (Azarmi et al., 5 Jul 2025).
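As a schematic illustration of the skeleton-graph family above (not the PedGNN architecture), the sketch below applies a simple graph convolution over pose keypoints in each frame and a GRU over time; the adjacency matrix and layer sizes are assumptions for demonstration:

```python
# Generic sketch of a skeleton-based spatio-temporal classifier: a simple graph
# convolution over pose keypoints per frame, followed by a GRU over time.
# This is not the PedGNN implementation; adjacency and sizes are illustrative.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-convolution layer: aggregate neighbor features via a fixed,
    row-normalized adjacency matrix, then apply a shared linear transform."""
    def __init__(self, in_dim, out_dim, adjacency: torch.Tensor):
        super().__init__()
        deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1.0)
        self.register_buffer("A_norm", adjacency / deg)   # row-normalized adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                 # x: (batch*T, K, in_dim)
        return torch.relu(self.linear(self.A_norm @ x))

class SkeletonIntentNet(nn.Module):
    def __init__(self, adjacency, num_keypoints=17, hidden=32):
        super().__init__()
        self.gc = SimpleGraphConv(2, hidden, adjacency)    # 2D keypoint coordinates in
        self.gru = nn.GRU(num_keypoints * hidden, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, poses):             # poses: (batch, T, K, 2)
        b, t, k, d = poses.shape
        x = self.gc(poses.reshape(b * t, k, d)).reshape(b, t, -1)
        _, h = self.gru(x)
        return self.head(h[-1]).squeeze(-1)   # crossing logit per pedestrian

# Usage: a self-loop adjacency stands in for the true skeleton connectivity.
A = torch.eye(17)
model = SkeletonIntentNet(A)
print(model(torch.randn(4, 16, 17, 2)).shape)  # torch.Size([4])
```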
4. Contextual, Multimodal, and Hierarchical Fusion
Contemporary SOTA models systematically integrate multimodal features with context-aware or hierarchical fusion strategies to explicitly disentangle pedestrian, vehicle, and environment factors:
- Progressive fusion architectures: For example, MFT first aggregates features within each context (behavior, location, environment, motion) using intra-context attention, then fuses context tokens via cross-context attention, enabling context-guided refinement (Li et al., 25 Nov 2025); a schematic sketch of this fusion pattern follows the list.
- Attention relation networks: ARN-type modules compute scene-object attention (traffic lights, crosswalks, neighbors) conditioned on pedestrian and ego-vehicle state (Yao et al., 2021).
- Self- and cross-attention in multi-modal fusion: Interleaved fusion of local (bounding box, pose) and global (scene parsing, semantic segmentation, optical flow) context is mediated by stacked self-attention or transformer modules (Azarmi et al., 2023, Elkammar et al., 5 Jan 2026, Li et al., 25 Nov 2025).
- Traffic signal and vehicle state integration: Explicit inclusion of dynamic traffic light states, signaled via one-hot or learned embeddings, significantly boosts accuracy and reduces decision latency in urban settings (Nia et al., 16 Jul 2025).
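The following is a schematic sketch of the progressive-fusion idea referenced above, with intra-context self-attention followed by cross-context attention over per-context tokens; it is a generic illustration, not the MFT implementation, and the number of contexts, dimensions, and pooling choices are assumptions:

```python
# Schematic sketch of progressive, context-wise fusion: each context stream
# (behavior, location, environment, ego motion) is first summarized by its own
# self-attention encoder, then the resulting context tokens attend to each other.
# Generic illustration only; not the MFT implementation.
import torch
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    def __init__(self, dim=64, num_contexts=4, heads=4):
        super().__init__()
        self.intra = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_contexts)
        )
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, streams):
        # streams: list of per-context sequences, each (batch, T_i, dim)
        # 1) intra-context attention, then mean-pool each stream to one token
        tokens = [enc(s).mean(dim=1) for enc, s in zip(self.intra, streams)]
        tokens = torch.stack(tokens, dim=1)            # (batch, num_contexts, dim)
        # 2) cross-context attention lets contexts refine one another
        fused, _ = self.cross(tokens, tokens, tokens)  # (batch, num_contexts, dim)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # crossing logit

model = ProgressiveFusion()
streams = [torch.randn(2, 16, 64) for _ in range(4)]   # four hypothetical contexts
print(model(streams).shape)  # torch.Size([2])
```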
5. Evaluation, Feature Importance, and Interpretability
Standard metrics include accuracy, AUC (area under ROC curve), F1-score, ADE (average displacement error), FDE (final displacement error), precision, and recall. Context-aware feature importance analysis has revealed several consistent findings (Azarmi et al., 2024):
| Feature | Global Importance (ΔAcc %) | Variance | Comments |
|---|---|---|---|
| Bounding box | 9.1 | ±1.2 | Most critical, especially at close range and four-way intersections |
| Ego speed | 5.1 | ±2.1 | High utility, but strong vehicle-side bias |
| Local context | 4.7 | ±0.7 | Complements the bounding box, especially in close-range and foggy scenes |
| Pose | 1.3 | ±0.5 | Limited in isolation, occlusion-prone |
Contextual permutation feature importance (CAPFI) uncovers scenario-specific dependencies and reveals that model reliance on ego-vehicle speed may induce driver-side bias, especially in yielding scenarios (deceleration) (Azarmi et al., 2024). Alternatives such as proximity change rate partially mitigate but do not eliminate this bias.
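A generic permutation-importance sketch in the spirit of this analysis (not the CAPFI implementation): shuffling one feature group across evaluation samples and measuring the resulting accuracy drop approximates that group's importance, and restricting the evaluation set to a scenario subset (e.g., only yielding scenes) yields a contextual variant. All names below are hypothetical:

```python
# Generic permutation-importance sketch: the accuracy drop when one feature
# group is shuffled across evaluation samples approximates its importance.
# Restricting the evaluation to a scenario subset (e.g., only yielding scenes)
# gives a contextual variant. Not the CAPFI implementation; names are hypothetical.
import numpy as np

def permutation_importance(predict_fn, features, labels, group_slice, n_repeats=5, seed=0):
    """features: (N, D) evaluation matrix; group_slice selects one feature group."""
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict_fn(features) == labels)
    drops = []
    for _ in range(n_repeats):
        permuted = features.copy()
        perm = rng.permutation(len(features))
        permuted[:, group_slice] = features[perm][:, group_slice]  # break the group's link to labels
        drops.append(base_acc - np.mean(predict_fn(permuted) == labels))
    return float(np.mean(drops)), float(np.std(drops))

# Usage with a dummy model that only looks at the first feature group:
dummy_predict = lambda X: (X[:, 0] > 0).astype(int)
X = np.random.default_rng(1).normal(size=(500, 6))
y = (X[:, 0] > 0).astype(int)
print(permutation_importance(dummy_predict, X, y, group_slice=slice(0, 1)))  # large drop
print(permutation_importance(dummy_predict, X, y, group_slice=slice(1, 3)))  # ~zero drop
```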
Qualitative ablations and visualizations (e.g., via ARN or hierarchical prompts in VLFMs) further highlight that context-aware cues—such as crosswalk presence, signal state, and vehicle speed changes—heavily govern model predictions. Chain-of-thought prompt structures and explicitly causal or time-conscious representations further improve VLFM inference (Azarmi et al., 5 Jul 2025).
6. Robustness: Occlusions, Weather, and Synthetic Data
Robust pedestrian intention prediction must address incomplete or degraded observations:
- Occlusion-aware models: Diffusion models with transformer backbones conditioned on occlusion masks reconstruct missing motion tokens during the reverse process, maintaining intent prediction quality under severe synthetic occlusions of up to 5 frames (Liu et al., 2 Nov 2025); a simple occlusion stress-test sketch follows this list.
- Adverse weather and sensor fusion: Spiking Neural Networks (SNNs) paired with Dynamic Vision Sensors (DVS) outperform standard RGB CNN/RNN pipelines in fog, rain, and low-light conditions, with significant energy savings and minimal inference latency (Sakhai et al., 2024). SNNs still underperform RGB pipelines in clear weather, however, indicating a tradeoff.
- Synthetic dataset augmentation: Synthetic datasets such as PedSynth, generated with the ARCANE framework and providing rich scene configurations, skeleton annotations, and semantic-segmentation ground truth, substantially increase training diversity. When combined with real data for training, they yield F1-score improvements of up to 5–10 points (Riaz et al., 2024).
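A simple occlusion stress test related to the first bullet above can be sketched as follows: it masks a few consecutive observed frames, holds the last visible value, and measures the change in the predicted crossing probability. This is a generic robustness check, not the diffusion-based token reconstruction, and all names are hypothetical:

```python
# Simple occlusion stress test: hide up to `max_gap` consecutive frames of an
# observed feature sequence and fill the gap with the last visible frame, then
# compare model outputs with and without the synthetic occlusion. Generic check
# only; not the diffusion-based token reconstruction described above.
import numpy as np

def occlude_sequence(seq, gap_start, gap_len):
    """seq: (T, D) observed features; returns a copy with frames hidden."""
    occluded = seq.copy()
    last_visible = occluded[max(gap_start - 1, 0)]
    occluded[gap_start:gap_start + gap_len] = last_visible  # hold last observation
    return occluded

def occlusion_sensitivity(predict_fn, seq, max_gap=5):
    """Largest change in predicted crossing probability over all gap positions."""
    base = predict_fn(seq)
    worst = 0.0
    for start in range(1, len(seq) - max_gap):
        worst = max(worst, abs(predict_fn(occlude_sequence(seq, start, max_gap)) - base))
    return worst

# Usage with a dummy predictor (mean of the first feature as a stand-in probability):
dummy_predict = lambda s: float(np.clip(s[:, 0].mean(), 0.0, 1.0))
seq = np.random.default_rng(0).uniform(size=(16, 8))
print(occlusion_sensitivity(dummy_predict, seq, max_gap=5))
```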
7. Current SOTA Performance, Practical Implications, and Future Challenges
Current state-of-the-art performance metrics (JAAD/PIE/Urban-PIP):
| Model | Dataset | Accuracy | AUC | F1 | Lead Time | Notable Insights |
|---|---|---|---|---|---|---|
| ViViT/Transformer (Elkammar et al., 5 Jan 2026) | JAAD_all | 0.86 | 0.772 | 0.61 | ~1–2 s | 6–15× smaller than prior SOTA |
| MFT (Li et al., 25 Nov 2025) | JAAD_all | 0.93 | 0.97 | 0.83 | ≤2 s | Progressive multimodal fusion |
| TA-STGCN (Nia et al., 16 Jul 2025) | PIE | 0.85 | — | 0.87 | 15 frames | Explicit traffic signal modeling |
| PIP-Net (Azarmi et al., 2024) | PIE | 0.91 | 0.90 | 0.84 | up to 4 s | Depth/flow/context, 3-cam fusion |
| PTINet (Munir et al., 2024) | PIE | 0.98 | — | 0.97 | 1 s | Multi-task intention+trajectory |
| PedGNN (Riaz et al., 2024) | JAAD | ~0.85 | — | 0.85 | 0.6 s | 27 KB model, <1 ms inference |
Advances in joint intention-trajectory multitask models, explicit context fusion, and robust handling of occlusion and adverse weather demonstrate convergence toward robust, deployable systems (Munir et al., 2024, Liu et al., 2 Nov 2025, Sakhai et al., 2024). However, persistent challenges include:
- Reducing reliance on vehicle-side cues to ensure prediction stems from pedestrian-side features, not just vehicle yielding (Azarmi et al., 2024).
- Generalizing across scene configurations, weather, and sensing modalities, as highlighted by synthetic data results (Riaz et al., 2024, Sakhai et al., 2024).
- Incorporating richer, fine-grained semantic intentions (e.g., stopping, hesitating, turning) and probabilistic multi-modal outputs (Liu et al., 6 Aug 2025, Liu et al., 10 Aug 2025).
- Achieving real-time, high-confidence prediction in dense, multi-agent urban scenes with diverse occlusion and edge-case behaviors.
Ongoing research emphasizes context-oriented feature integration, interpretability (via CAPFI and attention visualizations), data diversity, and efficient, scalable architectures suitable for AV/ADAS hardware integration.