Pedestrian Intention Prediction

Updated 12 January 2026
  • Pedestrian intention prediction is a computational task that uses video, sensor, and contextual data to infer imminent crossing behavior, which is crucial for safer autonomous driving.
  • It integrates computer vision, sequential modeling, and sensor fusion, incorporating visual, kinematic, and scene cues to produce accurate binary crossing decisions.
  • Recent advances employ deep architectures such as CNN+RNN, transformers, and graph neural networks to enhance robustness against occlusions, adverse weather, and varied urban scenarios.

Pedestrian intention prediction is the computational task of inferring, from video, sensor, or contextual data, whether a pedestrian is likely to cross the street imminently in front of a vehicle, most commonly within modern Advanced Driver Assistance Systems (ADAS) and autonomous vehicle (AV) pipelines. This field represents a convergence of computer vision, sequential modeling, contextual scene analysis, and real-time embedded inference. Accurately predicting pedestrian intent is central to proactive collision avoidance, adaptive vehicle control, and safe human–machine interaction in urban environments.

1. Problem Formulation and Core Objectives

Pedestrian intention prediction is usually phrased as a binary classification problem: given an observation window (a sequence of video frames, kinematic tracks, or multimodal features) up to time $t$, predict whether a target pedestrian will attempt to cross the road within a specified prediction horizon (e.g., 1–2 seconds or a set number of frames) (Varytimidis et al., 2018, Li et al., 25 Nov 2025, Azarmi et al., 2024, Azarmi et al., 2023). Let the input $X$ denote the fused feature sequence, which may range from low-level pixel data to compact, structured attributes.

Mathematically, the aim is to train a model $f_\theta$ such that

$$y = f_\theta(X) \in \{0, 1\},$$

where $y = 1$ indicates the pedestrian will cross within the prediction horizon. Temporal variants and forecasting frameworks extend the prediction to multi-step or trajectory-conditioned settings (Bouhsain et al., 2020, Munir et al., 2024).
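
To make the formulation concrete, the sketch below implements $f_\theta$ as a recurrent encoder over the observation window with a sigmoid decision head. The feature dimension, window length, and GRU backbone are illustrative assumptions, not the design of any cited model.

```python
# Minimal sketch of the binary formulation y = f_theta(X) in {0, 1}.
# Dimensions and the GRU backbone are illustrative assumptions.
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """f_theta: maps an observation window of per-frame features to a crossing logit."""
    def __init__(self, feat_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # logit for the "will cross" class

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T_obs, feat_dim), fused features up to time t
        _, h_last = self.encoder(x)            # h_last: (1, batch, hidden_dim)
        return self.head(h_last.squeeze(0))    # (batch, 1) crossing logit

model = IntentClassifier()
window = torch.randn(8, 16, 32)                # 8 pedestrians, 16-frame windows
prob_cross = torch.sigmoid(model(window))      # P(cross within horizon)
y_hat = (prob_cross > 0.5).long()              # hard decision y in {0, 1}
```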

2. Sensing Modalities and Feature Engineering

Features for intention prediction span a spectrum from engineered, interpretable descriptors to automatically learned spatio-temporal embeddings. Critical input elements include pedestrian bounding-box tracks, body pose keypoints, ego-vehicle speed and kinematics, and local and global scene context (e.g., semantic segmentation and optical flow).

Synthetic data generation, as enabled by frameworks such as ARCANE and the PedSynth dataset it produces, increases the diversity and compositional breadth of input contexts, augmenting real-world datasets for more robust learning (Riaz et al., 2024).
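
As a concrete illustration of such fused inputs, the sketch below concatenates the cue families discussed throughout this article (bounding box, pose, ego-vehicle speed, local context) into one per-frame vector; the specific fields, layouts, and sizes are assumptions, not a prescribed format.

```python
# Hypothetical per-frame feature assembly; field choices and dimensions are
# illustrative assumptions, and real pipelines vary by dataset and model.
import numpy as np

def assemble_frame_features(bbox, pose_xy, ego_speed, context_emb):
    """Concatenate engineered and learned cues into one fused vector.

    bbox:        (4,)    pedestrian box [x, y, w, h], image-normalized
    pose_xy:     (17, 2) body keypoints (e.g., COCO layout), normalized
    ego_speed:   scalar  ego-vehicle speed in m/s
    context_emb: (D,)    learned local-context embedding (e.g., a CNN crop code)
    """
    return np.concatenate([
        np.asarray(bbox, dtype=np.float32),
        np.asarray(pose_xy, dtype=np.float32).ravel(),
        np.array([ego_speed], dtype=np.float32),
        np.asarray(context_emb, dtype=np.float32),
    ])

# A 16-frame observation window stacked into X with shape (T_obs, feat_dim):
frames = [assemble_frame_features(bbox=[0.4, 0.5, 0.05, 0.12],
                                  pose_xy=np.zeros((17, 2)),
                                  ego_speed=8.3,
                                  context_emb=np.zeros(64))
          for _ in range(16)]
X = np.stack(frames)  # input sequence for the classifier f_theta of Section 1
```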

3. Model Architectures and Algorithmic Paradigms

The field has evolved from sequential, feature-based probabilistic models to end-to-end deep learning with progressive context integration, moving from CNN+RNN pipelines toward transformer and graph neural network architectures.
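
As a concrete instance of the transformer end of this progression, the sketch below encodes per-frame feature tokens together with a learned classification token via self-attention. The sizes are illustrative assumptions, and positional encodings are omitted for brevity.

```python
# Minimal transformer-based intent classifier; sizes are illustrative
# assumptions, and positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class TransformerIntent(nn.Module):
    def __init__(self, feat_dim=32, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [CLS]
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (B, T_obs, feat_dim)
        tok = self.embed(x)                     # (B, T_obs, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tok], dim=1))
        return self.head(z[:, 0])               # crossing logit from [CLS]
```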

4. Contextual, Multimodal, and Hierarchical Fusion

Contemporary state-of-the-art (SOTA) models systematically integrate multimodal features with context-aware or hierarchical fusion strategies to explicitly disentangle pedestrian, vehicle, and environment factors:

  • Progressive fusion architectures: E.g., MFT first aggregates features within each context (behavior, location, environment, motion) using intra-context attention, then fuses context tokens via cross-context attention, enabling context-guided refinement (Li et al., 25 Nov 2025).
  • Attention relation networks: ARN-type modules compute scene-object attention (traffic lights, crosswalks, neighbors) conditioned on pedestrian and ego-vehicle state (Yao et al., 2021).
  • Self- and cross-attention in multi-modal fusion: Interleaved fusion of local (bounding box, pose) and global (scene parsing, semantic segmentation, optical flow) context is mediated by stacked self-attention or transformer modules (Azarmi et al., 2023, Elkammar et al., 5 Jan 2026, Li et al., 25 Nov 2025); a minimal sketch follows this list.
  • Traffic signal and vehicle state integration: Explicit inclusion of dynamic traffic light states, signaled via one-hot or learned embeddings, significantly boosts accuracy and reduces decision latency in urban settings (Nia et al., 16 Jul 2025).
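
A minimal sketch of the cross-attention pattern referenced above fuses local pedestrian tokens with global scene tokens; the module layout and dimensions are illustrative assumptions, not the design of any cited architecture.

```python
# Sketch of cross-attention fusion between local cues (bbox, pose) and global
# scene context (segmentation, optical flow); sizes are assumptions.
import torch
import torch.nn as nn

class CrossContextFusion(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, local_tok, global_tok):
        # local_tok:  (B, T, d) per-frame pedestrian tokens (bbox, pose)
        # global_tok: (B, S, d) scene tokens (segmentation, optical flow)
        # Local tokens query the scene: which surroundings matter for crossing?
        fused, attn_w = self.cross_attn(query=local_tok, key=global_tok,
                                        value=global_tok)
        return self.norm(local_tok + fused), attn_w  # residual add + weights

fusion = CrossContextFusion()
local = torch.randn(8, 16, 64)                 # 16-frame pedestrian tokens
scene = torch.randn(8, 50, 64)                 # 50 scene tokens per sample
fused, weights = fusion(local, scene)          # weights aid interpretability
```

The returned attention weights are the kind of quantity visualized in the interpretability analyses of Section 5.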

5. Evaluation, Feature Importance, and Interpretability

Standard metrics include accuracy, AUC (area under ROC curve), F1-score, ADE (average displacement error), FDE (final displacement error), precision, and recall. Context-aware feature importance analysis has revealed several consistent findings (Azarmi et al., 2024):

| Feature | Global importance (ΔAcc, %) | Variance | Comments |
|---|---|---|---|
| Bounding box | 9.1 | ±1.2 | Most critical, especially in close/four-way scenes |
| Ego speed | 5.1 | ±2.1 | High utility, but strong vehicle-side bias |
| Local context | 4.7 | ±0.7 | Complements the bounding box, especially in close/fog scenes |
| Pose | 1.3 | ±0.5 | Limited in isolation, occlusion-prone |

Contextual permutation feature importance (CAPFI) uncovers scenario-specific dependencies and reveals that model reliance on ego-vehicle speed may induce driver-side bias, especially in yielding scenarios (deceleration) (Azarmi et al., 2024). Alternatives such as proximity change rate partially mitigate but do not eliminate this bias.
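
A minimal sketch of this permutation-based analysis is given below: one feature group is shuffled within each scenario subset and the resulting accuracy drop is recorded. The grouping scheme and scenario labels are assumptions for illustration, not the exact CAPFI procedure.

```python
# CAPFI-style contextual permutation feature importance: shuffle one feature
# group within each scenario subset and measure the accuracy drop.
import numpy as np

def contextual_permutation_importance(predict, X, y, group_cols, scenarios):
    """Return per-scenario DeltaAcc when the columns in group_cols are permuted.

    predict:   callable mapping X (N, F) -> predicted labels (N,)
    scenarios: (N,) array of scenario ids (e.g., 'close', 'fog', 'four-way')
    """
    rng = np.random.default_rng(0)
    importance = {}
    for s in np.unique(scenarios):
        idx = np.flatnonzero(scenarios == s)
        base_acc = (predict(X[idx]) == y[idx]).mean()
        X_perm = X[idx].copy()
        for c in group_cols:                    # break the feature-label link
            X_perm[:, c] = rng.permutation(X_perm[:, c])
        perm_acc = (predict(X_perm) == y[idx]).mean()
        importance[s] = base_acc - perm_acc     # importance = accuracy drop
    return importance
```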

Qualitative ablations and visualizations (e.g., via ARN attention maps or hierarchical prompts in vision-language foundation models, VLFMs) further highlight that context-aware cues, such as crosswalk presence, signal state, and vehicle speed changes, heavily govern model predictions. Chain-of-thought prompt structures and explicitly causal or time-conscious representations additionally improve VLFM inference (Azarmi et al., 5 Jul 2025).
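
As one hedged illustration of such prompting, the template below walks a vision-language model through posture, context, and a final binary answer; the wording and fields are hypothetical, not taken from the cited work.

```python
# Hypothetical chain-of-thought prompt template for a VLFM; the wording and
# field names are illustrative, not from any cited paper.
def build_intent_prompt(signal_state: str, near_crosswalk: bool,
                        ego_speed_trend: str) -> str:
    return (
        "You observe a pedestrian at the curb in the attached frames.\n"
        f"Context: the traffic signal is {signal_state}; "
        f"crosswalk nearby: {near_crosswalk}; "
        f"the ego vehicle is {ego_speed_trend}.\n"
        "Step 1: Describe the pedestrian's posture and motion over time.\n"
        "Step 2: Relate them to the signal state and crosswalk position.\n"
        "Step 3: Answer with exactly one word: 'cross' or 'not-cross'."
    )

print(build_intent_prompt("red", True, "decelerating"))
```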

6. Robustness: Occlusions, Weather, and Synthetic Data

Robust pedestrian intention prediction must address incomplete or degraded observations:

  • Occlusion-aware models: Diffusion models with transformer backbones conditioned on occlusion masks reconstruct missing motion tokens during the reverse process, maintaining intent prediction quality under severe synthetic occlusions of up to 5 frames (Liu et al., 2 Nov 2025); a sketch of this occlusion protocol follows the list.
  • Adverse weather and sensor fusion: Spiking Neural Networks (SNNs) paired with Dynamic Vision Sensors (DVS) outperform standard RGB CNN/RNN pipelines in fog, rain, and low light, with significant energy savings and minimal inference latency (Sakhai et al., 2024). SNNs still underperform RGB pipelines in clear weather, however, indicating a tradeoff.
  • Synthetic dataset augmentation: Synthetic datasets such as PedSynth, generated by ARCANE with rich scene configuration, skeleton, and semantic segmentation ground truth, substantially enhance training diversity. When combined with real data for training, they yield F1-score improvements of up to 5–10 points (Riaz et al., 2024).
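
The sketch below illustrates the synthetic-occlusion protocol mentioned in the first bullet: up to five consecutive frames of a motion-token sequence are masked, and the mask is returned so a model can be conditioned on it. The token shapes and masking scheme are assumptions for illustration.

```python
# Synthetic occlusion for robustness training/evaluation: zero out up to
# max_len consecutive frames and return the boolean mask. Details assumed.
import torch

def occlude(tokens: torch.Tensor, max_len: int = 5):
    """tokens: (T, D) per-frame motion tokens. Returns (occluded, mask)."""
    T = tokens.size(0)
    length = int(torch.randint(1, max_len + 1, (1,)))     # occlusion length
    start = int(torch.randint(0, T - length + 1, (1,)))   # occlusion onset
    mask = torch.zeros(T, dtype=torch.bool)
    mask[start:start + length] = True                     # True = occluded
    occluded = tokens.clone()
    occluded[mask] = 0.0                                  # drop observations
    return occluded, mask

seq = torch.randn(16, 32)
occ, m = occlude(seq)  # evaluate intent prediction quality under occlusion
```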

7. Current SOTA Performance, Practical Implications, and Future Challenges

Current state-of-the-art performance metrics (JAAD/PIE/Urban-PIP):

| Model | Dataset | Accuracy | AUC | F1 | Lead time | Notable insights |
|---|---|---|---|---|---|---|
| ViViT/Transformer (Elkammar et al., 5 Jan 2026) | JAAD_all | 0.86 | 0.772 | 0.61 | ~1–2 s | 6–15× smaller than prior SOTA |
| MFT (Li et al., 25 Nov 2025) | JAAD_all | 0.93 | 0.97 | 0.83 | ≤2 s | Progressive multimodal fusion |
| TA-STGCN (Nia et al., 16 Jul 2025) | PIE | 0.85 | — | 0.87 | 15 frames | Explicit traffic signal modeling |
| PIP-Net (Azarmi et al., 2024) | PIE | 0.91 | 0.90 | 0.84 | Up to 4 s | Depth/flow/context, 3-camera fusion |
| PTINet (Munir et al., 2024) | PIE | 0.98 | — | 0.97 | 1 s | Multi-task intention + trajectory |
| PedGNN (Riaz et al., 2024) | JAAD | ~0.85 | — | 0.85 | 0.6 s | 27 KB model, <1 ms inference |

Advances in joint intention-trajectory multitask models, explicit context fusion, and robust handling of occlusion and adverse weather demonstrate convergence toward robust, deployable systems (Munir et al., 2024, Liu et al., 2 Nov 2025, Sakhai et al., 2024). However, persistent challenges include:

  • Reducing reliance on vehicle-side cues to ensure prediction stems from pedestrian-side features, not just vehicle yielding (Azarmi et al., 2024).
  • Generalizing across scene configurations, weather, and sensing modalities, as highlighted by synthetic data results (Riaz et al., 2024, Sakhai et al., 2024).
  • Incorporating richer, fine-grained semantic intentions (e.g., stopping, hesitating, turning) and probabilistic multi-modal outputs (Liu et al., 6 Aug 2025, Liu et al., 10 Aug 2025).
  • Achieving real-time, high-confidence prediction in dense, multi-agent urban scenes with diverse occlusion and edge-case behaviors.

Ongoing research emphasizes context-oriented feature integration, interpretability (via CAPFI and attention visualizations), data diversity, and efficient, scalable architectures suitable for AV/ADAS hardware integration.
