
Joint Attention in Autonomous Driving

Updated 11 December 2025
  • Joint attention in autonomous driving is the process where drivers, pedestrians, and cyclists use gaze and gestures to coordinate movement and negotiate right-of-way.
  • The JAAD dataset offers annotated video clips from diverse urban scenarios, providing detailed behavioral and bounding-box data for analyzing attention cues.
  • Modeling techniques such as LSTM, Transformer, and Bayesian networks leverage these cues to predict pedestrian intent accurately and support real-time decision-making.

Joint attention in autonomous driving (JAAD) refers to the process in which drivers—human or autonomous—and other road users, primarily pedestrians and cyclists, establish a mutual and intentional exchange of observable attention cues such as gaze, gestures, and speed modulations to coordinate navigation and avoid collisions in dynamic environments. The JAAD paradigm extends beyond object detection, requiring autonomous systems to perceive, interpret, and reciprocate non-verbal social cues that underlie the negotiation of right-of-way in real-world traffic. Central to the study of JAAD is the Joint Attention in Autonomous Driving dataset (JAAD), designed to support the modeling, analysis, and evaluation of these two-way attention-driven interactions in diverse urban scenarios (Kotseruba et al., 2016).

1. Conceptual Foundations and Theoretical Models

Joint attention in the traffic context draws from developmental psychology, particularly models by Scaife & Bruner and Baron-Cohen, which decompose attention-sharing into distinct sub-processes: intentional detection (recognition of agent and volition), eye-direction detection (inferring another’s gaze), shared-attention mechanisms (binding self, other, and focal object/event), and higher-level theory-of-mind processes for inferring goals and intentions. Within urban driving, these mechanisms materialize as a multimodal communication protocol: a pedestrian making or avoiding eye contact, varying their walking speed, or signaling, and a driver observing, interpreting, and choosing to yield or maintain course accordingly (Rasouli et al., 2018).

A representative formal framework introduces a dynamic Bayesian network over latent and observed states (gaze, pose, environment, joint attention state $T_t$, intention $I_t$), supporting probabilistic inference of intention conditioned on observed cues:

$$P(I = 1 \mid \mathbf{x}) = \sigma\left(w_g G + w_p P + w_c C\right)$$

where $G$ is gaze, $P$ is pose, $C$ denotes context features, $w_k$ are learned weights, and $\sigma$ is the sigmoid activation (Rasouli et al., 2018).
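As a concrete illustration of this inference step, the minimal sketch below evaluates the logistic form with hypothetical cue values and weights; the feature definitions and weight magnitudes are assumptions for illustration, not the learned parameters of Rasouli et al. (2018).

```python
import numpy as np

def crossing_intent_probability(gaze, pose, context, w_g, w_p, w_c):
    """P(I=1 | x) = sigmoid(w_g*G + w_p*P + w_c*C) for one observation.

    gaze:    e.g., fraction of recent frames in which the pedestrian looks at the ego-vehicle
    pose:    e.g., body orientation toward the roadway, normalized to [0, 1]
    context: e.g., proximity to a crossing point, normalized to [0, 1]
    The weights are placeholders; in practice they are learned from annotated sequences.
    """
    z = w_g * gaze + w_p * pose + w_c * context
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example: the pedestrian looked at the car in 80% of recent frames,
# is oriented toward the road, and stands near an unmarked crossing.
p_cross = crossing_intent_probability(gaze=0.8, pose=0.9, context=0.5,
                                      w_g=2.0, w_p=1.5, w_c=0.7)
print(f"P(crossing intent) = {p_cross:.2f}")
```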

2. The JAAD Dataset: Structure and Annotation

The JAAD dataset is a foundational resource, comprising 346 high-resolution (1920×1080 or 1280×720) video clips (5–15 s each, extracted from ≈240 hours of driving footage), recorded with front-facing windshield cameras across diverse geographic locations (Canada, Ukraine, Germany, USA), spanning urban, suburban, and occasional rural environments. The dataset reflects a wide range of conditions including variable weather (clear ≈60 %, rain/snow ≈25 %, night/sunrise ≈10 %), demographics (child, young, adult, senior, ≈70 % adult), and traffic scenarios (marked and unmarked crosswalks, jaywalking) (Kotseruba et al., 2016, Rasouli et al., 2017).

JAAD’s dual-layer annotation schema enables detailed study of joint attention:

  • Bounding-box annotations: Per-frame for all relevant dynamic entities, providing (x, y, width, height), with occlusion flags.
  • Behavioral (event-based) annotations: Scene-level variables (weather, crossing type, age/gender) and fine-grained per-subject, time-stamped behavioral events, distinguishing between state events (e.g., Crossing, Looking, Slow_down) and point events (e.g., Look, Handwave).

This structure enables the analysis of attention event timing, duration, and their correlation with contextual and demographic variables. Representative metrics include Intersection-over-Union (IoU) for bounding-box accuracy, event-sequence correlations (e.g., reaction latency), and gaze-alignment probability, computed as the proportion of frames with mutual attention states (e.g., pedestrian Looking at car and car Slow_down) (Kotseruba et al., 2016).
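To make these metrics concrete, the sketch below computes per-frame IoU on (x, y, width, height) boxes and a gaze-alignment probability over simplified per-frame records; the record layout and event names mirror the behavioral tags described above but are placeholders, not the actual JAAD annotation file format.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for boxes given as (x, y, width, height)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def gaze_alignment_probability(frames):
    """Fraction of frames in which both sides attend to each other.

    `frames` is a simplified per-frame record list, e.g.
    [{"pedestrian": {"Looking"}, "driver": {"Slow_down"}}, ...];
    the real JAAD annotations are time-stamped behavioral events.
    """
    mutual = sum(1 for f in frames
                 if "Looking" in f["pedestrian"] and "Slow_down" in f["driver"])
    return mutual / len(frames) if frames else 0.0
```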

3. Behavioral Taxonomy, Metrics, and Statistical Findings

Huang et al. (2023) propose a hierarchical taxonomy reflecting the levels of pedestrian behavior relevant to joint attention:

  • Intent: Long-term, unobserved (e.g., destination, route).
  • Action: Low-level, short-term, observable (e.g., crossing/not crossing with respect to the ego-vehicle).
  • Motion: Raw trajectory and speed profile.

JAAD treats “crossing/not-crossing” as the key action label for joint attention: each label encodes whether the pedestrian has, in effect, acknowledged the ego vehicle and decided to act (Huang et al., 2023).

Analyses of the JAAD data reveal:

  • In more than 90 % of non-signalized crossings, pedestrians perform a primary attention event (looking or glancing) before entering the roadway.
  • The probability of crossing without attention strongly depends on time-to-collision (TTC); such events are rare except when TTC > 6 s and nearly absent at TTC < 2 s or on wide streets (Rasouli et al., 2017).
  • Attention duration correlates linearly with TTC up to a threshold dependent on age (adults ≈7 s, elderly ≈8 s).
  • Joint attention is modulated by crosswalk design: at non-designated crossings, yielding is contingent on driver responses to pedestrian gaze, while designated and signalized crossings reduce reliance on mutual attention.

These findings underpin model design: head pose, gaze, and body posture, as well as scene context and explicit driver reactions, are necessary features for reliable intention prediction and for capturing the nuanced process of joint attention.
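As a rough worked example of the TTC dependence noted above, the sketch below uses a first-order TTC estimate (longitudinal gap divided by closing speed); the speeds and distances are hypothetical, and the formula is a simplification of how TTC is derived from the annotated data.

```python
def time_to_collision(distance_m, ego_speed_mps, ped_closing_speed_mps=0.0):
    """Simple TTC estimate: longitudinal gap divided by total closing speed.

    This is a first-order approximation; the JAAD analyses derive TTC from
    annotated vehicle speed and pedestrian position, not from this formula.
    """
    closing = ego_speed_mps + ped_closing_speed_mps
    return float("inf") if closing <= 0 else distance_m / closing

# Hypothetical reading of the reported thresholds: at 50 km/h (~13.9 m/s),
# TTC > 6 s corresponds to a gap of more than ~83 m, the regime in which
# crossings without a prior attention event were observed at all.
print(time_to_collision(distance_m=83.0, ego_speed_mps=13.9))  # ~5.97 s
```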

4. Modeling Approaches and System Architectures

Several architectural paradigms leverage the JAAD dataset to model joint attention for autonomous driving systems:

  • LSTM-Based Multi-Task Learning: The PV-LSTM encoder-decoder architecture employs two parallel LSTM encoders (position, velocity) whose outputs are concatenated into a shared latent representation and decoded into future pedestrian bounding-box trajectories and crossing intentions. The loss combines bounding-box velocity regression (MSE) with binary cross-entropy for intention, forming a multitask objective. This model achieves 91.48 % crossing classification accuracy in the joint task and 75.2 % AIOU for box prediction, with an inference speed exceeding 200 FPS (Bouhsain et al., 2020). A minimal sketch of this two-stream design appears after this list.
  • Transformer-Based Architectures: Huang et al. propose a multi-task sequence-to-sequence Transformer (TF-ed) using only ego-vehicle camera-derived position and speed sequences as input. Positional encodings and self-attention allow for robust short-term action prediction, achieving 81 % crossing action accuracy, outperforming LSTM baselines by 7.4 %. The trade-off is higher trajectory error over long prediction horizons (Huang et al., 2023).
  • Dynamic Bayesian Networks and Logistic Regression: Formalization of joint attention in probabilistic graphical models enables fusion of gaze, pose, context, and behavior in estimating intention. Empirical results include a joint-attention detection true-positive rate of 84 % with a 7 % false-alarm rate, and intent inference ROC AUC of 0.92, with 88 % 1 s-ahead prediction accuracy (Rasouli et al., 2018).
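Below is a minimal PyTorch sketch of the two-stream, multi-task idea behind PV-LSTM referenced in the first item; hidden sizes, the repeated-latent decoding scheme, and the prediction horizon are illustrative assumptions rather than the published configuration of Bouhsain et al. (2020).

```python
import torch
import torch.nn as nn

class PVLSTMSketch(nn.Module):
    """Illustrative two-stream LSTM encoder-decoder in the spirit of PV-LSTM."""

    def __init__(self, hidden=256, horizon=15):
        super().__init__()
        self.horizon = horizon
        self.pos_enc = nn.LSTM(input_size=4, hidden_size=hidden, batch_first=True)
        self.vel_enc = nn.LSTM(input_size=4, hidden_size=hidden, batch_first=True)
        self.vel_dec = nn.LSTM(input_size=2 * hidden, hidden_size=hidden, batch_first=True)
        self.vel_head = nn.Linear(hidden, 4)         # future box-velocity regression
        self.intent_head = nn.Linear(2 * hidden, 1)  # crossing / not-crossing logit

    def forward(self, positions, velocities):
        # positions, velocities: (batch, T_obs, 4) bounding-box states
        _, (h_p, _) = self.pos_enc(positions)
        _, (h_v, _) = self.vel_enc(velocities)
        latent = torch.cat([h_p[-1], h_v[-1]], dim=-1)       # shared representation
        dec_in = latent.unsqueeze(1).repeat(1, self.horizon, 1)
        dec_out, _ = self.vel_dec(dec_in)
        future_vel = self.vel_head(dec_out)                   # (batch, horizon, 4)
        crossing_logit = self.intent_head(latent)             # (batch, 1)
        return future_vel, crossing_logit

# Multitask objective (schematically):
# loss = mse(future_vel, target_vel) + bce_with_logits(crossing_logit, crossing_label)
```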

Architectures often entail perception (object detection, tracking, pose/gaze estimation), attention selection (saliency, relevance filtering), joint-attention event detection, intention inference (probabilistic modeling), and behavior planning modules—facilitating a closed loop from perception to action.
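A skeleton of such a closed-loop arrangement might look as follows; every interface here (detector, intent_model, planner, the observation fields) is a hypothetical placeholder used only to show how the modules chain together, not an API from any cited system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoadUserObservation:
    track_id: int
    box: tuple            # (x, y, w, h)
    gaze_at_ego: bool
    pose_toward_road: bool

def joint_attention_pipeline(frame, detector, intent_model, planner):
    """Skeleton of the perception-to-planning loop described above."""
    observations: List[RoadUserObservation] = detector(frame)        # perception
    relevant = [o for o in observations if o.pose_toward_road]       # attention selection
    mutual = [o for o in relevant if o.gaze_at_ego]                  # joint-attention events
    intents = {o.track_id: intent_model(o) for o in relevant}        # intention inference
    return planner(intents, mutual)                                  # behavior planning
```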

5. Evaluation Protocols and Empirical Performance

Evaluation on JAAD employs standardized train/test splits (e.g., 300/46 videos for training/testing in (Bouhsain et al., 2020)), with per-sequence uniform resampling and cross-dataset tests (e.g., Citywalks). Key performance metrics include:

  • Average Displacement Error (ADE): Lower is better; e.g., PV-LSTM: 9.19 px.
  • Final Displacement Error (FDE): Lower is better; PV-LSTM: 15.22 px.
  • Average IoU (AIOU): Higher is better; PV-LSTM: 75.2 %.
  • Classification accuracy: e.g., multi-task PV-LSTM: 91.48 %.
  • Real-time inference: PV-LSTM achieves >200 FPS.

These benchmarks consistently demonstrate the feasibility of joint attention modeling with camera-only inputs. Notably, explicit scene features (e.g., extracted with ResNet-50) yield only marginal improvements, indicating that most of the predictive performance stems from accurate action/trajectory encoding (Bouhsain et al., 2020).
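For reference, the displacement metrics above can be computed as in the sketch below, assuming predictions and ground truth are given as per-frame bounding-box centers in pixels (AIOU would average a standard IoU over the same frames).

```python
import numpy as np

def ade_fde(pred_centers, gt_centers):
    """Average and Final Displacement Error in pixels.

    pred_centers, gt_centers: arrays of shape (T_pred, 2) holding the
    (x, y) centers of predicted and ground-truth bounding boxes.
    """
    errors = np.linalg.norm(np.asarray(pred_centers, dtype=float)
                            - np.asarray(gt_centers, dtype=float), axis=1)
    return errors.mean(), errors[-1]

# Hypothetical 3-step horizon:
# ade, fde = ade_fde([[10, 10], [12, 11], [15, 13]], [[10, 10], [13, 11], [17, 14]])
```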

6. Challenges, Contextual Dependencies, and Prospects

Weather, lighting, location, demographic variation, and traffic context substantially affect both annotation fidelity and model performance. For example, occlusion (umbrellas, hoods), dynamic range limits (night, glare), or dense downtown traffic alter both the observability and salience of cues such as gaze or body orientation (Kotseruba et al., 2016). Senior pedestrians display longer attention and slower crossings, which suggests the need for demographic-aware modeling.

Current limitations include degradation of long-horizon box forecasts, lack of explicit modeling for static scene features (e.g., crosswalk geometry), and susceptibility to detection and tracking disruptions. The dataset’s ego-centric, camera-only basis, while robust for low-level action (crossing/not crossing), ultimately restricts generalization in non-standard scenarios; integration of further context priors and multimodal sensing is an active area for future research (Huang et al., 2023, Bouhsain et al., 2020).

7. Significance and Broader Implications

The JAAD corpus enables the study and benchmarking of socially interactive perception–decision pipelines required for deployment of safe, socially aware autonomous vehicles. Synchronized bounding-box and behavioral event annotations allow training of end-to-end vision-to-intent architectures and quantitative evaluation of mutual attention. JAAD’s methodologies and derived models influence hybrid systems that complement deep learning with rule-based and probabilistic protocols, advancing the field toward robust, interactive autonomy capable of genuine joint attention with pedestrians and other road users (Kotseruba et al., 2016, Rasouli et al., 2018).
