JAAD: Joint Attention in Autonomous Driving

Updated 11 December 2025

Joint Attention in Autonomous Driving (JAAD) is a public dataset providing high-resolution urban driving videos with detailed spatiotemporal and behavioral annotations.
It includes multi-layer annotations such as bounding boxes, event logs, and contextual metadata to analyze pedestrian intent, driver response, and time-to-collision dynamics.
JAAD facilitates empirical research into joint attention inference and behavioral prediction, serving as a benchmark for enhancing autonomous driving safety and communication studies.

The Joint Attention in Autonomous Driving (JAAD) Dataset is a public resource specifically designed to support research on non-verbal communication, behavioral prediction, and joint attention dynamics between drivers and pedestrians in varied urban traffic scenarios. Through high-resolution video, detailed spatiotemporal annotations, and behavioral event streams, JAAD enables systematic analysis of visual cues and interaction protocols critical to the safety and functionality of both human-driven and autonomous vehicles (Rasouli et al., 2017, Varytimidis et al., 2018, Kotseruba et al., 2016).

1. Data Acquisition and Scope

JAAD comprises 346 high-resolution video clips, each spanning 5–15 seconds, sampled from approximately 240 hours of naturalistic driving. Two experimental vehicles—each fitted with a wide-angle dashboard camera mounted centrally just below the rear-view mirror—were used to collect footage across a range of environments:

Geographic Locations: North York, Canada (55 clips, GoPro HERO+, 1920×1080); Kremenchuk, Ukraine (276 clips, Garmin GDR-35, 1920×1080); Hamburg, Germany (6 clips, Highscreen Black Box, 1280×720); Lviv, Ukraine (4 clips, Highscreen Black Box, 1280×720); New York, USA (5 clips, GoPro HERO+, 1920×1080).
Environmental Variation: Clips cover urban downtown, suburban, a minority of residential/rural segments, and variable weather and daylight conditions including sunny, overcast, rainy, snowy, and sun-glare states. Over 90% of video sequences are set in urban and suburban streets (Rasouli et al., 2017, Varytimidis et al., 2018, Kotseruba et al., 2016).

2. Annotation Paradigms and Category Schema

JAAD provides comprehensive multi-layered annotation encompassing spatial, temporal, behavioral, and contextual aspects:

Bounding Boxes: Manually labeled, per-frame tracks for all pedestrians and interacting vehicles, formatted as XML files (fields: frame index, x-y pixel coordinates, width, height, occlusion flag).
Behavioral Event Logs: Event-based time-stamped state/action annotations (using BORIS²), e.g., {walking, standing, looking, glancing, nodding, hand_gesture, crossing}, structured per-subject (e.g., ped1, ped2, Driver).
Demographics: Pedestrian age group (child, adult, elderly), gender.
Scene Context: Crosswalk type (non-designated, zebra, signalized), weather, time-of-day.
Joint Attention Cues: Defined as two-party interactions:
- Pedestrian: Looking (gaze ≥ 1s toward oncoming vehicle), glancing (gaze < 1s), secondary cues (nods, hand gestures).
- Driver: Speed maintenance/slight acceleration, slowing, full stop, explicit yielding (wheel turn, brake-light flash) (Rasouli et al., 2017).

Annotation quality control was conducted via manual cross-verification; inter-annotator agreement metrics (e.g., Cohen’s κ) were not reported but are flagged as future work (Rasouli et al., 2017, Kotseruba et al., 2016).

3. Metrics, Analytical Structures, and Modeling

JAAD is structured to facilitate empirical analysis of joint attention, pedestrian intent prediction, and driver-pedestrian negotiation:

Time-to-Collision (TTC): Defined as

$\mathrm{TTC} = \frac{d}{v}$

where $d$ is the vehicle-to-crossing-point distance and $v$ is vehicle speed (assumed constant). Four TTC bins are provided: $<$ 3 s, 3–7 s, 7–15 s, $\geq$ 15 s. No crossing occurs without attention for $\mathrm{TTC}<2$ s; $\geq$ 50% of crossings without attention occur for $\mathrm{TTC}>10$ s (Rasouli et al., 2017).

Conditional Dependency Modeling: Empirical histograms report conditional probabilities such as:

$P(\mathrm{Attention}\mid \mathrm{TTC}=t)$

and

$P(\mathrm{Cross} \mid \mathrm{Attention},\,\mathrm{CrosswalkType}=c,\,\mathrm{Response}=r)$

capturing dependencies among attention, crossing likelihood, TTC, crosswalk type, and driver response; no parametric regression was performed (Rasouli et al., 2017).

Behavior Recognition Baselines: Using frame-aligned features (e.g., HOG, LBP, CNN embeddings from AlexNet) and classifiers (SVM, k-NN, ANN, DT):
- Head orientation (HOG): 72% accuracy; (CNN): 70%
- Motion detection (CNN): 85% accuracy (Varytimidis et al., 2018)
Annotation Structure Overview Table:

Layer	Content (Examples)	Format
Bounding Boxes	Pedestrians, vehicles	XML
Behavioral Events	Looking, walking, gestures	CSV (BORIS2)
Context Metadata	Weather, crosswalk type, age	CSV

4. Key Empirical Findings

Analysis of JAAD yields several robust findings regarding pedestrian and driver behavior at crossing points (Rasouli et al., 2017):

Visual Attention Precedes Crossing: In non-signalized crosswalks, $\gt$ 90% of pedestrians perform a looking action before crossing, with the rest split between glancing and absent attention (which only occurs at $\mathrm{TTC}\gt6$ s).
Mid-Crossing Monitoring: Secondary attention behaviors (mid-crossing glance, look-back) arise in ~25% of crossings.
Attention Modulation by TTC: The likelihood of pedestrian attention rises sharply as TTC drops below 7 s; inattentive crossings are never observed at $\mathrm{TTC}\lt2$ s.
Attention Duration by Age: Elderly pedestrians average $\sim$ 1 s longer gaze than adults pre-crossing; children’s dwell times are modestly shorter.
Driver Response and Crosswalk Type:
- At non-designated crosswalks, if the driver slows/stops, $\approx$ 85% of pedestrians cross after attention. If the driver maintains speed, $\lt$ 10% cross (except $\mathrm{TTC}\gt25$ s).
- At zebra-marked/signalized crossings, crossing rates after attention exceed 95%. When drivers accelerate, pedestrians hold unless $\mathrm{TTC}\gt20$ s.

5. File Structure, Access, and Recommended Processing

File and Directory Format: Video (H.264 MP4, 1920×1080/1280×720, $\sim$ $\sim$ 30 fps), bounding boxes (per-clip XML), behavioral logs (CSV from BORIS2), e.g.:
- /JAAD/Videos/
- /JAAD/Boxes/
- /JAAD/Annotations/
Usage License: Publicly available to academic researchers under Creative Commons BY-NC-SA. Ethics approval: York University (#2016-203) (Rasouli et al., 2017).
Preprocessing Recommendations:

Undistort and crop to 16:9 aspect ratio.
Frame sampling at 10 fps for behavior modeling.
Per-channel mean normalization of pixel intensities.
Use provided boxes to train pedestrian/vehicle detectors (e.g., SSD, YOLO).
Extract head-pose/gaze features for in-box pedestrians (e.g., two-stage CNN).
Build 1–3 s temporal windows before crossings for LSTM/GRU sequence models (Rasouli et al., 2017).

6. Research Applications and Limitations

JAAD serves both as a benchmarking resource and as a foundation for computational research in:

Joint Attention Inference: By combining attention cues with action labels and TTC, the dataset enables probabilistic modeling of joint attention, negotiation, and right-of-way inference in traffic scenes (Rasouli et al., 2017, Varytimidis et al., 2018).
Behavioral Prediction: Detection and prediction of pedestrian intent (e.g., imminent crossing) from visual cues and driver actions; supports both rule-based and data-driven approaches.
Limitations: JAAD is monocular video only—no LiDAR, radar, or multi-view data. Event annotations may lack inter-annotator agreement; head orientation is binary (looking/not looking), limiting gaze granularity. Cyclist interactions are rare and not separately annotated. The moderate number of clips may not cover edge cases (e.g., severe occlusion, night scenarios) (Kotseruba et al., 2016, Varytimidis et al., 2018).

7. Extensions and Suggested Improvements

Research utilizing JAAD highlights potential directions for dataset and methodology enrichment:

Higher Granularity Annotations: Finer head-pose and gaze-direction labeling.
Sensor Diversity: Inclusion of additional modalities (e.g., LiDAR, radar, optical flow).
Deeper Learning Architectures: Application of advanced models (e.g., VGG, ResNet) and explicit pose-based descriptors for enhanced effect modeling (Varytimidis et al., 2018).
Automated Semantic Parsing: Automated annotation for additional contextual variables (e.g., crosswalk state, signalization) as future enhancement targets.

JAAD constitutes a comprehensive benchmark for the empirical study of joint attention, non-verbal communication, and pedestrian intention in realistic urban traffic, with strong relevance for autonomous driving research (Rasouli et al., 2017, Varytimidis et al., 2018, Kotseruba et al., 2016).