JAAD Dataset for Pedestrian-Vehicle Interaction
- The JAAD dataset is a comprehensive benchmark capturing pedestrian behavior with high-resolution video and dense spatial, behavioral, and contextual annotations.
- It underpins research on pedestrian intention and trajectory prediction, with frame-level labels precise enough for ADAS and autonomous driving applications.
- Established evaluation protocols and pose-based features aid model validation, supporting action recognition and non-verbal communication studies in complex urban traffic.
The Joint Attention in Autonomous Driving (JAAD) dataset is a benchmark corpus designed to facilitate the study of pedestrian–vehicle interaction, intention prediction, and joint attention phenomena in urban traffic. Composed of high-resolution, short video sequences acquired from instrumented vehicles in naturalistic environments, JAAD incorporates comprehensive spatial, behavioral, and contextual annotations. It has become integral to research on pedestrian intention and trajectory prediction, as well as action recognition for Advanced Driver Assistance Systems (ADAS) and autonomous driving.
1. Dataset Composition and Collection Protocols
JAAD is constructed from approximately 240 hours of recorded driving data using two instrumented vehicles equipped with wide-angle cameras centered beneath the rearview mirror. These cameras capture at 30 frames per second in either 1920×1080 or 1280×720 resolution, with a horizontal field of view of 60°–90°. From this footage, 346 video clips (each 5–15 seconds; average ~170 frames/clip) were manually selected to include meaningful pedestrian–vehicle interaction events (Kotseruba et al., 2016, Varytimidis et al., 2018, Rasouli et al., 2017, Gesnouin et al., 2021). The dataset encompasses diverse environments: urban and suburban streets, plazas, indoor passages, and parking lots, with coverage across North America and Europe (majority in Ukraine). Represented scenarios span various weather (clear, cloudy, rain, snow) and lighting (day/night) conditions, crosswalk designs (zebra, non-designated), and population demographics.
2. Annotation Layers and Schema
JAAD adopts a dense, multi-layer annotation strategy per frame, encapsulating spatial, behavioral, and contextual information (Varytimidis et al., 2018).
- Spatial (Bounding Boxes): For each frame, bounding boxes are assigned to:
- Primary pedestrian (with full behavioral tags),
- Secondary pedestrians ("bystanders"),
- Grouped entities (when separation is ambiguous).
- Bounding boxes are specified in pixel coordinates.
- Behavioral Tags: Each annotated pedestrian is labeled with:
- Head orientation: binary indicator of whether the pedestrian is looking toward the vehicle,
- Motion state: binary indicator of walking versus standing,
- Motion direction (e.g., lateral or longitudinal relative to the ego-vehicle),
- Kinematics: categorical movement-speed state,
- Gender,
- Age class,
- Driver actions as one-hot vector (fast, slow, brake, accelerate).
- Contextual (Scene) Tags: These include:
- Number of lanes, location type, signalization (signalized/non-signalized), crossing design (zebra/none), weather, and time of day per frame.
Annotations are distributed as CSV/JSON files with one record per frame, containing all box and tag variables. For detailed behavior, the BORIS tool is used to encode event onsets and offsets (e.g., walking, looking, handwave, etc.) (Rasouli et al., 2017).
A secondary set of annotations, from later evaluation protocols, provides 18-joint 2D skeleton keypoints (OpenPose or ViTPose) and binary crossing-intention labels ("will cross"/"not cross") with crossing timestamps (Gesnouin et al., 2021, Ghiya et al., 11 Mar 2025).
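To make the per-frame record structure concrete, the following sketch shows one way such annotations might be represented and filtered in Python. The field names and record layout are illustrative assumptions, not JAAD's official schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class FrameAnnotation:
    """Illustrative per-frame record combining the annotation layers described above.
    Field names are assumptions, not the dataset's official schema."""
    frame_id: int
    bbox: Tuple[float, float, float, float]                  # pedestrian box in pixel coordinates
    looking: bool                                            # head orientation toward the vehicle
    walking: bool                                            # motion state (walking vs. standing)
    driver_action: str                                       # e.g., "fast", "slow", "brake", "accelerate"
    scene: Dict[str, str] = field(default_factory=dict)      # contextual tags (weather, signalization, ...)
    keypoints: Optional[List[Tuple[float, float]]] = None    # optional 18-joint 2D skeleton
    will_cross: Optional[bool] = None                        # binary crossing-intention label, if available

def attentive_walking_frames(frames: List[FrameAnnotation]) -> List[FrameAnnotation]:
    """Return frames in which the pedestrian is both walking and looking toward the vehicle."""
    return [f for f in frames if f.walking and f.looking]
```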
3. Behavioral Taxonomies and Statistics
The core purpose of JAAD is to operationalize and quantify joint attention and intention-inference cues in street scenes. The behavioral tag set enables:
- Action segmentation: Distinguishing between “crossing,” “not crossing,” “precondition + crossing,” sequences with and without observed attention, and presence/absence of driver responses (Rasouli et al., 2017). For instance, of 654 behavior samples, nine ordered action types are enumerated, including "crossing+attention" and "crossing+attention+reaction."
- Communication events: Empirical analysis shows >90% of pedestrians at non-signalized crossings gaze at the approaching vehicle prior to crossing; sustained "looking" and brief "glancing" are distinguished by gaze duration.
- Temporal relationships: Statistics such as time-to-collision (TTC), gaze duration, and frequency of various driver responses (stop, slow_down, hand_gesture) are derived. For instance, “crossing without attention” only occurs for TTC > 10 s, never for TTC < 2 s.
- Class balance: Behavioral classes (head orientation, motion) are balanced in experimentation via subsampling, e.g., 333 crossing and 333 non-crossing frames in intent recognition (Varytimidis et al., 2018).
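A minimal sketch of the class-balancing subsampling described above, assuming frame-level records stored as dictionaries with a binary crossing label; the record layout and field name are illustrative, while the 333-per-class target follows the figure quoted here.

```python
import random
from typing import Dict, List

def balance_classes(frames: List[Dict], label_key: str = "crossing",
                    per_class: int = 333, seed: int = 0) -> List[Dict]:
    """Subsample an equal number of positive and negative frames
    (e.g., 333 crossing and 333 non-crossing) for intent-recognition experiments."""
    rng = random.Random(seed)
    positives = [f for f in frames if f[label_key]]
    negatives = [f for f in frames if not f[label_key]]
    sampled = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sampled)
    return sampled
```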
4. Evaluation Tasks and Protocols
JAAD underpins multiple research tasks, each with an established protocol (Varytimidis et al., 2018, Gesnouin et al., 2021, Damirchi et al., 2023, Ghiya et al., 11 Mar 2025, Huang et al., 2023):
| Task | Input Features | Evaluation Metric | Reported Result(s) |
|---|---|---|---|
| Head orientation | Box + contextual/appearance features | Classification Accuracy | 97% (CNN+SVM) |
| Motion detection | Box + contextual | Classification Accuracy | 98% (CNN+SVM) |
| Action/intention | Head, motion, context tags | Classification Accuracy; F1 | 89.4% (12 vars) |
| Crossing intention | Skeleton pseudo-images, bbox, JCD, attention | F1, Precision/Recall, AUC | 0.76 (TrouSPI-Net, F1) |
| Trajectory prediction | Bounding box (center, size), pose angles | MSE, CMSE, CFMSE, ADE, FDE | 62.62 @ 0.5s (SGNetPose+) |
| Action prediction | Bbox center/size + derived speeds | Classification Accuracy | 81% (Transformer TF-ed) |
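The "derived speeds" in the action-prediction row can be obtained by differencing bounding-box centers over time. The sketch below assumes boxes stored as (x1, y1, x2, y2) in pixels at 30 fps; this is an assumed preprocessing recipe rather than the exact pipeline of the cited work.

```python
import numpy as np

def bbox_center_speeds(boxes: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Given per-frame boxes of shape (T, 4) as (x1, y1, x2, y2) in pixels,
    return center velocities of shape (T-1, 2) in pixels per second."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                        (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    return np.diff(centers, axis=0) * fps
```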
Evaluation splits are typically either by trajectory (50/10/40 for train/val/test) or by frame within video clips (e.g., 60/40 split in cross-validation) (Damirchi et al., 2023, Varytimidis et al., 2018, Huang et al., 2023). Metrics vary by task: classification accuracy, confusion matrix entries (TP/TN/FP/FN), F1 score, mean squared error over future trajectory, average/final displacement error.
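The displacement metrics listed above reduce to simple operations on predicted versus ground-truth positions; the sketch below gives the standard ADE/FDE definitions and a trajectory MSE, assuming both inputs are arrays of future (x, y) centers in pixels.

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Average and final displacement error for one predicted trajectory.
    pred, gt: arrays of shape (T, 2) with future (x, y) positions in pixels."""
    dists = np.linalg.norm(pred - gt, axis=1)     # per-timestep Euclidean error
    return float(dists.mean()), float(dists[-1])  # ADE over the horizon, FDE at the final step

def trajectory_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error over all predicted coordinates, as used for bounding-box forecasts."""
    return float(np.mean((pred - gt) ** 2))
```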
Recent advances augment these inputs with pose features (2D skeleton keypoints from OpenPose or ViTPose, plus geometric angles of body segments) for improved trajectory or intention prediction (Gesnouin et al., 2021, Ghiya et al., 11 Mar 2025).
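As an illustration of pose-derived geometric features, the sketch below computes a few body-segment angles from an 18-joint 2D skeleton. The joint indices follow the common OpenPose/COCO 18-keypoint ordering, and the particular segments chosen are illustrative rather than the exact features used in the cited works.

```python
import numpy as np

def segment_angle(p_a: np.ndarray, p_b: np.ndarray) -> float:
    """Angle (radians) of the segment from joint p_a to joint p_b relative to the image x-axis."""
    dx, dy = p_b[0] - p_a[0], p_b[1] - p_a[1]
    return float(np.arctan2(dy, dx))

def example_body_angles(keypoints: np.ndarray) -> dict:
    """Compute illustrative segment angles from an (18, 2) keypoint array.
    Indices assume OpenPose/COCO ordering (1: neck, 8: right hip, 9: right knee, 10: right ankle)."""
    return {
        "torso": segment_angle(keypoints[1], keypoints[8]),        # neck -> right hip
        "right_thigh": segment_angle(keypoints[8], keypoints[9]),  # right hip -> right knee
        "right_shin": segment_angle(keypoints[9], keypoints[10]),  # right knee -> right ankle
    }
```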
5. Downstream Applications and Research Impact
JAAD serves as a canonical test-bed for:
- Pedestrian intention prediction: Enabling model development for tasks such as early detection of crossing versus non-crossing behavior using multimodal cues (head pose, body language, contextual scene variables) (Varytimidis et al., 2018, Gesnouin et al., 2021).
- Trajectory forecasting: Facilitating supervised learning of multi-modal, temporally coherent trajectory predictors under challenging first-person (ego-vehicle) view geometry (Damirchi et al., 2023, Ghiya et al., 11 Mar 2025).
- Study of non-verbal communication: Enabling joint-attention research on mutual signaling, anticipation, and safe human–robot interaction in traffic. Explicit labels for gaze, gestures, and behavioral sequences enable detailed empirical analyses (Rasouli et al., 2017).
- Benchmarking machine learning models: Providing robust comparison standards for deep learning architectures (e.g., CNN + SVM, LSTM, Transformer, CVAE, U-GRU), incorporating pose streams and complex scene encodings, under real-world noise and occlusion.
- Validation of perception modules: Supporting detection and tracking pipelines for bounding boxes, human pose, and action recognition, evaluated in diverse weather, lighting, and density regimes.
6. Dataset Limitations and Considerations
Key limitations include:
- Geographic and scene bias: Heavily weighted toward Eastern European urban topologies (~80% from Ukraine) (Kotseruba et al., 2016).
- Environmental coverage: Underrepresentation of severe-weather and night scenes; limited samples for certain rare behaviors or demographics.
- Modality: Lacks depth, LiDAR, or explicit metric speed labels; speed is categorical, not quantitative (except as inferred from video).
- Identity tracking: No built-in pedestrian identity tracking; pose-based intention models rely on external trackers (e.g., OpenPose, PoseFlow).
- Label granularity: While comprehensive, some annotations (e.g., pose, crossing intent, trajectory) are available only for subsets of the data (with success rates dictated by pose estimator or annotation protocol).
7. Evolving Usage and Extensions
Subsequent works increasingly leverage pose-based features for improved performance in both intention and trajectory tasks (Ghiya et al., 11 Mar 2025, Gesnouin et al., 2021), with data augmentation (e.g., horizontal flipping) used to enlarge the effective training set. Recent models such as SGNetPose+ integrate both bounding boxes and body-joint angles via goal-driven RNN-CVAE decoders, yielding lower trajectory prediction errors than previous architectures, with JAAD serving as a main evaluation suite (Ghiya et al., 11 Mar 2025).
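A minimal sketch of the horizontal-flip augmentation mentioned above, assuming boxes as (x1, y1, x2, y2) and keypoints as (x, y) pairs in pixel coordinates with a known image width; swapping left/right joint indices after the flip is omitted here but would be needed for side-specific skeleton features.

```python
import numpy as np

def hflip_bbox(box: np.ndarray, img_width: int) -> np.ndarray:
    """Mirror an (x1, y1, x2, y2) box about the vertical image axis."""
    x1, y1, x2, y2 = box
    return np.array([img_width - x2, y1, img_width - x1, y2])

def hflip_keypoints(keypoints: np.ndarray, img_width: int) -> np.ndarray:
    """Mirror (N, 2) keypoints; left/right joint indices should also be swapped afterwards."""
    flipped = keypoints.copy()
    flipped[:, 0] = img_width - flipped[:, 0]
    return flipped
```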
Evaluation practices frequently cross-reference splits and pre-processing pipelines established in prior benchmarks (Gesnouin et al., 2021). The dataset continues to guide the development of architectures that bridge intention and motion, highlighting the nuances of joint attention, parallax, and perspective encountered in first-person urban driving.
Key References
- (Kotseruba et al., 2016) Joint Attention in Autonomous Driving (JAAD)
- (Rasouli et al., 2017) Agreeing to Cross: How Drivers and Pedestrians Communicate
- (Varytimidis et al., 2018) Action and intention recognition of pedestrians in urban traffic
- (Gesnouin et al., 2021) TrouSPI-Net: Spatio-temporal attention on parallel atrous convolutions and U-GRUs for skeletal pedestrian crossing prediction
- (Damirchi et al., 2023) Context-aware Pedestrian Trajectory Prediction with Multimodal Transformer
- (Huang et al., 2023) Learning Pedestrian Actions to Ensure Safe Autonomous Driving
- (Ghiya et al., 11 Mar 2025) SGNetPose+: Stepwise Goal-Driven Networks with Pose Information for Trajectory Prediction in Autonomous Driving