
JRDB Dataset for Robotic Perception

Updated 7 July 2025
  • The JRDB dataset is a comprehensive, multimodal dataset offering annotated sensor data for studying egocentric robotic vision and human activity in built environments.
  • It integrates diverse modalities such as stereo 360° video, LiDAR point clouds, and RGB-D imagery to enable detection, tracking, and overall scene understanding.
  • Its extensive annotations and specialized extensions support benchmarks in pose estimation, trajectory forecasting, and social group inference for advanced robotic research.

The JRDB Dataset is a large-scale, multimodal benchmark created to advance perceptual research for robotics, with a distinct focus on egocentric robot vision and human activity understanding in real-world built environments. Captured using the Stanford JackRabbot social mobile manipulator, JRDB and its subsequent extensions offer richly annotated sensor data that underpin research in detection, tracking, social group inference, spatio-temporal activity recognition, pose estimation, and comprehensive scene understanding.

1. Composition and Data Modalities

JRDB’s core release consists of 64 minutes of annotated sensor data acquired in both indoor and outdoor environments on a university campus (1910.11792). Its primary modalities include (a minimal synchronized-loading sketch follows the list):

  • Stereo cylindrical 360° RGB video (15 fps, stitched from five cameras per stereo rig)
  • 3D point clouds from two Velodyne 16-beam LiDARs and two SICK 2D line LiDARs
  • 360° panoramic images from a fisheye camera
  • RGB-D video (30 fps, head-mounted sensor)
  • Audio recordings
  • Robot odometry/encoder data

In total, the dataset features 54 distinct annotated scenes, encompassing stationary and mobile data captures, yielding over 2.3 million 2D bounding boxes and approximately 1.8 million 3D cuboids with trajectory associations for pedestrians, spread across more than 3,500 time-consistent tracks (1910.11792).

Several extensions significantly broaden the dataset:

  • JRDB-Act: Adds over 2.8 million spatio-temporal action labels with social group assignments and annotator confidence levels for 54 sequences (2106.08827).
  • JRDB-Pose: Includes 636,000 pose instances (17 body keypoints, each with a per-keypoint occlusion label) and track IDs for multi-person pose estimation and tracking, with annotations provided at up to 15 Hz (2210.11940).
  • JRDB-Traj: Targets trajectory forecasting, supplying agent trajectories together with synchronized scene images and point clouds, plus an evaluation protocol that reflects the realities of imperfect agent detection and tracking (2311.02736).
  • JRDB-PanoTrack: Enriches the corpus with open-world panoptic segmentation and 3D/2D tracking across crowded human scenes using panoramic and point cloud data (2404.01686).
  • JRDB-Social: Introduces a three-level semantic annotation scheme capturing individual demographics, pairwise in-group interactions, and social group environmental contexts (2404.04458).
  • Group Activity/Scene Graph Datasets: Provide fine-grained group-level relationship, situation, and interaction labels for scene graph generation and group activity understanding (2312.07740).

2. Annotation Schemes and Evaluation Protocols

JRDB employs a multi-layered, fine-grained annotation system (a combined example record follows the list):

  • Detection and Tracking: 2D bounding boxes assigned to all persons, temporally consistent across frames and cameras; 3D cuboids for corresponding pedestrian localizations.
  • Action and Social Labels: Each bounding box is given a mandatory action label (from 11 pose classes and additional interaction/“miscellaneous” classes) plus social group clusterings (2106.08827).
  • Pose Estimation: Every visible person has full-body pose annotation (17 keypoints per person), with three-level occlusion status (Invisible, Occluded, Visible) (2210.11940).
  • Panoptic Segmentation and Tracking: Instance and class-level masks are marked for all “thing” and “stuff” categories, including complex multi-label overlaps (e.g., objects behind glass) (2404.01686).
  • Social Group/Contextual Information: Social group attributes include demographics, group purpose (e.g., working, socializing), bodily-pose connection with scene content, and salient scene features (2404.04458).
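
The sketch below folds these annotation layers into a single hypothetical per-person record. The schema and field names are invented for illustration and do not reflect JRDB's actual label files.

```python
# A hypothetical, simplified per-person annotation for one frame, combining
# the layers described above (not JRDB's actual schema).
person_annotation = {
    "track_id": 17,                  # temporally consistent across frames/cameras
    "box_2d": [412, 188, 96, 240],   # x, y, w, h in panorama pixels
    "cuboid_3d": {                   # 3D pedestrian localization (robot frame)
        "center": [2.4, -0.8, 0.9],  # metres
        "size": [0.6, 0.5, 1.7],     # width, length, height in metres
        "yaw": 1.2,                  # heading in radians
    },
    "action": "walking",             # mandatory pose-based action label
    "social_group": 3,               # cluster ID shared by group members
    "keypoints": [                   # 17 entries of (x, y, occlusion state)
        (430, 200, "Visible"),
        (428, 214, "Occluded"),
        # ... remaining 15 keypoints
    ],
}
```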

Benchmark splits ensure a balanced mix of scene types, robot motion status, and population density. Evaluation metrics are adapted to the complexity of each task:

  • Detection/Tracking: Average Precision (AP, typically at a 0.5 IoU threshold), Multiple Object Tracking Accuracy (MOTA), IDF1, and OSPA-based metrics for instance and temporal matching; a minimal OSPA sketch follows this list (2210.11940, 2311.02736, 2404.01686).
  • Trajectory Forecasting: End-to-End Forecasting Error (EFE) penalizes both localization and association/cardinality errors under real-world detection-tracking uncertainty (2311.02736).
  • Action/Grouping: Mean AP (mAP) for per-action and per-group inference, with partitioned loss functions and eigenvalue-based social group loss (2106.08827).
  • Panoptic: OSPA for spatial/temporal scoring, robust to multi-label and open-world settings (2404.01686).
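
Since OSPA recurs across these benchmarks, a minimal sketch of the standard set-to-set OSPA distance (cutoff c, order p) may help. This is the generic formulation, not the exact task-specific variants (e.g., OSPA-Pose or spatio-temporal OSPA) used by the JRDB papers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=1.0, p=1):
    """Generic OSPA distance between two point sets (rows are elements).

    c penalizes unmatched elements (cardinality errors); p is the metric order.
    """
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)            # one set empty: pure cardinality penalty
    if m > n:
        X, Y, m, n = Y, X, n, m    # ensure |X| <= |Y|
    # Pairwise Euclidean distances, clipped at the cutoff c.
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c)
    row, col = linear_sum_assignment(D ** p)   # optimal one-to-one matching
    cost = (D[row, col] ** p).sum() + (c ** p) * (n - m)
    return (cost / n) ** (1.0 / p)

# Two predicted pedestrian positions vs. three ground-truth positions.
pred = np.array([[0.0, 0.0], [1.0, 1.0]])
gt = np.array([[0.1, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(ospa(pred, gt, c=2.0, p=1))   # localization error plus one miss penalty
```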

3. Key Research Applications

The JRDB family of datasets supports a wide range of robotic perception and interaction research:

  • 2D/3D Person Detection: Benchmarked with methods such as F-PointNet, TANet++, PiFeNet, and Person-MinkUNet, with state-of-the-art accuracy reported up to 76.4% AP on the 3D detection benchmark (2107.06780, 2112.15458, 2106.15366).
  • Multi-Object Tracking (MOT): JRDB enables egocentric and panoramic MOT research, including real-time RGB-LiDAR fusion approaches (e.g., JRMOT (2002.08397)), affinity-based tracking via PointNet (PC-DAN (2106.07552)), and advanced panoramic tracking frameworks such as OmniTrack, which improves HOTA by 3.43% over prior methods (2503.04565).
  • Action/Group Activity Understanding: The dense spatio-temporal action annotations and group clusterings of JRDB-Act and JRDB-Social are leveraged for both basic action recognition and deep group dynamic modeling, including group activity scene graphs and transformer-based multi-modal recognition (2312.07740, 2410.21108).
  • Pose Estimation and Tracking: JRDB-Pose supports state-of-the-art frameworks for pose estimation in occlusion-heavy, panoramic settings, with public benchmarks using both AP and OSPA-Pose metrics (2210.11940, 2303.07141).
  • Trajectory Prediction: JRDB-Traj directly assesses end-to-end forecasting, accounting for detection/spatial errors from upstream components and proposing holistic metrics that do not require perfect agent correspondence (2311.02736).
  • Scene Segmentation and Awareness: Open-world panoptic segmentation and tracking are enabled by multi-label, panoramic, and point cloud projections, fostering research in 2D/3D environmental understanding for autonomous robots (2404.01686).

4. Methodological Innovations and Benchmarks

JRDB and its extensions have driven the development of several methodological advances:

  • Attention Mechanisms: Triple Attention modules for 3D detection (2106.15366), multi-level Pillar Aware Attention for efficient 3D pedestrian detection (2112.15458), and novel flow-based and hierarchical attention for scene graph and group activity generation (2312.07740, 2410.21108).
  • Panoramic and Domain-Specific Adaptations: MOT tailored for panoramic imagery (e.g., OmniTrack’s Tracklets Management and CircularStatE modules (2503.04565)); panoramic pose estimation via nearest-match initialization for HRNet (2303.07141).
  • Learning and Selection in Crowds: Trajectory prediction with Gumbel-Softmax-based importance selection, which maintains accuracy while reducing computational cost on crowded real-world sequences; a minimal sampling sketch follows this list (2506.18291).
  • Robust Benchmarking: Introduction of OSPA-based metrics and partitioned loss strategies to circumvent class imbalance, multi-label ambiguity, and over-reliance on manual thresholds (2210.11940, 2404.01686).
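
For reference, the core Gumbel-Softmax trick behind such differentiable selection can be sketched as follows. This is the generic relaxation, not the specific selection module of the cited work, and the relevance scores below are invented for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a relaxed categorical sample: add Gumbel(0, 1) noise to the
    logits, then apply a temperature-scaled softmax. Lower tau pushes the
    sample toward a hard one-hot selection while remaining differentiable
    in autodiff frameworks."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    return softmax((logits + gumbel) / tau)

# Hypothetical relevance logits for five neighbouring agents: softly select
# the influential ones and prune the rest before running the forecaster.
scores = np.array([2.0, 0.1, -1.0, 1.5, 0.3])
weights = gumbel_softmax(scores, tau=0.5)
keep = weights > 0.1                           # agents kept for prediction
print(weights, keep)
```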

5. Social Context and Interaction Modeling

JRDB-Social and extensions like HAtt-Flow and LiGAR position the JRDB family as a central resource for social interaction and group activity modeling.

  • Multi-Level Social Annotation: From individual (demographics) and intra-group (explicit pairwise interaction types) up to group-level context (scene structure, purpose), JRDB-Social facilitates holistic social understanding relevant for robots operating in populated spaces (2404.04458).
  • Activity Scene Graphs and Group Dynamics: Datasets and models support reasoning over appearance, interaction, spatial relationship, and situational factors, with transformer-based architectures designed to predict and interpret group activities and dynamics (2312.07740, 2410.21108).
  • Evaluation via LLMs: The JRDB-Social benchmark includes systematic LLM-based experiments, demonstrating the intricacies and challenges in automated recognition of complex social dynamics, with performance quantified using accuracy and F₁-score (2404.04458).

6. Impact, Limitations, and Accessibility

The breadth and depth of JRDB and its extensions have positioned it as a foundational benchmark for academic and applied research in robotic perception, navigation, and social interaction.

  • International Collaboration: The project is led by the Stanford Vision and Learning Laboratory (Roberto Martín-Martín, Mihir Patel, Abhijeet Shenoi, JunYoung Gwak, Eric Frankel, Amir Sadeghian, and Silvio Savarese), in partnership with the Department of Data Science and AI at Monash University (Hamid Rezatofighi) (1910.11792).
  • Accessibility: Datasets, code, and evaluation servers are made publicly available for nearly all extensions, fostering reproducibility and ongoing research advances.
  • Recognized Limitations: Some studies note that, despite its coverage, JRDB does not capture the high-entropy, unstructured multi-agent crowd interactions required for multi-agent inverse reinforcement learning (IRL) research, though it remains a leading benchmark for perceptual tasks (2405.16439).
  • Future Directions: The dataset family continues to evolve, with ongoing annotation expansion, broader open-world and social-context coverage, and integration with state-of-the-art algorithmic advances.

Overall, the JRDB dataset suite serves as a comprehensive, technically rigorous, and extensible foundation for perceptual and social reasoning research in robotics, providing both the complexity of real-world sensory environments and the granularity of annotation needed to advance the capabilities of autonomous systems.
