
JRDB Dataset for Robotic Perception

Updated 7 July 2025
  • The JRDB dataset is a comprehensive, multimodal dataset offering annotated sensor data for studying egocentric robotic vision and human activity in built environments.
  • It integrates diverse modalities such as stereo 360° video, LiDAR point clouds, and RGB-D imagery to enable detection, tracking, and overall scene understanding.
  • Its extensive annotations and specialized extensions support benchmarks in pose estimation, trajectory forecasting, and social group inference for advanced robotic research.

The JRDB Dataset is a large-scale, multimodal benchmark created to advance perceptual research for robotics, with a distinct focus on egocentric robot vision and human activity understanding in real-world built environments. Captured using the Stanford JackRabbot social mobile manipulator, JRDB and its subsequent extensions offer richly annotated sensor data that underpin research in detection, tracking, social group inference, spatio-temporal activity recognition, pose estimation, and comprehensive scene understanding.

1. Composition and Data Modalities

JRDB’s core release consists of 64 minutes of annotated sensor data acquired in both indoor and outdoor environments on a university campus (1910.11792). Its primary modalities include (a minimal synchronized-loading sketch follows the list):

  • Stereo cylindrical 360° RGB video (15 fps, stitched from five cameras per stereo rig)
  • 3D point clouds from two Velodyne 16-beam LiDARs and two SICK 2D line LiDARs
  • 360° panoramic images from a fisheye camera
  • RGB-D video (30 fps, head-mounted sensor)
  • Audio recordings
  • Robot odometry/encoder data

In total, the dataset features 54 distinct annotated scenes, encompassing stationary and mobile data captures, yielding over 2.3 million 2D bounding boxes and approximately 1.8 million 3D cuboids with trajectory associations for pedestrians, spread across more than 3,500 time-consistent tracks (1910.11792).

Several extensions significantly broaden the dataset:

  • JRDB-Act: Adds over 2.8 million spatio-temporal action labels with social group assignments and annotator confidence levels for 54 sequences (2106.08827).
  • JRDB-Pose: Includes 636,000 pose instances (17 body keypoints, each with a per-keypoint occlusion label) and track IDs for multi-person pose estimation and tracking, with annotations provided at up to 15 Hz (2210.11940).
  • JRDB-Traj: Targets trajectory forecasting, supplying agent trajectories together with synchronized scene images and point clouds, plus an evaluation protocol that reflects the realities of imperfect agent detection and tracking (2311.02736).
  • JRDB-PanoTrack: Enriches the corpus with open-world panoptic segmentation and 3D/2D tracking across crowded human scenes using panoramic and point cloud data (2404.01686).
  • JRDB-Social: Introduces a three-level semantic annotation scheme capturing individual demographics, pairwise in-group interactions, and social group environmental contexts (2404.04458).
  • Group Activity/Scene Graph Datasets: Provide fine-grained group-level relationship, situation, and interaction labels for scene graph generation and group activity understanding (2312.07740).

2. Annotation Schemes and Evaluation Protocols

JRDB employs a multi-layered, fine-grained annotation system (a combined example record follows the list):

  • Detection and Tracking: 2D bounding boxes assigned to all persons, temporally consistent across frames and cameras; 3D cuboids for corresponding pedestrian localizations.
  • Action and Social Labels: Each bounding box is given a mandatory action label (from 11 pose classes and additional interaction/“miscellaneous” classes) plus social group clusterings (2106.08827).
  • Pose Estimation: Every visible person has full-body pose annotation (17 keypoints per person), with three-level occlusion status (Invisible, Occluded, Visible) (2210.11940).
  • Panoptic Segmentation and Tracking: Instance and class-level masks are marked for all “thing” and “stuff” categories, including complex multi-label overlaps (e.g., objects behind glass) (2404.01686).
  • Social Group/Contextual Information: Social group attributes include demographics, group purpose (e.g., working, socializing), bodily-pose connection with scene content, and salient scene features (2404.04458).
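
The sketch below folds these annotation layers into a single hypothetical per-person record. The schema and field names are invented for illustration and do not reflect JRDB's actual label files.

```python
# A hypothetical, simplified per-person annotation for one frame, combining
# the layers described above (not JRDB's actual schema).
person_annotation = {
    "track_id": 17,                  # temporally consistent across frames/cameras
    "box_2d": [412, 188, 96, 240],   # x, y, w, h in panorama pixels
    "cuboid_3d": {                   # 3D pedestrian localization (robot frame)
        "center": [2.4, -0.8, 0.9],  # metres
        "size": [0.6, 0.5, 1.7],     # width, length, height in metres
        "yaw": 1.2,                  # heading in radians
    },
    "action": "walking",             # mandatory pose-based action label
    "social_group": 3,               # cluster ID shared by group members
    "keypoints": [                   # 17 entries of (x, y, occlusion state)
        (430, 200, "Visible"),
        (428, 214, "Occluded"),
        # ... remaining 15 keypoints
    ],
}
```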

Benchmark splits ensure a balanced mix of scene types, robot motion status, and population density. Evaluation metrics are adapted to the complexity of each task:

  • Detection/Tracking: Average Precision (AP, typically at a 0.5 IoU threshold), Multiple Object Tracking Accuracy (MOTA), IDF1, and OSPA-based metrics for instance and temporal matching; a minimal OSPA sketch follows this list (2210.11940, 2311.02736, 2404.01686).
  • Trajectory Forecasting: End-to-End Forecasting Error (EFE) penalizes both localization and association/cardinality errors under real-world detection-tracking uncertainty (2311.02736).
  • Action/Grouping: Mean AP (mAP) for per-action and per-group inference, with partitioned loss functions and eigenvalue-based social group loss (2106.08827).
  • Panoptic: OSPA for spatial/temporal scoring, robust to multi-label and open-world settings (2404.01686).
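
Since OSPA recurs across these benchmarks, a minimal sketch of the standard set-to-set OSPA distance (cutoff c, order p) may help. This is the generic formulation, not the exact task-specific variants (e.g., OSPA-Pose or spatio-temporal OSPA) used by the JRDB papers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=1.0, p=1):
    """Generic OSPA distance between two point sets (rows are elements).

    c penalizes unmatched elements (cardinality errors); p is the metric order.
    """
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)            # one set empty: pure cardinality penalty
    if m > n:
        X, Y, m, n = Y, X, n, m    # ensure |X| <= |Y|
    # Pairwise Euclidean distances, clipped at the cutoff c.
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c)
    row, col = linear_sum_assignment(D ** p)   # optimal one-to-one matching
    cost = (D[row, col] ** p).sum() + (c ** p) * (n - m)
    return (cost / n) ** (1.0 / p)

# Two predicted pedestrian positions vs. three ground-truth positions.
pred = np.array([[0.0, 0.0], [1.0, 1.0]])
gt = np.array([[0.1, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(ospa(pred, gt, c=2.0, p=1))   # localization error plus one miss penalty
```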

3. Key Research Applications

The JRDB family of datasets supports a wide range of robotic perception and interaction research:

  • 2D/3D Person Detection: Benchmarked with methods such as F-PointNet, TANet++, PiFeNet, and Person-MinkUNet, with state-of-the-art accuracy reported up to 76.4% AP on the 3D detection benchmark (2107.06780, 2112.15458, 2106.15366).
  • Multi-Object Tracking (MOT): JRDB enables egocentric and panoramic MOT research, including real-time RGB-LiDAR fusion approaches (e.g., JRMOT (2002.08397)), affinity-based tracking via PointNet (PC-DAN (2106.07552)), and advanced panoramic tracking frameworks such as OmniTrack, which improves HOTA by 3.43% over prior methods (2503.04565).
  • Action/Group Activity Understanding: The dense spatio-temporal action annotations and group clusterings of JRDB-Act and JRDB-Social are leveraged for both basic action recognition and deep group dynamic modeling, including group activity scene graphs and transformer-based multi-modal recognition (2312.07740, 2410.21108).
  • Pose Estimation and Tracking: JRDB-Pose supports state-of-the-art frameworks for pose estimation in occlusion-heavy, panoramic settings, with public benchmarks using both AP and OSPA-Pose metrics (2210.11940, 2303.07141).
  • Trajectory Prediction: JRDB-Traj directly assesses end-to-end forecasting, accounting for detection/spatial errors from upstream components and proposing holistic metrics that do not require perfect agent correspondence (2311.02736).
  • Scene Segmentation and Awareness: Open-world panoptic segmentation and tracking are enabled by multi-label, panoramic, and point cloud projections, fostering research in 2D/3D environmental understanding for autonomous robots (2404.01686).

4. Methodological Innovations and Benchmarks

JRDB and its extensions have driven the development of several methodological advances:

  • Attention Mechanisms: Triple Attention modules for 3D detection (2106.15366), multi-level Pillar Aware Attention for efficient 3D pedestrian detection (2112.15458), and novel flow-based and hierarchical attention for scene graph and group activity generation (2312.07740, 2410.21108).
  • Panoramic and Domain-Specific Adaptations: MOT tailored for panoramic imagery (e.g., OmniTrack’s Tracklets Management and CircularStatE modules (2503.04565)); panoramic pose estimation via nearest-match initialization for HRNet (2303.07141).
  • Learning and Selection in Crowds: Trajectory prediction with Gumbel-Softmax-based importance selection, which maintains accuracy while reducing computational cost on crowded real-world sequences; a minimal sampling sketch follows this list (2506.18291).
  • Robust Benchmarking: Introduction of OSPA-based metrics and partitioned loss strategies to circumvent class imbalance, multi-label ambiguity, and over-reliance on manual thresholds (2210.11940, 2404.01686).
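
For reference, the core Gumbel-Softmax trick behind such differentiable selection can be sketched as follows. This is the generic relaxation, not the specific selection module of the cited work, and the relevance scores below are invented for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a relaxed categorical sample: add Gumbel(0, 1) noise to the
    logits, then apply a temperature-scaled softmax. Lower tau pushes the
    sample toward a hard one-hot selection while remaining differentiable
    in autodiff frameworks."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    return softmax((logits + gumbel) / tau)

# Hypothetical relevance logits for five neighbouring agents: softly select
# the influential ones and prune the rest before running the forecaster.
scores = np.array([2.0, 0.1, -1.0, 1.5, 0.3])
weights = gumbel_softmax(scores, tau=0.5)
keep = weights > 0.1                           # agents kept for prediction
print(weights, keep)
```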

5. Social Context and Interaction Modeling

JRDB-Social and extensions like HAtt-Flow and LiGAR position the JRDB family as a central resource for social interaction and group activity modeling.

  • Multi-Level Social Annotation: From individual (demographics) and intra-group (explicit pairwise interaction types) up to group-level context (scene structure, purpose), JRDB-Social facilitates holistic social understanding relevant for robots operating in populated spaces (2404.04458).
  • Activity Scene Graphs and Group Dynamics: Datasets and models support reasoning over appearance, interaction, spatial relationship, and situational factors, with transformer-based architectures designed to predict and interpret group activities and dynamics (2312.07740, 2410.21108).
  • Evaluation via LLMs: The JRDB-Social benchmark includes systematic LLM-based experiments, demonstrating the intricacies and challenges in automated recognition of complex social dynamics, with performance quantified using accuracy and F₁-score (2404.04458).

6. Impact, Limitations, and Accessibility

The breadth and depth of JRDB and its extensions have positioned it as a foundational benchmark for academic and applied research in robotic perception, navigation, and social interaction.

  • International Collaboration: The project is led by the Stanford Vision and Learning Laboratory (Roberto Martín-Martín, Mihir Patel, Abhijeet Shenoi, JunYoung Gwak, Eric Frankel, Amir Sadeghian, and Silvio Savarese), in partnership with the Department of Data Science and AI at Monash University (Hamid Rezatofighi) (1910.11792).
  • Accessibility: Datasets, code, and evaluation servers are made publicly available for nearly all extensions, fostering reproducibility and ongoing research advances.
  • Recognized Limitations: Some studies note that, despite its coverage, JRDB does not capture the high-entropy, unstructured multi-agent crowd interactions required for multi-agent inverse reinforcement learning (IRL) research, though it remains a leading benchmark for perceptual tasks (2405.16439).
  • Future Directions: The dataset family continues to evolve, with ongoing annotation expansion, broader open-world and social-context coverage, and integration with state-of-the-art algorithmic advances.

Overall, the JRDB dataset suite serves as a comprehensive, technically rigorous, and extensible foundation for perceptual and social reasoning research in robotics, providing both the complexity of real-world sensory environments and the granularity of annotation needed to advance the capabilities of autonomous systems.
