LeRobot: Multimodal Robot Dataset Schema
- LeRobot Dataset Schema is a structured approach to organizing multimodal robot data, emphasizing spatial relationships and dynamic schema adaptability.
- It employs precise annotation workflows with tools like SGDET-Annotate to generate detailed object attributes and spatial predicate triplets for robust machine learning integration.
- The schema enhances data sharing, retrieval, and real-world task planning through efficient compression, scene graph representations, and semantic label enrichment.
The LeRobot Dataset Schema denotes a structured framework for representing and managing multimodal robot datasets, emphasizing spatial relationships, efficient compression, dynamic schema adaptability, and integration into retrieval- and learning-oriented systems. These datasets are designed to facilitate data sharing, search, benchmarking, and robust machine learning applications in robotics, particularly in environments characterized by complex object arrangements and evolving ontological structures.
1. Dataset Structure and Composition
LeRobot-style datasets typically consist of robot-collected sensory data—including RGB video, textual commands, numerical sensor streams, and episode metadata—with an emphasis on spatial relationship annotation for objects present in indoor scenes. A major example contains nearly 1,000 images captured via a Boston Dynamics Spot robot, with scenes including diverse objects such as bottles, foam cubes, remote controls, books, and humans (Wang et al., 14 Jun 2025). Each image is annotated for object attributes (e.g., class, color, material), bounding boxes, and a curated set of spatial relationships that encode complex spatial arrangements and contextual object interactions.
Spatial relationships are formalized using seven predicates: "behind," "in front of," "on," "to the left of," "to the right of," "under," and "near." The preferred representation for downstream models and foundation model integration is a scene graph G = (O, R), where O denotes the set of objects and R encapsulates the subject-predicate-object triplets for spatial relations.
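A minimal sketch of this scene-graph representation in Python (class and field names are illustrative, not the dataset's exact schema; only the seven predicates come from the text):

```python
from dataclasses import dataclass, field

# The seven spatial predicates defined by the dataset.
PREDICATES = {
    "behind", "in front of", "on",
    "to the left of", "to the right of", "under", "near",
}

@dataclass
class SceneObject:
    obj_id: int
    cls: str                       # e.g. "bottle"
    bbox: tuple                    # axis-aligned (x_min, y_min, x_max, y_max)
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red"}

@dataclass
class SceneGraph:
    objects: list                  # O: the node set
    relations: list                # R: (subject_id, predicate, object_id) triplets

    def add_relation(self, subj: int, pred: str, obj: int):
        if pred not in PREDICATES:
            raise ValueError(f"unknown spatial predicate: {pred!r}")
        self.relations.append((subj, pred, obj))

g = SceneGraph(
    objects=[SceneObject(0, "bottle", (10, 20, 60, 120)),
             SceneObject(1, "book", (5, 100, 200, 160))],
    relations=[],
)
g.add_relation(0, "on", 1)         # encodes "bottle on book"
```

Restricting relations to a closed predicate set keeps the triplet vocabulary consistent across annotators and export formats.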
2. Annotation Workflow and Data Format
Annotation is performed with dedicated tools such as SGDET-Annotate, which support manual axis-aligned bounding box generation and assignment of object classes and spatial relationship triplets. Annotators select the subject and object pairs and specify one of the seven spatial predicates for each pair. Quality control involves majority voting among multiple trained annotators and central cleaning for predicate merging (e.g., "above," "over," and "on" unified), as well as exclusion of images with insufficient object count for spatial labeling.
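The quality-control step can be sketched as a canonicalize-then-vote pass; the merge table below reflects the "above"/"over"/"on" unification mentioned above, while the tie-handling policy is an assumption:

```python
from collections import Counter

# Predicate spellings merged during central cleaning
# (the text notes "above", "over", and "on" are unified).
CANONICAL = {"above": "on", "over": "on"}

def clean_predicate(p):
    return CANONICAL.get(p, p)

def majority_vote(labels):
    """Return the winning predicate after canonicalization,
    or None on a tie (flagged for central cleaning)."""
    counts = Counter(clean_predicate(p) for p in labels)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None
    return top

print(majority_vote(["on", "above", "near"]))  # "above" merges into "on" -> "on"
```

Canonicalizing before counting matters: without the merge table, three annotators writing "on", "above", and "near" would produce a three-way tie instead of a clear majority.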
Exports are provided in:
- YOLOv10m-style text files suitable for object detection training;
- Visual Genome–compatible JSON, which encodes all objects, attributes, and spatial relationships for SGG models.
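A sketch of the two export paths, assuming standard YOLO normalized-box text lines and a Visual Genome-style JSON layout (the JSON key names here are illustrative, not the dataset's exact schema):

```python
import json

def to_yolo_line(cls_id, bbox, img_w, img_h):
    """YOLO text format: class x_center y_center width height, all normalized."""
    x0, y0, x1, y1 = bbox
    xc = (x0 + x1) / 2 / img_w
    yc = (y0 + y1) / 2 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

def to_vg_entry(image_id, objects, relations):
    """Visual Genome-style record: objects with attributes plus
    subject-predicate-object triplets, serializable with json.dumps."""
    return {
        "image_id": image_id,
        "objects": [
            {"object_id": o["id"], "names": [o["cls"]],
             "attributes": o.get("attributes", []), "bbox": o["bbox"]}
            for o in objects
        ],
        "relationships": [
            {"subject_id": s, "predicate": p, "object_id": o}
            for (s, p, o) in relations
        ],
    }
```

The same annotation thus feeds both pipelines: the YOLO lines carry only boxes and classes for detector training, while the VG JSON preserves attributes and relationships for SGG models.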
The dataset, annotation interface, and documentation are available under open-source licenses for research utilization (Wang et al., 14 Jun 2025).
3. Data Compression and Storage Schema
LeRobot employs lossy video compression (e.g., AV1, typically CRF=30) to store vision data, achieving substantial file size reductions (approximately 70× compared to raw frames in Robo-DM and similar ratios in LeRobot) (Chen et al., 21 May 2025). Vision streams are stored as separate MP4 files per camera and episode. Language and action data (including episode-level metadata) utilize HuggingFace datasets backed by Apache Arrow and safetensors, resulting in a hybrid multi-file structure.
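The per-camera, per-episode AV1 encoding step can be sketched as an ffmpeg invocation; the CRF=30 setting comes from the text, while the encoder choice (libsvtav1), frame rate, and pixel format are assumptions:

```python
import subprocess

def build_av1_cmd(frames_pattern, out_mp4, crf=30, fps=30):
    """Build an ffmpeg command encoding raw frames into one AV1 MP4
    per camera and episode. Encoder and pixel-format flags are
    illustrative defaults, not LeRobot's exact configuration."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frames_pattern,        # e.g. "cam0/frame_%06d.png"
        "-c:v", "libsvtav1",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        out_mp4,
    ]

def encode_episode(frames_pattern, out_mp4, crf=30):
    subprocess.run(build_av1_cmd(frames_pattern, out_mp4, crf), check=True)

cmd = build_av1_cmd("cam0/frame_%06d.png", "episode0_cam0.mp4")
print(" ".join(cmd))
```

Keeping one MP4 per camera stream is what produces the hybrid multi-file layout described above, in contrast to Robo-DM's single-container approach.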
In comparison, the Robo-DM schema encapsulates multimodal robot data into a unified EBML (Extensible Binary Meta Language) container, multiplexing all modalities and associated alignment metadata within a single file. This structural difference impacts loading performance—Robo-DM achieves up to 50× faster sequential decoding and more efficient retrieval during model training, while LeRobot’s multi-source querying introduces overhead (Chen et al., 21 May 2025).
4. Schema Label Generation and Retrieval Integration
To address the limitations of static table headers and enable semantic search, LeRobot-compatible schemas leverage schema label generation models (Chen et al., 2020). These models automatically enrich dataset tables by generating additional schema labels derived from latent interaction and co-occurrence statistics. Specifically, matrix factorization decomposes the dataset-column label matrix, while schema label embedding (using Shifted Positive PMI and factorization) captures contextual associations. The two factorizations are optimized jointly (Equation 4 of Chen et al., 2020).
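A schematic version of such a joint objective, assuming it sums the two reconstruction losses with a shared label-embedding matrix and a standard Frobenius regularizer (the notation is illustrative; the exact form is Equation 4 of Chen et al., 2020):

```latex
% A: dataset-column label matrix;  M: SPPMI matrix of label co-occurrences;
% W: shared schema label embeddings;  H, C: dataset and context factors;
% \alpha, \lambda: trade-off and regularization weights.
\begin{equation}
\min_{W,\,H,\,C}\;
\left\| A - W H^{\top} \right\|_F^2
\;+\; \alpha \left\| M - W C^{\top} \right\|_F^2
\;+\; \lambda \left( \|W\|_F^2 + \|H\|_F^2 + \|C\|_F^2 \right)
\end{equation}
```

Sharing W across both terms is what lets co-occurrence context inform the labels predicted for each column.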
The top schema labels predicted for each column via a multi-label classifier (e.g., Random Forest) supplement the metadata. Retrieval models use a mixed-ranking system: relevance is computed over metadata (BM25), table content, and generated schema labels (scored using fastText embeddings and negative Word Mover's Distance). This enables improved semantic matching, recall, and retrieval precision for robotics datasets where query vocabulary may not align with technical header descriptions.
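The mixed-ranking idea reduces to a weighted combination over the three signals named above; the scoring functions and weights here are placeholders for whatever BM25 and embedding stack is in use:

```python
def mixed_score(field_scores, weights):
    """Linear mixture over the three ranking signals described above:
    metadata (BM25), table content, and generated schema labels
    (fastText embeddings + negative Word Mover's Distance).
    Both the individual scores and the weights are placeholders."""
    assert set(field_scores) == set(weights), "signals must match weights"
    return sum(weights[f] * field_scores[f] for f in field_scores)

score = mixed_score(
    {"metadata_bm25": 7.2, "content": 3.1, "schema_labels": 4.5},
    {"metadata_bm25": 0.5, "content": 0.2, "schema_labels": 0.3},
)
print(round(score, 2))  # 5.57
```

The schema-label term is the one that rescues queries whose vocabulary ("arm camera footage") never appears in the technical column headers.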
5. Dynamic Schema Adaptability
Conventional knowledge graph construction (KGC) methods are static, but LeRobot schemas require adaptability due to the rapidly evolving nature of robotic ontologies and operational environments (Ye et al., 2023). Dynamic schema frameworks employ mechanisms such as schema-enriched prefix instructors—where the schema is represented as a linearized text prefix, embedded and prepended to the input—and schema-conditioned dynamic decoding using trie-based constraints informed by the current schema.
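The trie-based constraint can be sketched as follows; whitespace tokenization and the label set are simplifications of what a real schema-conditioned decoder (operating over subword vocabularies) would use:

```python
class SchemaTrie:
    """Trie over tokenized schema labels, used to constrain decoding so
    the model can only emit types present in the *current* schema.
    Whitespace tokenization here is a simplification."""

    def __init__(self, labels):
        self.root = {}
        for label in labels:
            node = self.root
            for tok in label.split():
                node = node.setdefault(tok, {})
            node["<end>"] = {}

    def allowed_next(self, prefix_tokens):
        """Tokens permitted after a partial output; [] if off-schema."""
        node = self.root
        for tok in prefix_tokens:
            if tok not in node:
                return []
            node = node[tok]
        return list(node)

# Hypothetical action/event types; a schema update only rebuilds the trie.
trie = SchemaTrie(["pick up", "place on", "push"])
print(trie.allowed_next([]))        # ['pick', 'place', 'push']
print(trie.allowed_next(["pick"]))  # ['up']
```

Because the constraint is rebuilt from the current schema at decode time, newly added classes become emittable immediately, without retraining, which is the adaptability property the section describes.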
Benchmark creation uses horizontal (adding classes at the same ontology level), vertical (introducing child nodes or subclasses), and hybrid expansion strategies to simulate schema evolution. Metrics such as F1-score are monitored across expansion iterations. AdaKGC, a baseline model with these schema-aware modules, demonstrates improved event extraction performance and robust adaptation to newly introduced types, while static or pre-trained methods exhibit performance degradation over schema updates. For LeRobot, this implies practical means to incorporate new sensor types, actions, or environmental descriptors without retraining from scratch—critically supporting the continual learning paradigm in real-world robotics.
6. Model Benchmarking and Integration with Planning Systems
Benchmarking of scene-graph generation (SGG) models on LeRobot’s spatially annotated datasets indicates distinct trade-offs between inference speed and relational recall (R@K) and mean recall (mR@K). Motif Predictor and Transformer Predictor provide balanced accuracy and real-time performance (e.g., Motif Predictor: latency ~24.9 ms/image, R@100 ≈ 0.4856), whereas VCTree achieves the highest mean recall (mR@100 ≈ 0.4909) at the cost of higher compute latency (~92.5 ms/image) (Wang et al., 14 Jun 2025).
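The R@K metric used in these comparisons can be computed as a simple set intersection over confidence-ranked triplets (the triplet values below are made up for illustration):

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """R@K for scene-graph generation: the fraction of ground-truth
    subject-predicate-object triplets recovered among the top-K
    predictions, which are assumed sorted by confidence."""
    top_k = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

gt = [("bottle", "on", "table"), ("book", "near", "remote")]
preds = [("bottle", "on", "table"),          # rank 1
         ("cup", "under", "table"),          # rank 2 (spurious)
         ("book", "near", "remote")]         # rank 3
print(recall_at_k(preds, gt, k=2))  # 0.5
```

mR@K averages this quantity per predicate class before taking the mean, which is why it better reflects performance on rare spatial relations under label imbalance.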
Integration of explicit spatial relationship information from SGG models into foundation models (e.g., ChatGPT 4o) leads to substantial improvements in robotic task planning, enabling the generation of executable, spatially-aware plans and resolution of ambiguities in object selection and manipulation tasks. This demonstrates the critical role of spatially annotated schema in bridging vision, language, and action domains for robotics research.
7. Applications and Accessibility
LeRobot schema supports a broad range of robotics research: manipulation, navigation, personalized assistance, and environment modeling in cluttered, real-world settings. It enables curriculum learning strategies and model refinement informed by the label-imbalance and annotation-noise trends observed during benchmarking. Public availability of datasets and tools (https://github.com/PengPaulWang/SpatialAwareRobotDataset; https://github.com/harvey-ph/SGDET-Annotate) ensures reproducibility and extensibility across the robotics and machine learning communities.
A plausible implication is that continued research will focus on further semantic enrichment of schema labels, dynamic KGC integration, and improved scene graph architectures tailored for robotics applications—underlining the significance of schema design choices in large-scale robot dataset management and learning systems.