
Open Spatial Dataset (OSD) Overview

Updated 19 November 2025
  • An OSD is a publicly accessible dataset featuring comprehensive spatial annotations across audio, visual, and 3D modalities for machine learning applications.
  • It employs standardized file formats and annotation schemas (e.g., JSON, GeoJSON) to ensure interoperability and reproducibility in spatial reasoning tasks.
  • OSDs serve as critical benchmarks for robotics, embodied AI, and audio-visual analysis by enabling detailed evaluation through metrics such as localization error, segmentation quality, and tracking accuracy.

An Open Spatial Dataset (OSD) is a publicly accessible collection of annotated data—audio, visual, or multi-modal—explicitly designed to facilitate machine learning and algorithmic research that requires spatial understanding of real or simulated environments. OSDs include, but are not limited to, comprehensive labels for object positioning, geometric relationships, spatial topology, and temporal tracking. They are structured for interoperability, open licensing, and reproducibility, and serve as critical benchmarks for spatial reasoning in robotics, embodied AI, and audio-visual scene analysis.

1. Definitional Scope and Hallmarks of OSDs

Key characteristics of an OSD include open accessibility, rich spatial annotation, standardized file formats, and explicit benchmark protocols for tasks involving spatial reasoning. Exemplars include datasets for spatial audio learning such as Spatial LibriSpeech (Sarabia et al., 2023), visually grounded datasets for robotic scene graph generation (Wang et al., 14 Jun 2025), and panoramic panoptic segmentation and tracking corpora in crowded environments (Le et al., 2 Apr 2024). Fundamental to an OSD is the explicit representation of locations, directions, and relationships in either 2D image coordinates, 3D world coordinates, or higher-order manifold spaces, often including time as well.

2. Data Structures, Formats, and Annotation Schemas

Common modalities in OSDs span audio (multichannel arrays, Ambisonics), 2D images, 3D point clouds, and multimodal sensor networks. Annotation schemas are tailored to each data type:

  • Spatial LibriSpeech encodes 19-channel spherical mic-array audio, first-order Ambisonics, and provides per-sample JSON metadata linking each utterance to 3D source positions (azimuth, elevation, distance), source directivity, detailed room geometry, acoustic material coefficients, and standardized acoustic parameters such as T30, DRR, C50, and EDT (Sarabia et al., 2023).
  • Spatial-Relationship-Aware Robotics datasets store per-image object bounding boxes, object attributes, and scene graphs in Visual Genome-style JSONs, with each spatial relation (e.g., "on," "left_of") defined via bounding box inequalities and overlap statistics; these are easily mapped into GeoJSON for GIS-style applications (Wang et al., 14 Jun 2025). A hypothetical record is sketched after this list.
  • JRDB-PanoTrack integrates 2D panoramic segmentations, multi-class panoptic masks, 3D LiDAR point clouds, and tracking IDs, with calibration and coordinate frame information, enabling projection between 2D and 3D domains and supporting both instance segmentation and long-term tracking (Le et al., 2 Apr 2024).
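
For concreteness, here is a minimal hypothetical record in the Visual Genome style referenced above; the field names (`objects`, `relationships`, `predicate`) follow common Visual Genome conventions, but the exact keys of any given OSD release may differ.

```python
# Hypothetical Visual Genome-style annotation record for one image.
# Field names are illustrative; consult the dataset's schema documentation.
annotation = {
    "image_id": 1042,
    "objects": [
        {"object_id": 0, "name": "mug",   "bbox": [312, 140, 80, 95]},   # [x, y, w, h]
        {"object_id": 1, "name": "table", "bbox": [100, 220, 520, 260]},
    ],
    "relationships": [
        # Spatial predicate derived from bounding-box inequalities (Section 3).
        {"subject_id": 0, "predicate": "on", "object_id": 1},
    ],
}
```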

A typical OSD will employ directory structures keyed by split and sample identifier (e.g., train-clean-100/speakerID/chapterID/utteranceID), and annotation granularity is set by task and modality—per waveform, per frame, or per object.
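
As a sketch of how such a layout is typically consumed, the snippet below assembles a sample path from split and identifier fields and reads per-sample JSON metadata. The directory layout mirrors the LibriSpeech-style example above; the sidecar filename and metadata field names are assumptions, not a published schema.

```python
import json
from pathlib import Path

def load_sample_metadata(root: str, split: str, speaker: str,
                         chapter: str, utterance: str) -> dict:
    """Load per-sample spatial metadata for one utterance.

    Assumes a LibriSpeech-style layout (split/speaker/chapter) with a
    hypothetical JSON sidecar per utterance; field names are illustrative.
    """
    meta_path = Path(root) / split / speaker / chapter / f"{utterance}.json"
    with open(meta_path) as f:
        meta = json.load(f)
    # Typical spatial fields (names assumed): source position and acoustics.
    return {
        "azimuth": meta["azimuth_deg"],
        "elevation": meta["elevation_deg"],
        "distance": meta["distance_m"],
        "t30": meta["t30_s"],
    }
```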

3. Mathematical Representation of Spatial Information

Spatial relationships and metrics are mathematically formalized at multiple levels:

  • Localization is typically measured in polar (azimuth/elevation) and radial (distance) coordinates, with errors assessed via geodesic angular distance:

\alpha = \cos^{-1}\left(\sin\varphi \sin\hat{\varphi} + \cos\varphi \cos\hat{\varphi} \cos(\theta - \hat{\theta})\right)

where $(\theta, \varphi)$ are the ground-truth and $(\hat{\theta}, \hat{\varphi})$ the estimated azimuth and elevation.
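
A direct implementation of this error is a thin wrapper around arccos; the sketch below assumes angles in radians and is a minimal illustration, not any dataset's reference evaluation code.

```python
import numpy as np

def angular_error(theta, phi, theta_hat, phi_hat):
    """Geodesic angular distance between the ground-truth direction
    (theta, phi) and the estimate (theta_hat, phi_hat), in radians.
    theta is azimuth, phi is elevation."""
    cos_alpha = (np.sin(phi) * np.sin(phi_hat)
                 + np.cos(phi) * np.cos(phi_hat) * np.cos(theta - theta_hat))
    # Clip to [-1, 1] to guard against floating-point overshoot.
    return np.arccos(np.clip(cos_alpha, -1.0, 1.0))
```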

  • Acoustic Metrics such as reverberation time (T30) and Direct-to-Reverberant Ratio (DRR) follow established formulas:

T_{30} = -60 \cdot \left( \sum_k p(k)\, \Delta t \right) / \ln(10^{-6}), \qquad \mathrm{DRR} = 10 \log_{10} \left[ \frac{\int_0^{t_d} h^2(\tau)\, d\tau}{\int_{t_d}^{\infty} h^2(\tau)\, d\tau} \right]
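
For illustration, the DRR term can be computed directly from a sampled impulse response; the sketch below assumes a direct-path boundary $t_d$ of 2.5 ms, a common convention but not necessarily the value used by any particular dataset.

```python
import numpy as np

def drr_db(h: np.ndarray, fs: int, t_d: float = 0.0025) -> float:
    """Direct-to-Reverberant Ratio in dB from impulse response h.

    h: sampled room impulse response, fs: sample rate in Hz,
    t_d: assumed boundary (s) between direct path and reverberant tail.
    """
    n_d = int(round(t_d * fs))
    direct = np.sum(h[:n_d] ** 2)        # energy up to t_d
    reverberant = np.sum(h[n_d:] ** 2)   # energy after t_d
    return 10.0 * np.log10(direct / reverberant)
```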

  • Spatial Predicates in robotic scene graphs are given by bounding box inequalities, e.g., $\mathrm{on}(A,B)$: $y_A^{\min} \geq y_B^{\max} - \varepsilon$ and $\mathrm{IoU}_x(b_A, b_B) \geq \tau_x$, as well as L2 distance thresholds for "near" relations (Wang et al., 14 Jun 2025).
  • Segmentation and Tracking Metrics employ the OSPA family, e.g.,

\mathrm{OSPA}_{PS}(X_c, Y_c) = \left( \frac{1}{n} \left[ \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c(x_i, y_{\pi(i)})^p + c^p \, |n - m| \right] \right)^{1/p}

with $d_c(x, y) = 1 - \mathrm{IoU}(x, y)$, where $m$ and $n$ are the smaller and larger set cardinalities, and $\mathrm{OSPA}^{(2)}$ for tracks as time-indexed mask sets (Le et al., 2 Apr 2024).
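
The following sketch evaluates this single-frame OSPA distance from a precomputed pairwise IoU matrix using the Hungarian algorithm; it illustrates the formula above and is not the benchmark's reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(iou: np.ndarray, c: float = 1.0, p: int = 1) -> float:
    """OSPA distance between two mask sets given their (m, n) IoU matrix.

    Base distance d_c(x, y) = min(1 - IoU(x, y), c); c is the cutoff
    penalizing cardinality mismatch, p the order.
    """
    m, n = iou.shape
    if max(m, n) == 0:
        return 0.0
    d = np.minimum(1.0 - iou, c)                 # cutoff base distance
    rows, cols = linear_sum_assignment(d ** p)   # optimal 1-to-1 matching
    cost = (d[rows, cols] ** p).sum() + c ** p * abs(n - m)
    return (cost / max(m, n)) ** (1.0 / p)
```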

4. Benchmarking, Task Protocols, and Model Evaluation

OSDs serve as canonical testbeds for a wide array of benchmark tasks:

  • Spatial LibriSpeech establishes evaluation for 3D localization (median error 6.60°, generalizing to 12.43° on TUT 2018), distance estimation (0.43 m), and acoustic parameter estimation (T30, DRR), with protocols matching training and test room diversity. Median and IQR statistics and Pearson correlation coefficients provide quantitative assessment (Sarabia et al., 2023).
  • Scene Graph Datasets for robotics (Wang et al., 14 Jun 2025) benchmark six state-of-the-art scene graph architectures (e.g., Transformer, Motif, VCTree) using relational recall (R@K, mR@100) and latency per image; a minimal R@K sketch follows this list. Statistical results are reported for predicate-specific recall (e.g., "on": 0.76–0.90, "in_front_of": 0.50–0.57), with strongest performance on spatial containment predicates.
  • JRDB-PanoTrack evaluates panoptic segmentation and tracking under both Closed-World (CW) and Open-World (OW) settings, using PQ, IDF1, STQ, fragmentation counts, and $\mathrm{OSPA}_{PS}$ for segmentation and $\mathrm{OSPA}^{(2)}_{PT}$ for long-term tracks. The OW protocol assesses generalization to long-tail and unseen classes; baseline PQ values are around 36.6% (CW) and 11% (OW) for segmentation, illustrating open-world difficulty (Le et al., 2 Apr 2024).
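
As an illustration of the relational recall metric (R@K) used above, the sketch below counts ground-truth triplets recovered among a model's top-K scored predictions for one image; it is a generic formulation rather than any benchmark's evaluation script.

```python
def recall_at_k(predictions, ground_truth, k: int = 100) -> float:
    """R@K for one image.

    predictions: iterable of (score, subject_id, predicate, object_id);
    ground_truth: set of (subject_id, predicate, object_id) triplets.
    """
    top_k = sorted(predictions, key=lambda t: t[0], reverse=True)[:k]
    hits = {triplet[1:] for triplet in top_k} & ground_truth
    return len(hits) / len(ground_truth) if ground_truth else 0.0
```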

Benchmark datasets encourage both direct task evaluation and cross-dataset generalization, with explicit reporting facilitating reproducibility and downstream model assessment.

5. Practical Applications and Use Cases

OSDs enable a comprehensive suite of downstream applications:

  • End-to-end spatial audio models (localization, dereverberation, source separation, beamforming), as enabled by simulated and richly labeled audio datasets (Sarabia et al., 2023).
  • Spatial reasoning for robotics, including executable task planning—scene graphs provide interpretable spatial context to large language-action models, substantially improving spatially grounded plan generation (e.g., ChatGPT 4o for pick-and-place with spatial precondition checking) (Wang et al., 14 Jun 2025); a predicate-checking sketch follows this list.
  • Panoptic mapping and navigation in built environments: 360° panoramic segmentation and tracking enhance scene understanding for navigation in crowded, dynamic scenarios; multi-modal sensor fusion with RGB-D and LiDAR aligns with real-world robotics deployment requirements (Le et al., 2 Apr 2024).
  • Contrastive and multi-task representation learning: The abundance of spatial cues and complex relational structure promotes contrastive approaches and robust multi-task objective design.
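
As a sketch of how the bounding-box predicate semantics of Section 3 can serve spatial precondition checking, the function below tests a hypothetical on(A, B) relation; the tolerance eps and overlap threshold tau_x are illustrative values, and a y-up coordinate convention is assumed.

```python
def on(box_a, box_b, eps: float = 5.0, tau_x: float = 0.5) -> bool:
    """Check a hypothetical on(A, B) spatial predicate.

    Boxes are (x_min, y_min, x_max, y_max) with the y-axis pointing up,
    so y_min is the bottom edge. eps and tau_x are assumed thresholds.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Vertical condition: A's bottom rests at or above B's top.
    rests_on_top = ay0 >= by1 - eps
    # Horizontal condition: sufficient 1-D IoU along the x-axis.
    inter = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    union = (ax1 - ax0) + (bx1 - bx0) - inter
    iou_x = inter / union if union > 0 else 0.0
    return rests_on_top and iou_x >= tau_x
```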

6. Access, Licensing, and Interoperability

OSDs such as those described are typically released under Creative Commons licenses (CC BY 4.0), supporting both academic and commercial use with attribution. Distribution is facilitated via public repositories (GitHub, AWS S3), with clear directory and metadata documentation, versioned tools for annotation and conversion (e.g., to GeoJSON), and explicit dependencies (Python libraries, data loaders). Standardization is prioritized by aligning annotation formats with established schemas (Visual Genome, JSON-LD, GeoJSON) and by providing calibration and coordinate-mapping scripts for unambiguous spatial reference (Sarabia et al., 2023, Wang et al., 14 Jun 2025).

For integration across catalogs, recommendations include: adoption of unified ontologies (e.g., schema.org/Object), inclusion of geo-referencing fields (robot pose, world coordinates), and maintenance of continuous-integration pipelines to validate new spatial annotations (Wang et al., 14 Jun 2025).
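
To illustrate the GeoJSON mapping mentioned above, the snippet below wraps an image-space bounding box in a GeoJSON Polygon Feature; leaving coordinates in pixel space is a simplification, and a real conversion would first apply the dataset's calibration to obtain world or geographic coordinates.

```python
import json

def bbox_to_geojson_feature(obj_id: int, name: str, bbox) -> dict:
    """Wrap a bounding box (x_min, y_min, x_max, y_max) as a GeoJSON
    Polygon Feature. Coordinates stay in image space for brevity."""
    x0, y0, x1, y1 = bbox
    ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]  # closed ring
    return {
        "type": "Feature",
        "id": obj_id,
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }

print(json.dumps(bbox_to_geojson_feature(0, "mug", (312, 140, 392, 235))))
```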

7. Limitations and Prospects for Extension

Limitations stem from simulation-vs-reality gaps (e.g., simulated rooms in audio datasets lacking real-world acoustic complexity), single static sources per utterance, and restricted sensor/platform diversity (single robotic platform, viewpoint bias) (Sarabia et al., 2023, Le et al., 2 Apr 2024). Pseudo-labeled 3D annotations inherit 2D projection errors and may lack complete visibility due to camera occlusions.

Prospective extensions include:

  • Incorporation of real-room or hybrid real+simulated measurements,
  • Expansion to dynamic, multi-source, and dense object scenarios,
  • Broader support for alternate sensor geometries (higher-order Ambisonics, diverse robot platforms),
  • Introduction of open-world 3D tracking benchmarks with raw ground-truth labels,
  • Community contribution via open-source tooling, Dockerization, and rigorous release versioning (Sarabia et al., 2023, Wang et al., 14 Jun 2025, Le et al., 2 Apr 2024).

A plausible implication is that as OSDs continue to grow in complexity and scale, standardized schemas, robust integration tools, and rich multi-modal annotation will be central to enabling next-generation spatial reasoning research in embodied systems.
