NL2SpaTiaL Dataset Overview

Updated 23 December 2025
  • NL2SpaTiaL is a benchmark suite that aligns natural language instructions with structured SpaTiaL formulas for spatial and spatio-temporal reasoning.
  • It comprises datasets targeting robotic manipulation, virtual reality verb learning, and clinical spatial extraction through annotated, hierarchical mappings.
  • Its rigorous semantic verification and compositional annotation pipeline enable precise NL-to-spatial logic translation for embodied AI and healthcare applications.

The NL2SpaTiaL dataset designates a family of benchmarks, corpora, and frameworks for aligning natural language with formal or structured representations of spatial (and spatio-temporal) relations. Notably, the term "NL2SpaTiaL" has been used for: (1) a large-scale dataset for grounding geometric Spatio-Temporal Logic (SpaTiaL) formulas from natural language aimed at robotic manipulation (Luo et al., 15 Dec 2025); (2) a multimodal corpus for verb learning in virtual reality environments (Ebert et al., 2020); and (3) a schema for extracting spatial frames from clinical (especially ophthalmology) text (Datta et al., 2023). While each instantiation shares the NL→spatial formalism perspective, they differ in logic expressivity, domain focus, and representation granularity.

1. Formalism and Dataset Construction: SpaTiaL Logic Benchmark

The NL2SpaTiaL dataset of (Luo et al., 15 Dec 2025) is constructed to facilitate research in translating natural-language instructions into geometric spatio-temporal logic specifications for robotic task planning and manipulation. The benchmark integrates a pipeline that:

  • Synthesizes hierarchical SpaTiaL formulas as rooted operator trees (depth 2–4, branching up to 3), sampling from an expressive grammar comprising temporal operators (G, F, U), Boolean connectives (¬, ∧, ∨), and atomic geometric predicates (e.g., Touch, closeTo, LeftOf, Above, Between).
  • Instantiates each leaf with a scene-agnostic spatial atom, parameterized by object indices and (where appropriate) thresholds (e.g., closeTo(i, j; ε_c)).
  • Renders every formula node into compositional canonical English via deterministic, invertible back-translation. This mapping is strictly bijective: the renderer τ maps each formula φ to τ(φ), and τ⁻¹ reconstructs φ.
  • Aligns hierarchically: each root-level NL–SpaTiaL pair is richly annotated at every subformula node, supporting decompositional and compositional learning. (A minimal sketch of the renderer follows this list.)
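
To make the back-translation concrete, here is a minimal sketch of a deterministic renderer in the spirit of τ. The Node class, template strings, and operator subset are illustrative assumptions, not the dataset's actual implementation; because each operator owns exactly one fixed English template, the rendering can be inverted by matching templates back to operators.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    op: str                                           # operator or atom name, e.g. "F", "and", "closeTo"
    args: List["Node"] = field(default_factory=list)  # child subformulas (empty for atoms)
    params: Tuple = ()                                # interval bounds, object ids, thresholds

# One fixed template per operator keeps tau deterministic and invertible.
TEMPLATES = {
    "F":       "eventually within [{0},{1}], {sub0}",
    "G":       "throughout [{0},{1}], {sub0}",
    "U":       "{sub0} until, within [{0},{1}], {sub1}",
    "and":     "{sub0}, and {sub1}",
    "or":      "{sub0}, or {sub1}",
    "not":     "it is not the case that {sub0}",
    "closeTo": "object {0} is within {2} of object {1}",
}

def tau(node: Node) -> str:
    """Render a formula node into compositional canonical English."""
    subs = {f"sub{i}": tau(child) for i, child in enumerate(node.args)}
    return TEMPLATES[node.op].format(*node.params, **subs)

# Example: F_[0,10]( closeTo(1, 2; 0.05) )
phi = Node("F", [Node("closeTo", params=(1, 2, 0.05))], params=(0, 10))
print(tau(phi))  # eventually within [0,10], object 1 is within 0.05 of object 2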

The translation-verification framework further utilizes a language-model-based semantic checker to confirm that every operator, object mention, and numerical constraint in the formula is justified by explicit text. Candidate formulations that fail semantic alignment are rejected and regenerated at the node level.
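
The checker itself is a language model, but its contract is easy to pin down. Below is a deliberately cheap lexical proxy, reusing Node, tau, and phi from the sketch above: every object index and numeric threshold in a subtree must be grounded verbatim in the phrase. In the actual pipeline a failing node is resampled and re-rendered rather than patched.

def mentions(node: Node):
    """Yield every parameter (object ids, bounds, thresholds) in the subtree."""
    yield from (str(p) for p in node.params)
    for child in node.args:
        yield from mentions(child)

def semantic_check(node: Node, phrase: str) -> bool:
    """Lexical stand-in for the LM judge: all parameters must appear in the text."""
    return all(m in phrase for m in mentions(node))

assert semantic_check(phi, tau(phi))  # tau grounds every operator parameter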

Pseudo-BNF for Formula Generation

Φ ::= μ | ¬Φ | Φ ∧ Φ | Φ ∨ Φ | F_[a,b] Φ | G_[a,b] Φ | Φ U_[a,b] Φ
μ ::= Touch(i,j) | closeTo(i,j; ε_c) | farFrom(i,j; ε_f) | ovlp(i,j; τ) | enclIn(i,j; ρ)
      | LeftOf(i,j; κ) | Above(i,j; κ) | Between_px(a,b,c) | oriented(i,j; κ)
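
A sampler for this grammar is a few lines of recursion. The sketch below (reusing Node from the earlier sketch) respects the stated depth 2–4 and branching up to 3; the uniform operator weights, threshold ranges, and two-object atom signature are simplifying assumptions.

import random

ATOMS = ["Touch", "closeTo", "farFrom", "ovlp", "enclIn",
         "LeftOf", "Above", "Between_px", "oriented"]
UNARY = ["not", "F", "G"]
NARY = ["and", "or", "U"]

def sample_formula(depth: int, n_objects: int = 5) -> Node:
    if depth == 0:  # leaf: a scene-agnostic spatial atom
        i, j = random.sample(range(n_objects), 2)
        return Node(random.choice(ATOMS), params=(i, j, round(random.uniform(0.01, 0.5), 2)))
    op = random.choice(UNARY + NARY)
    n_kids = 1 if op in UNARY else (2 if op == "U" else random.randint(2, 3))
    a = random.randint(0, 10)
    bounds = (a, a + random.randint(1, 20)) if op in ("F", "G", "U") else ()
    return Node(op, [sample_formula(depth - 1, n_objects) for _ in range(n_kids)], bounds)

root = sample_formula(depth=random.randint(2, 4))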

2. Dataset Coverage and Supervision Granularity

The core dataset consists of ≈10,000 fully annotated root-level NL–SpaTiaL pairs, each decomposed into 3–5 logic layers. Every node is annotated with its corresponding canonical English expression, and typically multiple paraphrases, yielding supervised (subformula, phrase) pairs suitable for both root-to-leaf and local learning objectives. Standard splits use 80% for training, 10% for development, and 10% for testing; all decomposed subformula pairs inherit these splits.
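
In code, the decomposition into local supervision is a single tree walk: pair every node with its rendering (and, in the dataset, its paraphrases), inheriting the root's split. A sketch, continuing the Node/tau example from Section 1:

def supervised_pairs(node: Node):
    """Yield (subformula, phrase) pairs for a node and all of its descendants,
    supporting both root-to-leaf and purely local learning objectives."""
    yield node, tau(node)
    for child in node.args:
        yield from supervised_pairs(child)

pairs = list(supervised_pairs(phi))  # root pair plus one pair per subformula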

Table: Operator and Predicate Distributions

| Logic Operator  | Proportion (%) | Spatial Predicate    | Proportion (%) |
|-----------------|----------------|----------------------|----------------|
| F (Eventually)  | 28             | closeTo / farFrom    | 20             |
| G (Always)      | 26             | Touch                | 12             |
| U (Until)       | 18             | ovlp / enclIn        | 18             |
| ¬ (Negation)    | 12             | LeftOf / Above / ... | 25             |
| ∧ / ∨ (And/Or)  | 16             | Between              | 8              |
|                 |                | oriented             | 5              |

Paraphrase augmentation increases linguistic variability for robust semantic parsing.

3. Logic Semantics and Quantitative Grounding

SpaTiaL logic in NL2SpaTiaL builds directly on metric and symbolic reasoning about spatial and temporal relations. Key features:

  • Temporal operators are interval-bounded quantifiers over time: G_[a,b] φ (globally in [a,b]), F_[a,b] φ (eventually in [a,b]), and φ U_[a,b] ψ (until).
  • Geometric predicates are quantitatively parameterized, e.g., closeTo(i,j): ε_c − ‖p_i − p_j‖, farFrom(i,j): ‖p_i − p_j‖ − ε_f, LeftOf(i,j; κ), etc.
  • Robustness composes via min/max over these margins, supporting end-to-end reward shaping in robotics (see the sketch after this list).
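
These margins compose mechanically. The sketch below implements the quantitative semantics for a subset of operators under simplifying assumptions (discrete time steps, positions indexed per object, U omitted); positive robustness means the specification holds with margin.

import numpy as np

def rho(node: Node, traj, t: int = 0) -> float:
    """Robustness of `node` on `traj`, where traj[k][i] is object i's position
    at step k. Positive = satisfied with margin, negative = violated."""
    op, ps = node.op, node.params
    if op == "closeTo":
        i, j, eps = ps
        return eps - float(np.linalg.norm(traj[t][i] - traj[t][j]))
    if op == "farFrom":
        i, j, eps = ps
        return float(np.linalg.norm(traj[t][i] - traj[t][j])) - eps
    if op == "Above":
        i, j, kappa = ps
        return float(traj[t][i][1] - traj[t][j][1]) - kappa  # vertical margin
    if op == "not":
        return -rho(node.args[0], traj, t)
    if op == "and":
        return min(rho(c, traj, t) for c in node.args)
    if op == "or":
        return max(rho(c, traj, t) for c in node.args)
    a, b = ps
    window = range(t + a, min(t + b + 1, len(traj)))
    if op == "F":  # best step in the window
        return max(rho(node.args[0], traj, k) for k in window)
    if op == "G":  # worst step in the window
        return min(rho(node.args[0], traj, k) for k in window)
    raise NotImplementedError(op)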

Composite instructions can layer multiple spatial and temporal constraints:

  1. "Within 10 s, place obj_1 inside reg_A and keep it above obj_2."

F[0,10](enclIn(obj1,regA)Above(obj1,obj2;κ))F_{[0,10]}(enclIn(obj_1,reg_A) \wedge Above(obj_1,obj_2;\kappa))

  1. "Maintain obj_3 far from obj_4 until it touches obj_5, then ensure it is not close to obj_4 after 5 s."

(farFrom(obj3,obj4)U[0,20]Touch(obj3,obj5))F[5,20]¬closeTo(obj3,obj4)(farFrom(obj_3,obj_4) U_{[0,20]} Touch(obj_3,obj_5)) \wedge F_{[5,20]} \neg closeTo(obj_3,obj_4)
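
As a worked check of example (1), the snippet below evaluates a formula of the same shape on a toy trajectory, using Node and rho from the sketches above; enclIn is simplified to closeTo against the region's center (object id 0), and all numbers are illustrative.

phi1 = Node("F", [Node("and", [
    Node("closeTo", params=(1, 0, 0.10)),  # obj_1 within 0.10 of reg_A's center
    Node("Above",   params=(1, 2, 0.05)),  # obj_1 at least 0.05 above obj_2
])], params=(0, 10))

# obj_1 slides toward the region center at (0, 0.2), staying above obj_2.
traj = [{0: np.array([0.0, 0.2]),
         1: np.array([0.3 - 0.03 * k, 0.2]),
         2: np.array([0.3 - 0.03 * k, 0.1])} for k in range(11)]
print(rho(phi1, traj))  # ~= 0.05 > 0: satisfied, with the Above margin binding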

4. Evaluation: Manipulation and Rearrangement Tasks

NL2SpaTiaL was benchmarked via task suites in (a) the ReKep environment [Huang et al. 2024], covering pen manipulation, teapot pouring, lid placement, and box reorientation; and (b) simulated PyBullet rearrangement tasks. Using SpaTiaL formulas as control specifications:

  • Pen Manipulation: ρ ≈ +0.12 (partial success, 2/3 subgoals).
  • Lid-to-Teapot: ρ ≈ −0.53 (failure).
  • Cup Alignment: ρ ≈ −0.36.

Averaged task robustness metrics (ρ) are detailed in Tables I and II of (Luo et al., 15 Dec 2025).

Action verification with SpaTiaL-based semantic rollouts resulted in a ≈15% improvement in downstream success over pure vision-language-action baselines. Failure analyses identified hierarchical errors where nested until scopes overlapped, consistently flagged by the dataset’s semantic checker.
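
The verification step can be framed as robustness-guided plan selection. A sketch under assumptions: propose_actions and simulate are hypothetical stand-ins for a vision-language-action policy and a physics rollout, and rho is the quantitative semantics sketched in Section 3; this shows the general pattern, not the paper's exact procedure.

def verified_plan(spec: Node, scene, n_candidates: int = 8):
    """Keep the candidate whose simulated rollout maximizes robustness,
    and reject outright if even the best rollout violates the spec."""
    best_plan, best_score = None, float("-inf")
    for plan in propose_actions(scene, n_candidates):  # hypothetical VLA proposer
        traj = simulate(scene, plan)                   # hypothetical physics rollout
        score = rho(spec, traj)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan if best_score > 0 else None       # None = re-plan or abstain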

5. Data Format, Organization, and Accessibility

Each dataset instance is stored as a JSON object:

{
  "formula": "<root-level SpaTiaL string>",
  "canonical": "<tau(formula)>",
  "paraphrases": ["...","..."],
  "nodes": [
    {"id":"v1", "subformula":"G_[0,20](…)", "canon":"Throughout [0,20], …", "paraphrases":[], "span":[start,end]},
    ...
  ]
}

The dataset is organized into /data/train, /data/dev, and /data/test directories, with object identities and parameters provided in a scene-agnostic registry where required. It is available at https://sites.google.com/view/nl2spatial under a CC-BY-4.0 license.
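
Reading a split then reduces to iterating JSON files and checking the schema above. A minimal sketch, assuming one instance per *.json file (the file naming is an assumption, not documented here):

import json
import pathlib

def read_split(root: str, split: str):
    """Yield instances from e.g. data/train, validating the documented keys."""
    for path in sorted(pathlib.Path(root, split).glob("*.json")):
        inst = json.loads(path.read_text())
        assert {"formula", "canonical", "paraphrases", "nodes"} <= inst.keys()
        for node in inst["nodes"]:
            assert {"id", "subformula", "canon", "span"} <= node.keys()
        yield inst

train = list(read_split("data", "train"))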

6. Alternative Instantiations: Visuospatial and Clinical NL2SpaTiaL Resources

A. Verb Learning in VR Environments ("New Brown Corpus")

NL2SpaTiaL is also used as a label for a child-directed narration dataset with 18,000 word tokens, multimodal alignments (speech, video, frame-synchronous 3D trajectories), and exact ground-truth object state in VR kitchens (Ebert et al., 2020). The corpus enables studies of grounded verb acquisition, word-meaning induction, and multimodal modeling. Annotation is at the word and object level, but spatial relations are not symbolically labeled; instead, consumers of the corpus must derive spatial predicates from the raw 3D pose and bounding-box sequences, as sketched below.
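
Deriving symbolic predicates from the raw streams is straightforward in principle. A minimal sketch, assuming frame-synchronous (T, 3) position arrays per object, axis-aligned bounding boxes, and an illustrative 5 cm threshold:

import numpy as np

def close_to(pos_i: np.ndarray, pos_j: np.ndarray, eps: float = 0.05) -> np.ndarray:
    """Per-frame boolean closeTo signal from two aligned (T, 3) trajectories."""
    return np.linalg.norm(pos_i - pos_j, axis=1) < eps

def touching(box_i, box_j) -> bool:
    """Axis-aligned boxes, each a (min_xyz, max_xyz) pair: overlap or abut."""
    lo = np.maximum(box_i[0], box_j[0])
    hi = np.minimum(box_i[1], box_j[1])
    return bool(np.all(lo <= hi))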

B. Clinical SpaTiaL Information Extraction (Eye-SpatialNet)

In the Eye-SpatialNet schema, the NL2SpaTiaL concept underlies a corpus of 600 expert-annotated ophthalmology notes mapped to frame-semantic units based on spatial triggers ("in", "behind") (Datta et al., 2023). The core annotation connects spatial lexical units to their Figure (entity) and Ground (reference), with ophthalmology-specialized fields for directionality and impact. Automated frame extraction is achieved by a two-turn BERT-based QA pipeline, yielding F1 scores up to 89.3 for spatial triggers and 88.5 for Ground frame elements. This resource has broad utility for clinical IE and structured EHR coding.
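
The two-turn structure can be sketched with an off-the-shelf extractive QA model; the model choice and question wording below are assumptions, not the authors' setup. Turn one extracts a trigger span; turn two asks for a frame element conditioned on that trigger.

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
note = "Small hemorrhage in the inferior retina, behind the vitreous."

# Turn 1: locate a spatial trigger span in the note.
trigger = qa(question="Which word expresses a spatial relation?", context=note)
# Turn 2: extract the Ground (reference location) conditioned on the trigger.
ground = qa(question=f"What is the reference location for '{trigger['answer']}'?",
            context=note)
print(trigger["answer"], "->", ground["answer"])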

7. Significance, Current Limitations, and Outlook

NL2SpaTiaL datasets exemplify domain-agnostic frameworks for semantic parsing and formal grounding of spatially indexed language. The geometric SpaTiaL benchmark directly advances research on compositional task specification for embodied agents, prioritizing expressivity, compositional supervision, and soundness validation. Its structure facilitates hierarchical training, fine-grained error analysis, and formal verification in closed or open-world robotic manipulation.

Identified limitations include the lack of scene grounding (objects are identifiers, not percepts), the absence of dynamic discourse context, and restricted coverage of vocabulary and spatial primitives, though the architecture allows for principled extension. A plausible implication is that incorporating visual grounding or open-vocabulary NL anchoring would more fully bridge language and physical task execution.

Collectively, the NL2SpaTiaL corpora highlight the necessity of formal and quantitative representation in spatial language research, supporting the development and robust evaluation of instruction-following, spatial IE, and multimodal reasoning systems across scientific and applied domains (Luo et al., 15 Dec 2025, Ebert et al., 2020, Datta et al., 2023).
