GATE-VLAP Dataset for Vision-Language-Action

Updated 19 December 2025
  • GATE-VLAP Dataset is a validated corpus of 2,124 atomic action segments with planner alignment and STRIPS-style annotations for robotic manipulation tasks.
  • Segmentation is achieved using the Atomic Action Slicing framework with Gemini models, ensuring precise temporal boundaries and robust confidence scoring.
  • The dataset enhances hierarchical policy learning and symbolic planning, yielding up to 5 percentage point improvements in long-horizon task success rates.

The GATE-VLAP Dataset is a validated corpus of 2,124 planner-aligned atomic action segments designed to facilitate advances in vision-language-action (VLA) agents, especially in settings requiring complex skill composition. Released alongside the Atomic Action Slicing (AAS) framework by Tabakov et al., GATE-VLAP targets long-horizon robotic manipulation tasks by providing segmented, typed, and temporally localized action annotations derived from the LIBERO-Goal and LIBERO-Long demonstration suites. Each segment is rigorously labeled with an action type from a fixed taxonomy, STRIPS-like logical pre/postconditions, temporal boundaries, and a confidence score, forming a bridge between planner interface requirements and policy learning in modern VLA architectures (Tabakov et al., 12 Dec 2025).

1. Dataset Composition and Action Taxonomy

GATE-VLAP consists of 2,124 atomic action segments extracted from 825 demonstrations: 758 segments from LIBERO-Goal (434 demos) and 1,366 from LIBERO-Long (391 demos). The atomic action types are distributed as follows: place/put (24%), grasp/pick (22%), open drawer/door (18%), close (12%), move/push (10%), and other actions such as pour or stack (14%). Segment counts per action type are moderately imbalanced (coefficient of variation ≈ 0.65), with the most frequent type (“place”) appearing in ~510 segments and the rarest (“pour”) in ~30.

Atomic actions adhere to a discrete STRIPS-style schema, with each segment record containing:

  • Action Symbol: e.g., place_bowl_in_drawer(bowl, drawer)
  • Arguments: slot-specific, e.g., (object, container)
  • Preconditions / Postconditions: in BDDL/PDDL logic
  • Temporal Bounds: start_frame, end_frame, span durations
  • Confidence Score: c ∈ [0, 1]

Representative pre/postcondition structures are:

  • open_drawer(object, drawer):
    • Pre: clear(drawer), closed(drawer)
    • Post: open(drawer)
    • Typical Span: 20–60 frames
  • grasp(object):
    • Pre: reachable(object), ¬grasped(object)
    • Post: grasped(object)
    • Span: 15–45 frames
  • place(object, target):
    • Pre: grasped(object), clear(target)
    • Post: in(object, target), ¬grasped(object)
    • Span: 30–120 frames

The action taxonomy and symbolic labeling provide explicit grounding for downstream planners and hierarchical policy learning.
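
To make the schema concrete, the following is a minimal Python sketch of how a segment record and its STRIPS-style precondition check might be represented; the dataclass, the preconditions_hold helper, and the example symbols are illustrative assumptions rather than tooling shipped with the dataset.

from dataclasses import dataclass

@dataclass
class AtomicSegment:
    # Illustrative container mirroring a GATE-VLAP segment record (field names assumed).
    action: str            # e.g. "grasp(bowl)"
    pre: list              # STRIPS-style preconditions, e.g. ["reachable(bowl)", "¬grasped(bowl)"]
    post: list             # postconditions, e.g. ["grasped(bowl)"]
    start_frame: int
    end_frame: int
    confidence: float      # c in [0, 1]

def preconditions_hold(segment, state):
    # A precondition holds if every positive literal is in the symbolic state
    # and every negated literal ("¬...") is absent from it.
    for literal in segment.pre:
        if literal.startswith("¬"):
            if literal[1:] in state:
                return False
        elif literal not in state:
            return False
    return True

seg = AtomicSegment(action="grasp(bowl)",
                    pre=["reachable(bowl)", "¬grasped(bowl)"],
                    post=["grasped(bowl)"],
                    start_frame=14, end_frame=48, confidence=0.86)
print(preconditions_hold(seg, {"reachable(bowl)", "clear(table)"}))  # True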

2. Segmentation, Confidence Labeling, and Validation Protocols

Segmentation of demonstrations into atomic actions utilizes AAS with Gemini 2.5 (Flash and Pro) models, ensuring alignment with planner-generated action sequences. Each candidate segment is subjected to a threefold validation:

  • Count: The number of predicted segments must match the number of planner steps: |predicted_segments| = |planner_steps|
  • Order: Predicted label sequence matches planner-derived sequence.
  • Duration: For each segment k, the duration d_k = end_k - start_k + 1 must satisfy d_min(o_k) ≤ d_k ≤ d_max(o_k) for that action class.
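
A compact reimplementation of these three checks might look as follows; the data structures (label/start/end tuples and a per-class duration table) are assumptions for illustration, not the released validation code.

def validate_segments(predicted, planner_steps, duration_bounds):
    # predicted: list of (label, start_frame, end_frame) tuples
    # planner_steps: list of planner-derived action labels
    # duration_bounds: dict mapping label -> (d_min, d_max) in frames
    # Count: one predicted segment per planner step.
    if len(predicted) != len(planner_steps):
        return False
    # Order: predicted label sequence must match the planner-derived sequence.
    if [label for label, _, _ in predicted] != list(planner_steps):
        return False
    # Duration: each span must fall within its class-specific bounds.
    for label, start, end in predicted:
        d_min, d_max = duration_bounds[label]
        if not (d_min <= end - start + 1 <= d_max):
            return False
    return True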

Following this validation, the per-segment confidence score is assigned:

c = α · s_model + β · (slack_duration / allowed_range) + γ · IoU_jitter

with

  • s_model: VLM boundary confidence
  • slack_duration: fit of the predicted span within the class duration range
  • IoU_jitter: agreement between the predicted interval and the plan interval under ±2-frame shifts

Recommended weights: α = 0.5, β = 0.3, γ = 0.2.
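
As a worked example, the score with the recommended weights can be computed as sketched below; the slack and jitter terms here use simple illustrative definitions, which may differ from the paper's exact formulations.

def confidence_score(s_model, d, d_min, d_max, iou_jitter,
                     alpha=0.5, beta=0.3, gamma=0.2):
    # Normalise the duration slack (distance from the nearest class bound)
    # by the allowed range; out-of-range spans get zero slack credit.
    allowed_range = max(d_max - d_min, 1)
    slack_duration = min(d - d_min, d_max - d) if d_min <= d <= d_max else 0
    return alpha * s_model + beta * (slack_duration / allowed_range) + gamma * iou_jitter

# Example: a 30-frame grasp segment with class bounds [15, 45] frames.
print(round(confidence_score(s_model=0.9, d=30, d_min=15, d_max=45, iou_jitter=0.92), 3))  # 0.784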

Key metrics (100 held-out demos, Gemini 2.5 Pro model):

Metric                                    Value (Pro)
Segmentation Success Rate                 93%
Avg. Segments per Demo                    ≈ 3.4
Mean Kendall's W (Label Concordance)      0.9136
Avg. Segment IoU (±2-frame jitter)        0.92 (Stability@Jitter)
Mean Frame IoU vs. Planner                0.88
MAE (Start/End)                           ≈ ±3 frames

Statistical significance is established by 95% bootstrap confidence intervals and nonparametric Wilcoxon tests (p < 0.01, confirming Pro > Flash performance).
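
The reported significance protocol can be approximated along the following lines; the per-demo metric arrays below are placeholders, and the percentile bootstrap is an assumption about how the intervals might be computed.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
pro_iou = rng.normal(0.88, 0.05, size=100)    # placeholder per-demo frame IoU (Pro)
flash_iou = rng.normal(0.84, 0.06, size=100)  # placeholder per-demo frame IoU (Flash)

# One-sided paired Wilcoxon test: does Pro exceed Flash on the same demos?
stat, p = wilcoxon(pro_iou, flash_iou, alternative="greater")

# 95% percentile bootstrap confidence interval for the mean Pro IoU.
boot_means = [rng.choice(pro_iou, size=pro_iou.size, replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Wilcoxon p = {p:.4f}, 95% CI for mean IoU: [{ci_low:.3f}, {ci_high:.3f}]")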

3. Data Modalities, Format, and Access

GATE-VLAP provides multimodal data:

  • RGB Video: .jpg frames at 10 fps
  • Robot State Logs: per-frame joint angles, gripper width (.json)
  • Symbolic Scene Descriptions: BDDL format (.json)
  • Atomic-Action Annotations: per-episode JSON arrays

The dataset directory is hierarchically organized into frames/ (LIBERO suites/episodes), states/, scene_desc/, annotations/, and a splits.json file indicating train/validation/test splits. Each annotation record includes:

{
  "episode_id": "libero_goal/0001",
  "segments": [
    {
      "action": "grasp(bowl)",
      "start_frame": 14, "end_frame": 48,
      "pre": ["¬grasped(bowl)", "clear(table)"],
      "post": ["grasped(bowl)"],
      "confidence": 0.86
    }
    ...
  ]
}
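
For workflows that read the annotation files directly, a minimal sketch is shown below; it assumes per-episode JSON files laid out under annotations/ as described above, with the exact file naming being an assumption.

import json
from pathlib import Path

def load_high_confidence_segments(annotation_dir, threshold=0.8):
    # Collect segments whose confidence score meets the threshold,
    # tagging each with its episode_id for traceability.
    segments = []
    for path in Path(annotation_dir).glob("**/*.json"):
        record = json.loads(path.read_text())
        for seg in record.get("segments", []):
            if seg["confidence"] >= threshold:
                segments.append({"episode_id": record["episode_id"], **seg})
    return segments

high_conf = load_high_confidence_segments("annotations", threshold=0.8)
print(len(high_conf), "segments above the confidence threshold")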

Minimal usage can be demonstrated as:

from datasets import load_dataset
from PIL import Image
from torch.utils.data import DataLoader

# Load the GATE-VLAP dataset from the Hugging Face Hub.
ds = load_dataset("gate-institute/GATE-VLAP-datasets")

def collate_fn(batch):
    # For each episode, gather the frames and action label of its first segment.
    images, texts = [], []
    for ex in batch:
        seg = ex["segments"][0]
        # Frames are addressed by episode ID and zero-padded frame index.
        fnames = [f"{ex['episode_id']}/frames/{i:04d}.jpg"
                  for i in range(seg["start_frame"], seg["end_frame"] + 1)]
        frames = [Image.open(fpath) for fpath in fnames]
        images.append(frames)
        texts.append(seg["action"])
    return {"images": images, "texts": texts}

train_loader = DataLoader(ds["train"], batch_size=8, shuffle=True, collate_fn=collate_fn)

The full dataset is publicly hosted at https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets.

4. Benchmarking and Downstream Performance

Fine-tuning the CLIP-RT+ model on GATE-VLAP’s 2,124 atomic segments produces the following task success rates:

Task Suite      Baseline CLIP-RT+    CLIP-RT+ + GATE-VLAP
LIBERO-Goal     94.2%                95.3% (+1.1 pp)
LIBERO-Long     83.8%                88.8% (+5.0 pp)

These improvements indicate that atomic segments have a modest effect on short-horizon tasks (reinforcing individual operators) and a markedly positive impact (+5 pp) on long-horizon, compositional scenarios where error propagation is otherwise acute (Tabakov et al., 12 Dec 2025). A plausible implication is that atomic-level demonstrations enhance VLA agents’ ability to combine skills for novel or compound tasks.

5. Applications, Limitations, and Recommendations

Applications of GATE-VLAP include:

  • Imitation Learning: Training low-level operator policies with symbolic and temporal precision.
  • Hierarchical Reinforcement/Curriculum Learning: Utilizing atomic segments as reusable "skills" for higher-level policy training.
  • Symbolic Planning and Operator Verification: Direct evaluation and learning of STRIPS/HTN-like representations.
  • Robotic Dataset Augmentation: Recombination of validated atomic segments across new tasks and environments.

Limitations and prospective directions:

  • GATE-VLAP’s current validation is confined to simulated environments (LIBERO); performance in real-world transfer remains untested.
  • Segmentation fidelity depends on the completeness of BDDL scene descriptions; tasks without symbolic layouts are currently outside scope.
  • Sensitivity to video noise or highly dynamic scenes may degrade segmenter reliability.
  • Recommendations include automatic scene descriptor extraction (beyond BDDL), self-supervised boundary refinement using motion or affordance cues, extension to diverse robotic datasets (e.g., R3M, VIP) and to multi-room/3D scenarios, and augmentation of annotations with natural-language paraphrases for enriched language–action modeling.

6. Context and Relation to Prior Work

GATE-VLAP represents a critical evolution in robotic demonstration datasets by explicitly aligning atomic action segments with planner semantics and temporal boundaries (Tabakov et al., 12 Dec 2025). Unlike prior datasets focused on high-level goal or task completions, GATE-VLAP offers compositional, symbolic, and fine-grained labels for each action unit, enabling more effective interfaces between symbolic planners and policy learners. The validation protocol and high-confidence labeling facilitate robust benchmarking and cross-method reproducibility. This approach situates GATE-VLAP as a foundational resource for VLA research emphasizing modularity, compositional generalization, and integration between robot learning and planning systems.
