GATE-VLAP Dataset for Vision-Language-Action
- GATE-VLAP Dataset is a validated corpus of 2,124 atomic action segments with planner alignment and STRIPS-style annotations for robotic manipulation tasks.
- Segmentation is achieved using the Atomic Action Slicing framework with Gemini models, ensuring precise temporal boundaries and robust confidence scoring.
- The dataset enhances hierarchical policy learning and symbolic planning, yielding up to 5 percentage point improvements in long-horizon task success rates.
The GATE-VLAP Dataset is a validated corpus of 2,124 planner-aligned atomic action segments designed to facilitate advances in vision-language-action (VLA) agents, especially in settings requiring complex skill composition. Released alongside the Atomic Action Slicing (AAS) framework by Tabakov et al., GATE-VLAP targets long-horizon robotic manipulation tasks by providing segmented, typed, and temporally localized action annotations derived from the LIBERO-Goal and LIBERO-Long demonstration suites. Each segment is rigorously labeled with action taxonomy, STRIPS-like logical pre/postconditions, boundaries, and a confidence score, forming a unique bridge between planner interface requirements and policy learning in modern VLA architectures (Tabakov et al., 12 Dec 2025).
1. Dataset Composition and Action Taxonomy
GATE-VLAP comprises 2,124 atomic segments extracted from 825 demonstrations: 758 segments from LIBERO-Goal (434 demos) and 1,366 from LIBERO-Long (391 demos). The atomic action types are distributed as follows: place/put (24%), grasp/pick (22%), open (drawer/door; 18%), close (12%), move/push (10%), and other actions such as pour or stack (14%). Action segment counts are moderately imbalanced (coefficient of variation ≈ 0.65), with the most frequent type ("place") appearing in ~510 segments and the rarest ("pour") in ~30.
Atomic actions adhere to a discrete STRIPS-style schema, with each segment record containing:
- Action Symbol: e.g., `place_bowl_in_drawer(bowl, drawer)`
- Arguments: slot-specific, e.g., (object, container)
- Preconditions / Postconditions: in BDDL/PDDL logic
- Temporal Bounds: `start_frame`, `end_frame`, span duration
- Confidence Score: per-segment score (see Section 2)

Representative pre/postcondition structures are:

- `open_drawer(object, drawer)`:
  - Pre: `clear(drawer)`, `closed(drawer)`
  - Post: `open(drawer)`
  - Typical Span: 20–60 frames
- `grasp(object)`:
  - Pre: `reachable(object)`, `¬grasped(object)`
  - Post: `grasped(object)`
  - Span: 15–45 frames
- `place(object, target)`:
  - Pre: `grasped(object)`, `clear(target)`
  - Post: `in(object, target)`, `¬grasped(object)`
  - Span: 30–120 frames
The action taxonomy and symbolic labeling provide explicit grounding for downstream planners and hierarchical policy learning.
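For concreteness, the schema can be mirrored in code. The following minimal Python sketch defines a segment record with the fields listed above; the `AtomicSegment` class and the example values are illustrative and not part of the official release, though the field names follow the annotation format shown in Section 3.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AtomicSegment:
    """Illustrative container for one GATE-VLAP segment record (not official tooling)."""
    action: str                                     # e.g. "place_bowl_in_drawer(bowl, drawer)"
    start_frame: int                                # inclusive temporal bound
    end_frame: int                                  # inclusive temporal bound
    pre: List[str] = field(default_factory=list)    # STRIPS-style preconditions
    post: List[str] = field(default_factory=list)   # STRIPS-style postconditions
    confidence: float = 0.0                         # per-segment confidence score

    @property
    def span(self) -> int:
        """Segment duration in frames."""
        return self.end_frame - self.start_frame + 1

# Hypothetical instance of the place(object, target) operator from the taxonomy above.
seg = AtomicSegment(
    action="place(bowl, drawer)",
    start_frame=100, end_frame=170,
    pre=["grasped(bowl)", "clear(drawer)"],
    post=["in(bowl, drawer)", "¬grasped(bowl)"],
    confidence=0.9,
)
assert 30 <= seg.span <= 120   # within the typical span range for place
```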
2. Segmentation, Confidence Labeling, and Validation Protocols
Segmentation of demonstrations into atomic actions utilizes AAS with Gemini 2.5 (Flash and Pro) models, ensuring alignment with planner-generated action sequences. Each candidate segment is subjected to a threefold validation:
- Count: The number of predicted segments must match the number of planner steps.
- Order: The predicted label sequence must match the planner-derived sequence.
- Duration: Each segment's span must fall within the valid duration range for its action class.
Following this validation, the per-segment confidence score is assigned as a weighted combination

$$c = w_1\, c_{\mathrm{VLM}} + w_2\, c_{\mathrm{span}} + w_3\, c_{\mathrm{align}}$$

where
- $c_{\mathrm{VLM}}$ is the VLM boundary confidence,
- $c_{\mathrm{span}}$ measures the fit of the predicted span within the class duration range, and
- $c_{\mathrm{align}}$ measures predicted vs. plan-interval agreement under ±2-frame shifts.

Recommended values for the weights $w_1$, $w_2$, $w_3$ are reported in (Tabakov et al., 12 Dec 2025).
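A minimal sketch of this validation-and-scoring step is shown below. The helper names, the span-range table, and the example weight values are assumptions for illustration; the published weights and component definitions are those of Tabakov et al.

```python
from typing import List, Tuple

# Typical per-class span ranges in frames, taken from Section 1.
SPAN_RANGES = {"open": (20, 60), "grasp": (15, 45), "place": (30, 120)}

def validate(pred: List[Tuple[str, int, int]], plan: List[str]) -> bool:
    """Threefold check: segment count, label order, and per-class duration."""
    if len(pred) != len(plan):                            # count
        return False
    if [label for label, _, _ in pred] != plan:           # order
        return False
    for label, start, end in pred:                        # duration
        lo, hi = SPAN_RANGES.get(label, (1, 10**6))
        if not (lo <= end - start + 1 <= hi):
            return False
    return True

def confidence(c_vlm: float, c_span: float, c_align: float,
               w=(0.5, 0.25, 0.25)) -> float:
    """Weighted combination c = w1*c_vlm + w2*c_span + w3*c_align.
    The weight values here are placeholders, not the published settings."""
    w1, w2, w3 = w
    return w1 * c_vlm + w2 * c_span + w3 * c_align

plan = ["grasp", "place"]
pred = [("grasp", 14, 48), ("place", 49, 130)]
if validate(pred, plan):
    print(confidence(0.9, 0.85, 0.8))   # 0.8625 with the placeholder weights
```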
Key metrics (100 held-out demos, Gemini 2.5 Pro model):
| Metric | Value (Pro) |
|---|---|
| Segmentation Success Rate | 93% |
| Avg. Segments per Demo | ≈ 3.4 |
| Mean Kendall's W (Label Concordance) | 0.9136 |
| Stability@Jitter (avg. segment IoU under ±2-frame jitter) | 0.92 |
| Mean Frame IoU vs. Planner | 0.88 |
| MAE (Start/End) | ≈ ±3 frames |
Statistical significance is established via 95% bootstrap confidence intervals and nonparametric Wilcoxon tests, which confirm that the Pro model outperforms Flash.
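For reference, the interval metrics above can be computed as follows. This is a generic sketch of frame-level IoU, boundary MAE, and a jitter-stability check over inclusive frame intervals; the exact Stability@Jitter definition used by the authors may differ.

```python
def frame_iou(a, b):
    """IoU of two inclusive frame intervals a=(start, end), b=(start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union if union else 0.0

def boundary_mae(pred, ref):
    """Mean absolute error over start and end frames of paired intervals."""
    errs = [abs(p[0] - r[0]) + abs(p[1] - r[1]) for p, r in zip(pred, ref)]
    return sum(errs) / (2 * len(pred))

def stability_at_jitter(pred, ref, jitter=2):
    """Mean IoU of predicted intervals against reference intervals shifted by ±jitter frames."""
    ious = []
    for p, r in zip(pred, ref):
        for d in (-jitter, jitter):
            ious.append(frame_iou(p, (r[0] + d, r[1] + d)))
    return sum(ious) / len(ious)

pred = [(14, 48), (49, 130)]
ref  = [(12, 50), (51, 128)]
print(frame_iou(pred[0], ref[0]), boundary_mae(pred, ref))   # ≈0.90, 2.0
print(stability_at_jitter(pred, ref))
```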
3. Data Modalities, Format, and Access
GATE-VLAP provides multimodal data:
- RGB Video: .jpg frames at 10 fps
- Robot State Logs: per-frame joint angles, gripper width (.json)
- Symbolic Scene Descriptions: BDDL format (.json)
- Atomic-Action Annotations: per-episode JSON arrays
The dataset directory is hierarchically organized into frames/ (LIBERO suites/episodes), states/, scene_desc/, annotations/, and a splits.json file indicating train/validation/test splits. Each annotation record includes:
```json
{
  "episode_id": "libero_goal/0001",
  "segments": [
    {
      "action": "grasp(bowl)",
      "start_frame": 14, "end_frame": 48,
      "pre": ["¬grasped(bowl)", "clear(table)"],
      "post": ["grasped(bowl)"],
      "confidence": 0.86
    }
    ...
  ]
}
```
Minimal usage can be demonstrated as:
```python
from datasets import load_dataset
from PIL import Image
from torch.utils.data import DataLoader

# Load the dataset from the Hugging Face Hub.
ds = load_dataset("gate-institute/GATE-VLAP-datasets")

def collate_fn(batch):
    """Pair each example's first atomic segment with its RGB frames."""
    images, texts = [], []
    for ex in batch:
        seg = ex["segments"][0]
        fnames = [f"{ex['episode_id']}/frames/{i:04d}.jpg"
                  for i in range(seg["start_frame"], seg["end_frame"] + 1)]
        frames = [Image.open(fpath) for fpath in fnames]
        images.append(frames)
        texts.append(seg["action"])
    return {"images": images, "texts": texts}

train_loader = DataLoader(ds["train"], batch_size=8, shuffle=True, collate_fn=collate_fn)
```
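Because actions and pre/postconditions are stored as symbolic strings, a small parser is often convenient when feeding annotations to a planner. The helper below is an illustrative sketch and is not shipped with the dataset.

```python
import re

def parse_predicate(expr: str):
    """Parse strings like "¬grasped(bowl)" or "in(bowl, drawer)" into
    (negated, name, args). Illustrative helper, not official tooling."""
    expr = expr.strip()
    negated = expr.startswith("¬")
    if negated:
        expr = expr[1:]
    m = re.fullmatch(r"(\w+)\((.*)\)", expr)
    if not m:
        return negated, expr, ()
    name, arg_str = m.groups()
    args = tuple(a.strip() for a in arg_str.split(",")) if arg_str else ()
    return negated, name, args

print(parse_predicate("¬grasped(bowl)"))     # (True, 'grasped', ('bowl',))
print(parse_predicate("in(bowl, drawer)"))   # (False, 'in', ('bowl', 'drawer'))
```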
4. Benchmarking and Downstream Performance
Fine-tuning the CLIP-RT+ model on GATE-VLAP’s 2,124 atomic segments produces the following task success rates:
| Task Suite | Baseline CLIP-RT+ | CLIP-RT+ + GATE-VLAP |
|---|---|---|
| LIBERO-Goal | 94.2% | 95.3% (+1.1 pp) |
| LIBERO-Long | 83.8% | 88.8% (+5.0 pp) |
These improvements indicate that atomic segments have a modest effect on short-horizon tasks (reinforcing individual operators) and a markedly positive impact (+5.0 pp) on long-horizon, compositional scenarios where error propagation is otherwise acute (Tabakov et al., 12 Dec 2025). A plausible implication is that atomic-level demonstrations enhance VLA agents' ability to combine skills for novel or compound tasks.
5. Applications, Limitations, and Recommendations
Applications of GATE-VLAP include:
- Imitation Learning: Training low-level operator policies with symbolic and temporal precision.
- Hierarchical Reinforcement/Curriculum Learning: Utilizing atomic segments as reusable "skills" for higher-level policy training.
- Symbolic Planning and Operator Verification: Direct evaluation and learning of STRIPS/HTN-like representations.
- Robotic Dataset Augmentation: Recombination of validated atomic segments across new tasks and environments.
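As an illustration of the recombination and operator-verification use cases, the sketch below filters segments by confidence and chains two operators by checking that the first segment's postconditions establish the second's preconditions. The closed-world state tracking and the example confidence values are simplifying assumptions, not the authors' pipeline.

```python
def satisfies(state: set, preconditions: list) -> bool:
    """True if every positive precondition holds in the state and every
    negated one ("¬...") is absent. Simplified closed-world check."""
    for lit in preconditions:
        if lit.startswith("¬"):
            if lit[1:] in state:
                return False
        elif lit not in state:
            return False
    return True

def apply(state: set, postconditions: list) -> set:
    """Apply STRIPS-style effects: add positive literals, delete negated ones."""
    new_state = set(state)
    for lit in postconditions:
        if lit.startswith("¬"):
            new_state.discard(lit[1:])
        else:
            new_state.add(lit)
    return new_state

# Chain two validated segments: grasp(bowl) followed by place(bowl, drawer).
state = {"reachable(bowl)", "clear(drawer)"}
grasp = {"pre": ["reachable(bowl)", "¬grasped(bowl)"], "post": ["grasped(bowl)"], "confidence": 0.86}
place = {"pre": ["grasped(bowl)", "clear(drawer)"], "post": ["in(bowl, drawer)", "¬grasped(bowl)"], "confidence": 0.91}

for seg in (grasp, place):
    assert seg["confidence"] >= 0.8 and satisfies(state, seg["pre"])
    state = apply(state, seg["post"])
print(state)   # contains in(bowl, drawer); grasped(bowl) has been deleted
```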
Limitations and prospective directions:
- GATE-VLAP’s current validation is confined to simulated environments (LIBERO); performance in real-world transfer remains untested.
- Segmentation fidelity depends on the completeness of BDDL scene descriptions; tasks without symbolic layouts are currently outside scope.
- Sensitivity to video noise or highly dynamic scenes may degrade segmenter reliability.
- Recommendations include the development of automatic scene descriptor extraction (beyond BDDL), self-supervised boundary refinement using motion or affordance cues, extension to diverse robotic datasets (e.g., R3M, VIP), multi-room/3D scenarios, and augmenting annotations with natural-language paraphrases for enriched language–action modeling.
6. Context and Relation to Prior Work
GATE-VLAP represents a critical evolution in robotic demonstration datasets by explicitly aligning atomic action segments with planner semantics and temporal boundaries (Tabakov et al., 12 Dec 2025). Unlike prior datasets focused on high-level goal or task completions, GATE-VLAP offers compositional, symbolic, and fine-grained labels for each action unit, enabling more effective interfaces between symbolic planners and policy learners. The validation protocol and high-confidence labeling facilitate robust benchmarking and cross-method reproducibility. This approach situates GATE-VLAP as a foundational resource for VLA research emphasizing modularity, compositional generalization, and integration between robot learning and planning systems.