
Atomic Action Slicing Framework

Updated 5 March 2026
  • The paper introduces a framework that segments continuous VLA demonstrations into short, typed atomic action segments aligned with planner outputs.
  • It employs a three-stage process—segmentation, validation, and policy fine-tuning—to ensure robust integration between symbolic planning and imitation learning.
  • Fine-tuning CLIP-RT+ on the resulting GATE-VLAP dataset raises task success on LIBERO-Goal (94.2% to 95.3%) and LIBERO-Long (83.8% to 88.8%), indicating the method’s practical impact on policy learning.

Atomic Action Slicing (AAS) is a planner-aligned framework that decomposes long-horizon demonstrations in vision-language-action (VLA) domains into short, typed atomic action segments. These atomic options align precisely with planner-defined plans, facilitating both effective symbolic planning and improved policy learning. The method produces validated datasets of atomic action segments, each labeled by action type, temporal span, and confidence score, establishing a bridge between high-level planners and low-level control policies in generalist VLA agents (Tabakov et al., 12 Dec 2025).

1. Formalism and Atomic Decomposition

At the core of Atomic Action Slicing is the decomposition of a demonstration episode

$$\tau = (o_{1:T}, s_{1:T}, a_{1:T}, \ell, \mathcal{E})$$

where $o_t$ are image observations, $s_t$ are ground-truth states, $a_t$ are low-level motor commands, $\ell$ is the language instruction, and $\mathcal{E}$ is a symbolic scene graph. The goal is to segment $\tau$ into $K$ contiguous atomic segments:

$$\hat{\Gamma} = \left[ (\hat{o}_k, t_s^{(k)}, t_e^{(k)}, c^{(k)}) \right]_{k=1}^{K}$$

Here, each $\hat{o}_k \in \Sigma$ is a typed atomic action from a fixed schema, with preconditions, effects, temporal bounds $[t_s^{(k)}, t_e^{(k)}]$, and a scalar confidence $c^{(k)} \in [0,1]$. Segmentation constraints ensure full coverage and contiguity:

  • $t_s^{(1)} = 1$, $t_e^{(K)} = T$
  • $t_s^{(k)} \le t_e^{(k)}$, $t_e^{(k)} + 1 = t_s^{(k+1)}$
  • $\hat{o}_k = P[k]$ for the planner’s output $P[1 \ldots K]$

Atomic segments, therefore, are symbolic “options” that are suitable for both planning algorithms and direct imitation learning at the policy level.
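
The coverage, contiguity, and planner-alignment constraints above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the `AtomicSegment` container, the `satisfies_constraints` helper, and the example frame indices are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AtomicSegment:
    """One typed atomic segment (hypothetical container, not from the paper)."""
    action: str        # typed atomic action label, e.g. "grasp(bowl)"
    t_start: int       # first frame index, 1-based and inclusive
    t_end: int         # last frame index, inclusive
    confidence: float  # scalar confidence in [0, 1]


def satisfies_constraints(segments: List[AtomicSegment], plan: List[str], T: int) -> bool:
    """Check coverage, contiguity, and planner alignment for an episode of length T."""
    if not segments or len(segments) != len(plan):
        return False
    # Coverage: the first segment starts at frame 1, the last ends at frame T.
    if segments[0].t_start != 1 or segments[-1].t_end != T:
        return False
    for k, seg in enumerate(segments):
        if seg.t_start > seg.t_end:
            return False
        if not 0.0 <= seg.confidence <= 1.0:
            return False
        # Planner alignment: the k-th label equals the k-th planner step.
        if seg.action != plan[k]:
            return False
        # Contiguity: the next segment starts right after this one ends.
        if k + 1 < len(segments) and seg.t_end + 1 != segments[k + 1].t_start:
            return False
    return True


# Illustrative values only; they are not taken from the dataset.
plan = ["open_drawer(drawer)", "grasp(bowl)", "place(bowl, drawer)"]
segments = [
    AtomicSegment(plan[0], 1, 40, 0.90),
    AtomicSegment(plan[1], 41, 70, 0.80),
    AtomicSegment(plan[2], 71, 120, 0.95),
]
print(satisfies_constraints(segments, plan, T=120))  # True
```

A segmentation with a gap between consecutive segments, or one whose last segment stops before frame $T$, is rejected by the same function.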

2. Taxonomy and Properties of Atomic Actions

The schema $\Sigma$ includes a concise set of typed atomic actions such as:

  • open_drawer(drawer)
  • close_drawer(drawer)
  • grasp(object)
  • release(object)
  • place(object, receptacle)
  • lift(object)
  • lower(object)
  • push(object)
  • pull(object)

Each action $o \in \Sigma$ is parameterized by arguments $\beta_o$ (e.g., “bowl”, “drawer”) and is formally linked to planning primitives through:

  • Preconditions $\mathrm{pre}(o)$ (e.g., grasped(bowl), isOpen(drawer))
  • Effects $\mathrm{eff}(o)$ (e.g., in(bowl, drawer), ¬grasped(bowl))
  • Terminal condition (e.g., “end-effector exits drawer mouth”)
  • Typical duration range $[d_{\min}(o), d_{\max}(o)]$

These components map directly onto STRIPS or HTN planner operators, providing symbolic compatibility.

A fixed action schema enables seamless integration with planners and helps constrain representation complexity for generalization.
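
To make the link to planner operators concrete, the sketch below encodes one atomic action as a STRIPS-style operator with add and delete lists. It is an illustration under assumptions: the `Operator` class and `apply_operator` function are hypothetical, and attaching the example preconditions and effects quoted above to `place(bowl, drawer)` is a guess at the intended action, not something the source states.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Operator:
    """STRIPS-style operator (hypothetical encoding of one schema action)."""
    name: str
    pre: FrozenSet[str]     # predicates that must hold before execution
    add: FrozenSet[str]     # predicates made true by the action
    delete: FrozenSet[str]  # predicates made false by the action


def apply_operator(op: Operator, state: FrozenSet[str]) -> Optional[FrozenSet[str]]:
    """Return the successor state, or None if the preconditions are not met."""
    if not op.pre <= state:
        return None
    return (state - op.delete) | op.add


# Assumed mapping of the example predicates to place(bowl, drawer).
place_bowl = Operator(
    name="place(bowl, drawer)",
    pre=frozenset({"grasped(bowl)", "isOpen(drawer)"}),
    add=frozenset({"in(bowl,drawer)"}),
    delete=frozenset({"grasped(bowl)"}),
)

state = frozenset({"grasped(bowl)", "isOpen(drawer)"})
next_state = apply_operator(place_bowl, state)
print(sorted(next_state))  # ['in(bowl,drawer)', 'isOpen(drawer)']
```

Applying the same operator to a state in which the drawer is closed returns `None`, which is how a planner would prune an inapplicable action.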

3. Segmentation Pipeline and Policy Learning

Atomic Action Slicing employs a three-stage process. The core segmentation (Stage II) uses a schema-constrained large vision-language model (VLM), either Gemini 2.5 Flash or Gemini 2.5 Pro, prompted with the instruction $\ell$, the scene $\mathcal{E}$, the schema $\Sigma$, the planner anchors $P[1 \ldots K]$, and few-shot exemplars. The VLM outputs action boundary proposals while enforcing coverage and contiguity. The VLM is not fine-tuned; the method relies on prompting alone, with the few-shot exemplars supplied in context.

Segment validation (Stage III) requires the following conditions:

  • Correct segment count ($K$)
  • Planner label order and timing monotonicity
  • Segment durations within prescribed bounds
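
The three validation conditions listed above could be checked along the following lines. This is a sketch under assumptions rather than the paper's validator: the `validate_segments` function, the tuple layout, and the duration bounds are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (label, start_frame, end_frame), inclusive


def validate_segments(
    segments: List[Segment],
    plan: List[str],
    bounds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return a list of problems; an empty list means the segmentation passes."""
    problems = []
    # 1. Segment count must equal the number of planner steps K.
    if len(segments) != len(plan):
        problems.append(f"expected {len(plan)} segments, got {len(segments)}")
    # 2a. Labels must follow the planner order.
    elif [label for label, _, _ in segments] != plan:
        problems.append("segment labels do not follow the planner order")
    prev_end = 0  # frames are assumed to be 1-based
    for label, start, end in segments:
        # 2b. Timing must be monotonic: no inverted or overlapping spans.
        if start > end or start <= prev_end:
            problems.append(f"{label}: timing is not monotonic")
        prev_end = max(prev_end, end)
        # 3. Duration must fall inside the per-action bounds, when given.
        d_min, d_max = bounds.get(label, (1, float("inf")))
        duration = end - start + 1
        if not d_min <= duration <= d_max:
            problems.append(f"{label}: duration {duration} outside [{d_min}, {d_max}]")
    return problems


# Illustrative plan and bounds; the numbers are made up for this example.
plan = ["grasp(bowl)", "place(bowl, drawer)"]
bounds = {"grasp(bowl)": (5, 60), "place(bowl, drawer)": (10, 80)}
good = [("grasp(bowl)", 1, 30), ("place(bowl, drawer)", 31, 90)]
bad = [("grasp(bowl)", 1, 30), ("place(bowl, drawer)", 25, 200)]
print(validate_segments(good, plan, bounds))       # []
print(len(validate_segments(bad, plan, bounds)))   # 2
```

Returning a list of problems instead of a single boolean makes it easier to see which condition a rejected segmentation violated.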

The resulting atomic-labeled dataset is used to fine-tune the CLIP-RT+ policy via imitation learning:

$$L = - \sum_{(o_{s:t}, e_{s:t})} \log \pi_\theta(a_{s:t} \mid o_{s:t}, \ell)$$

where each atomic segment supplies short, dense sequences for improved policy training.
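
As a rough illustration of the objective, the sketch below sums the negative log-likelihood over the frames of each atomic segment. It assumes that the segment-level likelihood factorizes over time steps, which the source does not state, and it uses a hypothetical `log_prob` callable in place of the real CLIP-RT+ policy.

```python
import math
from typing import Callable, List, Sequence, Tuple

# log_prob(action, observation, instruction) -> log pi_theta(a | o, l)
LogProbFn = Callable[[int, str, str], float]


def atomic_segment_nll(
    log_prob: LogProbFn,
    observations: Sequence[str],
    actions: Sequence[int],
    instruction: str,
    spans: List[Tuple[int, int]],
) -> float:
    """Sum of -log pi(a_t | o_t, l) over every frame of every segment.

    Spans are 1-based and inclusive, matching the segment notation in the text.
    Per-step factorization is an assumption made for this sketch.
    """
    total = 0.0
    for start, end in spans:
        for t in range(start - 1, end):  # convert to 0-based indexing
            total -= log_prob(actions[t], observations[t], instruction)
    return total


def uniform_log_prob(action: int, observation: str, instruction: str) -> float:
    """Toy policy: uniform over 4 discrete actions, ignoring its inputs."""
    return -math.log(4)


observations = ["frame1", "frame2", "frame3", "frame4", "frame5"]
actions = [0, 1, 2, 3, 0]
spans = [(1, 2), (3, 5)]  # two atomic segments covering all 5 frames
loss = atomic_segment_nll(uniform_log_prob, observations, actions, "put the bowl away", spans)
print(math.isclose(loss, 5 * math.log(4)))  # True
```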

4. GATE-VLAP Dataset

Applying AAS to 825 LIBERO demonstrations yields the GATE-VLAP dataset, comprising 2,124 atomic segments (758 LIBERO-Goal, 1,366 LIBERO-Long). Each segment is annotated with its label, start and end frame, and confidence score. This gives approximately 2.6 times as many training instances as the original demonstrations (2,124 segments versus 825 episodes). The validation protocol is intended to ensure high segment quality, and confidence scores are calibrated by aggregating VLM internal signals, segment duration slack, and agreement under keyframe jitter.

| Subset      | Num. segments |
|-------------|---------------|
| LIBERO-Goal | 758           |
| LIBERO-Long | 1,366         |
| Total       | 2,124         |
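
The totals and the quoted expansion factor can be reproduced from the reported counts alone. The snippet below is only an arithmetic check; the variable names are arbitrary.

```python
# Counts as reported for the GATE-VLAP dataset.
segments_per_suite = {"LIBERO-Goal": 758, "LIBERO-Long": 1366}
num_demonstrations = 825

total_segments = sum(segments_per_suite.values())
ratio = total_segments / num_demonstrations

print(total_segments)   # 2124
print(round(ratio, 2))  # 2.57, i.e. roughly 2.6 segments per demonstration
```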

This resource is publicly released to promote reproducibility and further research.

5. Algorithmic Workflow and Planner Integration

AAS is integrated with symbolic planners through a closed-loop system:

  1. Symbolic state tracking: Predicates extracted via RGB-D and object tracking provide $s_{\text{sym}}$.
  2. Planning step: PDDL/HTN planner selects next $\Sigma$-action ($o^*$).
  3. Execution: Policy $\pi_{o^*}$ (CLIP-RT+AA) is run until its terminal condition.
  4. Transition and verification: Symbolic state is updated post-execution.
  5. Repeat until task completion.

A simplified online control pseudocode:

for t = 1 ... T_total:
    if no option is active or current option o_k is complete:
        observe symbolic state s_sym
        o_{k+1} ← Planner.solve(s_sym, goal)
        k ← k + 1
    a_t ← π_{o_k}(o_t, ℓ)
    execute a_t, observe o_{t+1}

This structure aligns symbolic search and learned policies, ensuring correct execution sequencing and coverage.
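
A runnable toy version of this loop is sketched below. Everything in it is a stand-in: `ToyWorld` replaces the simulator and perception stack, `plan_next` replaces a PDDL/HTN planner with a fixed rule, and each option is assumed to terminate after three low-level steps. It only illustrates the replan-on-termination control flow, not the actual system.

```python
from typing import Dict, List, Optional, Set, Tuple

GOAL: Set[str] = {"in(bowl,drawer)"}

# (add list, delete list) per option; a hypothetical three-action domain.
EFFECTS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "open_drawer(drawer)": ({"isOpen(drawer)"}, set()),
    "grasp(bowl)": ({"grasped(bowl)"}, set()),
    "place(bowl, drawer)": ({"in(bowl,drawer)"}, {"grasped(bowl)"}),
}


class ToyWorld:
    """Stand-in for simulator plus perception; each option lasts 3 steps."""

    def __init__(self) -> None:
        self.state: Set[str] = set()
        self.progress = 0

    def symbolic_state(self) -> Set[str]:
        return set(self.state)

    def step(self, option: str) -> bool:
        """Run one low-level step; return True when the option terminates."""
        self.progress += 1
        if self.progress < 3:
            return False
        self.progress = 0
        add, delete = EFFECTS[option]
        self.state = (self.state - delete) | add
        return True


def plan_next(state: Set[str], goal: Set[str]) -> Optional[str]:
    """Toy planner: returns the next option, or None once the goal holds."""
    if goal <= state:
        return None
    if "isOpen(drawer)" not in state:
        return "open_drawer(drawer)"
    if "grasped(bowl)" not in state:
        return "grasp(bowl)"
    return "place(bowl, drawer)"


def run(world: ToyWorld, goal: Set[str], max_steps: int = 100) -> List[str]:
    """Closed loop: replan whenever no option is active, then act one step."""
    executed: List[str] = []
    option: Optional[str] = None
    for _ in range(max_steps):
        if option is None:
            option = plan_next(world.symbolic_state(), goal)
            if option is None:
                break  # goal reached
            executed.append(option)
        # Stands in for a_t = pi_o(o_t, l) followed by executing a_t.
        if world.step(option):
            option = None
    return executed


print(run(ToyWorld(), GOAL))
# ['open_drawer(drawer)', 'grasp(bowl)', 'place(bowl, drawer)']
```

Replanning only at option boundaries keeps the planner out of the per-frame control path, which matches the division of labor described above.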

6. Empirical Evaluation and Quantitative Results

AAS segmentation is benchmarked using 100 demonstration episodes:

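| Metric           | Flash (Gemini 2.5) | Pro (Gemini 2.5) | Δ (Pro–Flash) |
|------------------|--------------------|------------------|---------------|
| Success Rate     | 74.0%              | 93.0%            | +19 pp        |
| Avg Segments     | 3.41               | 3.46             | +0.05         |
| Mean Kendall’s W | 0.9105             | 0.9136           | +0.0031       |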

Downstream task success increases after policy fine-tuning with atomic segments:

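| Task Suite  | Baseline CLIP-RT+ | Fine-tuned CLIP-RT+AA | Δ       |
|-------------|-------------------|-----------------------|---------|
| LIBERO-Goal | 94.2%             | 95.3%                 | +1.1 pp |
| LIBERO-Long | 83.8%             | 88.8%                 | +5.0 pp |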

Segmenters with stronger language-vision alignment (Gemini 2.5 Pro) yield substantially better segmentation (93.0% versus 74.0% success rate) and improved policy performance.

7. Robustness, Limitations, and Future Directions

AAS demonstrates stability under ±2 frame keyframe jitter (IoU_idx ≳ 0.9). Segmentation success drops considerably with smaller VLMs (–19pp) or without planner anchoring/strong schema (degradations of 10–15pp in SeqAcc and EditSim). This indicates that planner guidance and segment validation are critical for robustness.
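
One way to quantify agreement between a reference segmentation and the one obtained under jittered keyframes is a mean frame-index IoU over matched segments. The sketch below assumes that this is roughly what IoU_idx measures; the exact definition used in the paper is not given here, and the function names and frame spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_frame, end_frame), inclusive


def interval_iou(a: Span, b: Span) -> float:
    """Intersection over union of two inclusive frame-index intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def mean_segment_iou(reference: List[Span], perturbed: List[Span]) -> float:
    """Average IoU over segments matched by position in the sequence."""
    if len(reference) != len(perturbed) or not reference:
        raise ValueError("segmentations must be non-empty and of equal length")
    pairs = zip(reference, perturbed)
    return sum(interval_iou(r, p) for r, p in pairs) / len(reference)


# Illustrative spans only: boundaries shifted by at most 2 frames.
reference = [(1, 40), (41, 70), (71, 120)]
perturbed = [(1, 42), (43, 69), (70, 120)]
score = mean_segment_iou(reference, perturbed)
print(score > 0.9)  # True
```

Short segments are the most sensitive to a fixed frame shift, since the same two-frame offset removes a larger share of their overlap.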

Key limitations include dependence on fully specified BDDL scenes, sensitivity to temporal misalignment, and evaluation restricted to simulator environments. Potential extensions:

  • Automatic scene description extraction (e.g., via SLAM and perception)
  • Self-supervised boundary refinement
  • Joint training of segmenter and policy
  • Real-robot experiments and generalization to diverse domains

AAS thus defines a reproducible and planner-compatible pathway for extracting and leveraging atomic actions in VLA agents, significantly advancing compositionality, robustness, and generalization in long-horizon manipulation tasks (Tabakov et al., 12 Dec 2025).

References (1)
