Reasoning-Annotated LIBERO-100

Updated 25 October 2025
  • The paper introduces a reasoning-annotated LIBERO-100 extension that uses a semi-automated Gemini pipeline to segment demonstrations and generate intermediary chain-of-thought plans.
  • It employs vision-language-action models that alternate between generating textual reasoning tokens and low-level actions to ensure coherent task execution.
  • The approach demonstrates improved out-of-distribution robustness and up to 15% performance gains on novel manipulation tasks via runtime alignment verification.

A reasoning-annotated extension of LIBERO-100 constitutes an enriched version of the LIBERO-100 robot learning benchmark in which each manipulation task and demonstration trajectory is complemented with fine-grained, structured annotations that explicitly encode intermediate reasoning steps, chain-of-thought plans, and task decompositions. This design aims to provide models not only with raw sensory and action data but also with interpretable intermediate textual plans that capture the robot's latent decision process. The annotated dataset serves a dual function: supporting improved policy training and enabling runtime verification of reasoning-action alignment, particularly under out-of-distribution (OOD) scenarios. Recent work introduces a semi-automated annotation pipeline powered by Gemini to segment demonstrations and generate reasoning annotations, and leverages vision-language models (VLMs) for runtime policy steering to enhance robustness and compositional generalization (Wu et al., 18 Oct 2025).

1. Reasoning-Annotated Demonstration Data

The extension applies a semi-automated annotation methodology to the base LIBERO-100 suite, which comprises diverse long-horizon manipulation tasks. Each demonstration is segmented into intermediate sub-tasks via a model-driven pipeline, which generates structured textual plans describing the sequential intentions (e.g., “Plans,” “What has been done,” “Now I need to do”) at each stage of the trajectory.

  • The annotation process utilizes Gemini to infer subtask boundaries and produce plan descriptions for each demonstration without intensive manual labeling.
  • Resulting datasets contain correspondences between observations, high-level instructions, segmented subtasks, and textual reasoning plans.

This enriched data provides explicit chain-of-thought scaffolding for agents, encoding not only the underlying actions but also the rationale and context for each transition, supporting interpretability and facilitating compositional understanding.
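
For concreteness, a single annotated demonstration could be represented with a record along the following lines. This is a minimal sketch of the kind of structure the pipeline produces; the class and field names are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one reasoning-annotated demonstration.
# Field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningSegment:
    start_step: int        # first timestep covered by this subtask
    end_step: int          # last timestep covered by this subtask
    plan: str              # "Plans": remaining high-level steps
    done_so_far: str       # "What has been done": completed subtasks
    current_subtask: str   # "Now I need to do": the active intention

@dataclass
class AnnotatedDemo:
    task_instruction: str                  # original LIBERO-100 language goal
    observation_refs: List[str]            # per-step observation handles
    actions: List[List[float]]             # low-level action vectors
    segments: List[ReasoningSegment] = field(default_factory=list)
```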

2. Training Vision-Language-Action Models with Reasoning

Within this framework, vision-language-action (VLA) models are trained on the reasoning-annotated LIBERO-100 demonstrations as a dual-output prediction problem: at each stage, the model first generates the intermediate textual plan as reasoning tokens and then the associated low-level action tokens, alternating between planning and acting across the trajectory.

The loss function is a weighted combination of a textual reasoning generation loss $L_{\text{reason}}$ and an action generation loss $L_{\text{act}}$:

$$L_{\text{r-vla}}(\theta; D_{\text{reason}}) = \lambda_{\text{reason}} \cdot L_{\text{reason}} + \lambda_{\text{act}} \cdot L_{\text{act}}$$

where $L_{\text{reason}} = -\log \pi_\theta^{\text{r-vla}}(r_j \mid o_t, r_{j-1}, g)$ and $L_{\text{act}} = -\sum_{t'=t}^{t+H} \log \pi_\theta^{\text{r-vla}}(a_{t'} \mid o_{t'}, r_j, g)$.

This interleaved modeling strategy ensures that the agent internalizes the linkage between intermediate reasoning and actual motor execution, improving transparency and offering a basis for downstream verification.
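
As a rough sketch, the weighted objective could be assembled as below. The `policy` object, its `reason_logits`/`action_logits` heads, the batch keys, and the default weights are assumptions for illustration; the paper's actual implementation may differ.

```python
import torch.nn.functional as F

def r_vla_loss(policy, batch, lambda_reason=1.0, lambda_act=1.0):
    """Sketch of L_r-vla = lambda_reason * L_reason + lambda_act * L_act.
    Assumes a policy exposing separate token heads for reasoning and actions."""
    # L_reason: negative log-likelihood of the annotated plan tokens r_j,
    # conditioned on the current observation, previous plan, and goal
    reason_logits = policy.reason_logits(batch["obs"], batch["prev_reason"], batch["goal"])
    loss_reason = F.cross_entropy(reason_logits.flatten(0, 1),
                                  batch["reason_tokens"].flatten())

    # L_act: negative log-likelihood of the action chunk a_t .. a_{t+H},
    # conditioned on the current plan r_j
    act_logits = policy.action_logits(batch["obs_chunk"], batch["reason_tokens"], batch["goal"])
    loss_act = F.cross_entropy(act_logits.flatten(0, 1),
                               batch["action_tokens"].flatten())

    return lambda_reason * loss_reason + lambda_act * loss_act
```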

3. Runtime Reasoning-Action Alignment Verification

To address the “embodied CoT faithfulness gap”—the phenomenon whereby plausible textual plans are not always matched by low-level actions—the system implements a training-free, runtime policy steering method.

  • At each planning juncture, the policy samples $K$ candidate action sequences conditioned on the current textual plan.
  • Each sequence is simulated via a dynamics model to predict the resultant observations.
  • A pre-trained open-world vision-language model (e.g., GPT-4o) serves as a verifier, scoring each candidate’s alignment with the plan.

The candidate with the maximal alignment score is selected for execution, turning the VLA’s output diversity into a source of robustness rather than uncertainty. The steering process is formalized as minimizing an expected alignment loss:

$$L_{\text{align}}(\theta; o_t, r) = -\mathbb{E}_{P,\, \pi_\theta^{\text{r-vla}}}\big[ R_{\text{align}}(o_{t:t+H}, r) \big]$$

where $R_{\text{align}}$ is binary or graded feedback from the verifier.
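
A minimal sketch of this steering loop is given below; `sample_actions`, `rollout`, and `score_alignment` are hypothetical interfaces standing in for the policy, the dynamics model, and the VLM verifier described above.

```python
def steer(policy, dynamics_model, vlm_verifier, obs, reasoning, goal, K=8):
    """Training-free best-of-K steering: sample K candidate action sequences,
    imagine their outcomes with a dynamics model, and execute the candidate
    whose predicted outcome the verifier rates as best aligned with the plan.
    All interfaces here are illustrative, not the paper's exact API."""
    best_score, best_actions = float("-inf"), None
    for _ in range(K):
        # Candidate action chunk a_t .. a_{t+H} conditioned on the textual plan
        actions = policy.sample_actions(obs, reasoning, goal)
        # Predicted observations o_{t:t+H} if this chunk were executed
        predicted_obs = dynamics_model.rollout(obs, actions)
        # Alignment feedback R_align from the vision-language verifier
        score = vlm_verifier.score_alignment(predicted_obs, reasoning)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions
```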

4. Out-of-Distribution (OOD) Robustness

Evaluation on tailored OOD test suites demonstrates the effectiveness of reasoning-annotated data and runtime verification:

  • Semantic perturbations: instruction rephrasing, altered object descriptors.
  • Visual perturbations: object substitutions, scene shifts, and viewpoint changes.

Models using reasoning-annotated LIBERO-100 and runtime alignment verification maintain higher success rates under OOD conditions than baselines, demonstrating improved generalization across both language and perceptual variations.

OOD Evaluation Suite Example

| Variation Type | Description | Robustness Outcome |
|---|---|---|
| Semantic (Lang-Ref) | Rephrased instructions | Improved alignment |
| Visual (Scene/VP) | Scene or viewpoint modifications | Maintained performance |

Success rates remain competitive even as task or visual contexts shift, confirming the strategy’s value beyond in-distribution evaluation.

5. Performance Gains and Behavior Composition

In composition and OOD tasks, the integrated method yields clear quantitative improvements:

  • Reported up to 15% performance gains on novel behavior composition tasks compared to prior work.
  • Gains scale commensurately with increased candidate samples (compute) and more diverse training data, indicating that both data richness and sampling flexibility amplify effectiveness.

The dataset and steering methodology permit models to robustly recombine previously learned skills in new configurations—a central aim of lifelong robot learning.

6. Technical Formulation

Key mathematical formulations employed include:

  • Standard VLA training objective:

$$L_{\text{vla}}(\theta; D) = \sum_{(o_t, a_t) \in D} -\log \pi_\theta^{\text{vla}}(a_t \mid o_t, g)$$

  • Reasoning VLA loss (see above).
  • Best-of-N candidate selection:

$$A_t = \left\{ a_t^{(k)} \sim \pi_\theta^{\text{r-vla}}(\cdot \mid o_t, r) \right\}, \quad k = 1, \ldots, K$$

Only the candidate scored highest by the alignment verifier is executed.

These objectives, augmented by alignment-based candidate selection at runtime, ensure that high-level textual plans and their physical realization remain coherently coupled.

7. Impact and Future Directions

A reasoning-annotated extension of LIBERO-100 provides direct empirical support for hypotheses about the benefits of intermediate reasoning: representation enhancement, improved learning scaffolding, and greater compositionality. By combining structured annotations, model architectures that leverage intermediary plans, and real-time action verification, the method advances robustness and interpretability in embodied agent learning.

Methodologies such as automated annotation pipelines, alignment-based action selection, and OOD task evaluation open new directions for scalable deployment and generalization. Incorporating richer structured reasoning into lifelong learning benchmarks is a promising avenue for future research aimed at robust, compositional, and transparent robot policy design.
