LIBERO Benchmark: Vision-Language-Action in Robotics
- LIBERO is a standardized and extensible benchmarking suite that evaluates multi-modal vision-language-action models in diverse robotic manipulation tasks.
- It employs a procedural generation pipeline and curated simulation/annotation infrastructure to create varied tasks that test generalization, robustness, and transfer capabilities.
- Empirical findings show high success rates under clean conditions yet reveal model brittleness under perturbations, highlighting challenges in lifelong, compositional learning.
LIBERO is a standardized and extensible benchmarking suite for assessing multi-modal, vision-language-action (VLA) models in robot manipulation, with a particular emphasis on lifelong, multi-task, and knowledge-transfer learning across diverse and procedurally generated environments. It comprises both a task-generation engine and curated simulation/annotation infrastructure designed to probe a spectrum of generalization, robustness, and transfer challenges in embodied AI (Liu et al., 2023).
1. Benchmark Definition, Task Suites, and Protocols
LIBERO formalizes the lifelong decision-making (LLDM) problem by focusing on both declarative (object, spatial) and procedural (policy/action) knowledge transfer.
Canonical Task Formalism:
$$T = (l, O, E, \mu_0, g)$$
where $l$ is the language instruction, $O$ the set of scene objects, $E$ the environment layout, $\mu_0$ the initial state distribution, and $g$ the goal predicate.
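The task components above (instruction, objects, layout, initial-state distribution, goal predicate) can be sketched as a minimal Python structure; all field names and values here are illustrative assumptions, not LIBERO's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of a LIBERO-style task tuple; field names are
# illustrative, not the benchmark's real interface.
@dataclass
class Task:
    instruction: str                         # language instruction
    objects: List[str]                       # scene objects
    layout: str                              # environment layout identifier
    sample_init_state: Callable[[], dict]    # initial state distribution
    goal_predicate: Callable[[dict], bool]   # maps a state to success

task = Task(
    instruction="Open the drawer of the cabinet",
    objects=["cabinet", "drawer"],
    layout="kitchen_scene_1",
    sample_init_state=lambda: {"drawer_open": 0.0},
    goal_predicate=lambda s: s["drawer_open"] > 0.9,
)
print(task.goal_predicate({"drawer_open": 1.0}))  # True
```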
Core Suites (Original LIBERO):
- Spatial: 10 tasks, object(s) fixed, spatial arrangements/predicates vary.
- Object: 10 tasks, object types vary, spatial/scene fixed.
- Goal: 10 tasks, same scene/object but variable goal action predicates.
- Long: 10 tasks, compositional/long-horizon multi-step instructions.
- LIBERO-90/100: Large-scale (90 or 100 tasks) multitask suites, concatenating all above; allows mixed-mode and pretraining evaluation.
Each task includes 50 high-quality, human-teleoperated demonstrations (each a sequence of images, robot proprioception, and continuous 7-DoF end-effector/gripper actions), rendered using MuJoCo and robosuite (Wu et al., 6 Aug 2025). The benchmark imposes precise success criteria, typically requiring the goal predicate to be satisfied continuously for ≥10 time steps to avoid spurious credit.
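The "satisfied continuously for ≥10 time steps" criterion can be sketched as a small helper (a hypothetical illustration, not the benchmark's code):

```python
def rollout_success(goal_flags, hold_steps=10):
    """Return True if the goal predicate held for >= hold_steps consecutive steps.

    goal_flags: per-timestep booleans, goal_flags[t] = g(state at step t).
    """
    run = 0
    for flag in goal_flags:
        run = run + 1 if flag else 0  # reset the streak whenever g fails
        if run >= hold_steps:
            return True
    return False

# A brief flicker of success does not count; a sustained hold does.
print(rollout_success([False] * 5 + [True] * 3 + [False] * 2))  # False
print(rollout_success([False] * 5 + [True] * 12))               # True
```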
Evaluation Metrics:
- Success Rate: Main metric, the percentage of rollouts in which the goal predicate $g$ is satisfied (per the success criteria above).
- Forward Transfer (FWT), Negative Backward Transfer (NBT), and Area Under Curve (AUC): Used for lifelong settings to quantify adaptation and forgetting (Liu et al., 2023).
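The lifelong metrics can be illustrated from a success matrix recording performance on each task after each training stage; this is a deliberately simplified sketch, and the exact FWT/NBT/AUC definitions in Liu et al. (2023) differ in detail.

```python
import numpy as np

# S[i, j] = success rate on task j after training through task i.
# Diagonal entries capture adaptation; the last row captures retention.
def forward_transfer(S):
    # Average success on each task right after it is learned (simplified FWT).
    return float(np.mean(np.diag(S)))

def negative_backward_transfer(S):
    # Average drop on earlier tasks after all later training (forgetting).
    n = S.shape[0]
    drops = [S[i, i] - S[n - 1, i] for i in range(n - 1)]
    return float(np.mean(drops))

S = np.array([[0.9, 0.0, 0.0],
              [0.7, 0.8, 0.0],
              [0.6, 0.6, 0.9]])
print(forward_transfer(S))            # mean of the diagonal, ≈0.867
print(negative_backward_transfer(S))  # mean drop on tasks 0 and 1, 0.25
```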
2. Procedural Generation, Data Collection, and Annotation
LIBERO’s procedural generation pipeline produces potentially infinite task variations:
- Extracts behavioral templates from large-scale activity datasets (e.g., Ego4D).
- Instantiates templates into natural-language instructions and goal predicates (e.g., “Open the drawer of the cabinet”).
- Samples scene layouts and generates PDDL problem files specifying object placements and task semantics.
- BDDL (Behavioral Domain Definition Language) encodes these specifications and supports automated mapping to robosuite environments for both data collection and simulation (Wu et al., 6 Aug 2025).
Demonstrations: Collected using a SpaceMouse teleoperator at 20 Hz with automatic filtering and validation pipelines; available as HDF5 for reproducibility.
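Reading such demonstrations can be sketched with h5py; the group and dataset names below are a hypothetical layout, not the benchmark's actual schema, and the sketch writes a toy file first so it is self-contained.

```python
import h5py
import numpy as np

# Hypothetical LIBERO-style HDF5 layout; names are illustrative only.
def write_demo(path, T=20):
    with h5py.File(path, "w") as f:
        g = f.create_group("data/demo_0")
        g.create_dataset("obs/agentview_rgb",
                         data=np.zeros((T, 128, 128, 3), np.uint8))  # images
        g.create_dataset("obs/proprio",
                         data=np.zeros((T, 9), np.float32))          # proprioception
        g.create_dataset("actions",
                         data=np.zeros((T, 7), np.float32))          # 7-DoF EE + gripper

def load_demo(path, demo_key="demo_0"):
    with h5py.File(path, "r") as f:
        demo = f["data"][demo_key]
        return {k: np.asarray(demo[k])
                for k in ("obs/agentview_rgb", "obs/proprio", "actions")}

write_demo("demo.h5")
traj = load_demo("demo.h5")
print(traj["actions"].shape)  # (20, 7)
```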
Annotation Extensions:
- LIBERO+: Adds pixel-level segmentation masks, bounding boxes, temporal instance IDs, and task-relevant object lists. Supports object-centric and object-relation policy research and enables slot-based perception models (Hanyu et al., 10 Nov 2025).
- Task-Relevant Filtering: Each language instruction is programmatically parsed to extract the active objects, enhancing downstream slot selection.
3. Architectural and Algorithmic Landscape
LIBERO benchmarks a broad array of neural architectures and algorithms:
Architectures:
- Early models include ResNet-RNN, ResNet-Transformer, and ViT-Transformer hybrids, tracing a trajectory from recurrent nets to pure Transformers (Liu et al., 2023).
- FiLM-based conditioning for vision–language fusion (e.g., BAKU), using instruction-dependent γ, β modulation of CNN features (Haldar et al., 11 Jun 2024).
- Diffusion Transformers (MDT, MoDE) for multimodal goal learning, with auxiliary self-supervision (contrastive alignment, masked foresight) for handling sparse language supervision (Reuss et al., 8 Jul 2024, Reuss et al., 17 Dec 2024).
- Unified Token Approaches (UniVLA): All modalities quantized as discrete tokens enabling pairwise and causal modeling via a single autoregressive Transformer (Wang et al., 24 Jun 2025).
- ActionFlow (ActionSink): Action sequences framed as action-caused optical flows, with dynamic memory-augmented retrieval and fusion (Guo et al., 5 Aug 2025).
- Slot-based Perception (SlotVLA/LIBERO+): Inputs encoded into object-centric slots and relation tokens for structural generalization and interpretability (Hanyu et al., 10 Nov 2025).
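The FiLM-style conditioning mentioned above can be sketched in plain numpy: a language embedding predicts per-channel γ, β that modulate CNN feature maps. The linear γ/β generators here are toy stand-ins for learned networks, not BAKU's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, lang_emb, W_gamma, W_beta):
    # features: (C, H, W) CNN feature maps; lang_emb: (D,) instruction embedding.
    gamma = W_gamma @ lang_emb  # (C,) per-channel scale
    beta = W_beta @ lang_emb    # (C,) per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

C, H, W, D = 8, 4, 4, 16
features = rng.standard_normal((C, H, W))
lang_emb = rng.standard_normal(D)
out = film(features, lang_emb,
           rng.standard_normal((C, D)), rng.standard_normal((C, D)))
print(out.shape)  # (8, 4, 4): same spatial map, instruction-modulated channels
```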
Algorithmic Frameworks:
- Behavior Cloning (BC): $\mathcal{L}^{BC} = \mathbb{E}_{(o, a)} \| a - \hat{a} \|_2^2$, standard in LIBERO (with or without action chunking).
- Lifelong Learning Algorithms: Sequential finetuning (SeqL), Experience Replay (ER), EWC (regularization), PackNet (dynamic architecture), BUDS, LOTUS, M2Distill (multi-modal distillation with latent consistency and GMM KL regularization), T2S (Tokenized Skill Scaling), among others (Roy et al., 30 Sep 2024, Zhang et al., 2 Aug 2025).
- RL Fine-tuning: Both on-policy methods (PPO with GAE, GRPO) (Zang et al., 8 Oct 2025, Li et al., 11 Sep 2025) and specialized flow-based policy RL (Flow-SDE, Flow-Noise) yield large gains over SFT and BC, with specific adaptations for macro-step credit assignment, action chunking, and parallelized simulation (Chen et al., 29 Oct 2025).
- In-context Learning (ICIL): Transformer and SSM (state-space model, Longhorn) backbones for few-shot adaptation from demonstration prompts, with robust scaling in sequence length and rollout efficiency (Yoo et al., 24 Sep 2025).
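The BC objective above, extended with action chunking (the policy predicts the next k actions at once), can be sketched with toy arrays; the chunking helper and shapes are illustrative, not any specific model's code.

```python
import numpy as np

def bc_loss(pred_chunks, demo_chunks):
    # Mean squared error between predicted and demonstrated action chunks.
    # pred_chunks, demo_chunks: (batch, k, action_dim)
    return float(np.mean(np.sum((demo_chunks - pred_chunks) ** 2, axis=-1)))

def chunk_actions(actions, k):
    # Slice a demo trajectory (T, action_dim) into overlapping k-step chunks.
    T = actions.shape[0]
    return np.stack([actions[t:t + k] for t in range(T - k + 1)])

actions = np.arange(12, dtype=np.float64).reshape(6, 2)  # toy (T=6, dim=2) demo
chunks = chunk_actions(actions, k=3)
print(chunks.shape)             # (4, 3, 2)
print(bc_loss(chunks, chunks))  # 0.0 for a perfect prediction
```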
4. Robustness, Compositionality, and Limitations
While LIBERO standardized evaluation procedures, critical analyses identified significant limitations:
- Overfitting to Train/Test Homogeneity: Minimal distributional shift in the default splits allows rote memorization, with SOTA models collapsing to near-zero success under moderate task, language, object, or environment perturbations (Zhou et al., 4 Oct 2025).
- Insensitivity to Language: models largely ignore instructions; blanking or corrupting them causes little change in performance, with policies defaulting to vision–action patterns learned during training (Fei et al., 15 Oct 2025).
- LIBERO-PRO and LIBERO-Plus: Remediate by systematically injecting controlled perturbations (object color/identity, initial state, environment, compositional language rewrites, camera, light, noise, background) to probe generalization and compositional reasoning (Zhou et al., 4 Oct 2025, Fei et al., 15 Oct 2025). Measured across these axes, reported LIBERO SOTA models degrade from ≈98% to 0–40% in perturbed regimes.
- Object-Relation Reasoning and Structured Perception: LIBERO+ demonstrates that slot-object modeling, especially when paired with object-centric annotation, drastically reduces required compute while preserving generalization (e.g., 28 vs. 256 vision tokens, ≈3× fewer GFLOPs) (Hanyu et al., 10 Nov 2025).
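The perturbation-based evaluation idea behind LIBERO-PRO and LIBERO-Plus can be sketched as a wrapper that applies one controlled change per axis and re-measures success; the axis names and transforms below are illustrative assumptions, not the benchmarks' implementations.

```python
# Toy perturbation axes; each maps a task dict to a modified copy.
PERTURBATIONS = {
    "language": lambda task: {**task, "instruction":
        task["instruction"].replace("Open", "Pull open")},  # compositional rewrite
    "object": lambda task: {**task,
        "objects": task["objects"] + ["distractor_mug"]},   # added distractor
    "camera": lambda task: {**task,
        "camera_yaw_deg": task.get("camera_yaw_deg", 0) + 15},
}

def perturbed_eval(task, axes, evaluate):
    # evaluate(task) -> success rate in [0, 1]; apply one axis at a time
    # so each axis's contribution to degradation can be isolated.
    return {axis: evaluate(PERTURBATIONS[axis](task)) for axis in axes}

task = {"instruction": "Open the drawer of the cabinet", "objects": ["cabinet"]}
results = perturbed_eval(task, ["language", "object"], evaluate=lambda t: 0.0)
print(results)  # {'language': 0.0, 'object': 0.0}
```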
5. Empirical Findings, Model Performance, and Insights
Canonical Model/Table: SOTA results on LIBERO (unperturbed)
| Model | LIBERO-Goal | LIBERO-Object | LIBERO-Spatial | LIBERO-Long | LIBERO-90 | LIBERO-130 | Avg. |
|---|---|---|---|---|---|---|---|
| BAKU | — | — | — | — | 90.0% | — | — |
| MoDE (pretr.) | — | — | — | — | 95.0% | — | — |
| VITA-VLA | 97.9% | 99.8% | 98.0% | 93.5% | — | — | 97.3% |
| RLinf-VLA (RL) | 98.79% | 99.80% | 99.40% | 93.95% | 98.59% | 98.11% | 98.11% |
| UniVLA | 93.6% | 98.8% | 95.4% | 94.0% | — | — | 95.5% |
| ActionSink | — | — | — | 47.0%* | — | — | 68.1%* |
| SimpleVLA-RL | 99.2% | 99.1% | 99.4% | 98.5% | — | — | 99.1% |
- For ActionSink, suites not listed were not separately measured.
Noteworthy findings:
- RL fine-tuning yields dramatic gains over all SFT and BC pipelines. For example, RLinf-VLA achieves absolute improvements up to +45 pp on LIBERO-10/Long over BC (Zang et al., 8 Oct 2025).
- Sample efficiency: RL techniques (e.g., SimpleVLA-RL) approach or surpass full-data SFT results with only 40–200 demos per suite (Chen et al., 29 Oct 2025, Li et al., 11 Sep 2025).
- Token-based parameter scaling (T2S) and multi-modal latent distillation (M2Distill) eliminate catastrophic forgetting in lifelong settings (NBT ≈ 0–1%) while preserving high FWT (Zhang et al., 2 Aug 2025, Roy et al., 30 Sep 2024).
- Multi-modal auxiliary losses (e.g., MDT's CLA+MGF) enable robust language grounding with sparse labeled data (2% labeled, >72% overall success) (Reuss et al., 8 Jul 2024).
- Slot-object architectures (SlotVLA) reduce perception cost by > 85%, enabling interpretable policy heads while matching dense-token models on object/goal/generalization (Hanyu et al., 10 Nov 2025).
6. Design Principles, Best Practices, and Extensions
LIBERO supports the evolution of generalist manipulation benchmarks through the following methodologies:
- Procedural and linguistic variability: Supports continual expansion for lifelong, OOD-generalization, and compositional benchmarks.
- Plugin annotation infrastructures: LIBERO+ enables structured perception and grounding for object-centric architectures, advancing explainability and sample-efficient learning.
- Rapid multicore simulation and vectorized rollouts: RLinf-VLA demonstrates >1.6× throughput gains for parallelized training on LIBERO (Zang et al., 8 Oct 2025).
- Standardized evaluation, but explicit need for perturbation-based and compositional testing: LIBERO-PRO and LIBERO-Plus show that without systematic robustness evaluation, benchmark success rates are misleading.
- Behavioral cloning and RL integration: Combined frameworks show additive benefits (RL pushes SFT/BC models from ≈70%→98%+ success).
7. Limitations, Controversies, and Open Challenges
Critiques of LIBERO focus on:
- Memorization vs. understanding: LIBERO's default splits are vulnerable to overfitting and memorization, as shown by near-zero generalization under modest task modifications (Zhou et al., 4 Oct 2025).
- Language grounding: Most models exploit language only superficially; robustness ablations (language deletion or paraphrasing) cause marginal or no drop in most suites (Fei et al., 15 Oct 2025).
- Design for robustness: Early versions of LIBERO do not contain perturbations in viewpoint, lighting, or distractors, requiring LIBERO-PRO and LIBERO-Plus to fill this gap.
- Compositional generalization: While SOTA models interpolate smooth variants of known tasks, even the best models fail at true extrapolation or skill chaining unless directly architected for it (e.g., text-latent manipulation in (Li, 6 May 2025)).
Open challenges include:
- Constructing benchmarks that enforce and measure genuine skill composition, OOD reasoning, and robust language grounding.
- Systematic adoption of analyses where models are tested over real distribution shift axes (object/goal swap, paraphrased instructions, dynamic distractors).
- Developing architectures and algorithms with explicit mechanisms for cross-modal grounding, uncertainty estimation, and adaptive policy compositionality under distribution shift.
In summary, LIBERO and its derivatives establish the de facto standard for VLA robotics benchmarking but must be interpreted with care. Reported SOTA figures (>95% success) reflect performance under clean, nearly identically distributed train/test splits, whereas systematic robustness analyses reveal significant brittleness and a tendency toward memorization. Ongoing work focuses on procedural growth, annotation extensibility, structured perception, RL integration, and, most critically, fair evaluation under meaningful semantic, perceptual, and compositional generalization regimes (Liu et al., 2023, Zhou et al., 4 Oct 2025, Fei et al., 15 Oct 2025, Hanyu et al., 10 Nov 2025).