Libero-Object Benchmark (LIBERO)
The Libero-Object Benchmark is a standardized evaluation suite within the LIBERO lifelong robot learning framework, targeting object-centric knowledge transfer in sequential robotic manipulation tasks. It is designed to assess and catalyze research on compositionality, transfer, and robustness of manipulation policies when exposed to a diverse set of objects, facilitating the study of both declarative (object properties, identities) and procedural (manipulation skills) knowledge in a lifelong learning setting.
1. Design and Objectives
The primary objective of the Libero-Object Benchmark is to provide a rigorous, extensible substrate for evaluating lifelong imitation learning, with a focus on object-driven generalization and memory retention. Each task in the suite introduces a novel object for a standard manipulation scenario, typically pick-and-place, requiring agents to continually integrate and utilize knowledge about previously unseen object attributes and dynamics while minimizing catastrophic forgetting of earlier tasks. This enables controlled investigation into the ability of learning algorithms and architectures to transfer object-specific knowledge across a temporal learning curriculum.
The benchmark is part of the broader LIBERO LLDM suite, which encompasses 130 tasks organized into thematic suites addressing spatial, goal, and entangled knowledge in addition to the object-centric Libero-Object subset (Liu et al., 2023).
2. Procedural Generation of Tasks
Libero-Object tasks are generated through a systematic, extensible pipeline based on behavioral templates extracted from large-scale datasets of human activities. The process entails sampling task instructions, configuring initial scene layouts with newly introduced objects, and specifying goal predicates in PDDL. Each task utilizes human-teleoperated demonstrations (typically 50 per task) to provide ground truth and training signals for imitation learning.
All tasks are instantiated in Robosuite, a modular simulation platform, ensuring realism and standardization across evaluations. The object suite is readily extensible: new object models and configurations can be programmatically added, expanding the benchmark's applicability.
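The ingredients of such a task description can be pictured as a small structured record. The sketch below is purely illustrative: the function name `make_task_spec` and the field layout are hypothetical and do not reflect the actual LIBERO/BDDL schema.

```python
# Illustrative sketch of an object-centric task specification in the spirit of
# LIBERO's PDDL-style goal predicates. All field names here are hypothetical.

def make_task_spec(obj_name: str) -> dict:
    """Build a pick-and-place task description for a newly introduced object."""
    return {
        "instruction": f"pick up the {obj_name} and place it in the basket",
        "scene": {
            "objects": [obj_name, "basket"],
            # initial poses would normally be sampled per episode
            "layout": "tabletop_default",
        },
        # goal expressed as a PDDL-style predicate over scene entities
        "goal": ("In", obj_name, "basket"),
        "num_demonstrations": 50,  # human-teleoperated demos per task
    }

spec = make_task_spec("alphabet_soup")
print(spec["goal"])
```

Because each task only swaps in a new object while keeping the scenario template fixed, performance differences across the stream can be attributed to object-knowledge transfer rather than task-structure changes.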
3. Evaluation Protocols and Metrics
Evaluation on LIBERO-OBJECT follows lifelong imitation learning conventions, where algorithms are exposed to a stream of tasks in sequence, without revisiting all past data. Core metrics include:
- Forward Transfer (FWT): Assesses how prior learning accelerates or benefits new task acquisition.
- Negative Backward Transfer (NBT): Quantifies the extent of performance degradation on previous tasks as new ones are learned (measure of catastrophic forgetting).
- Area Under the Success Rate Curve (AUC): Aggregates performance across all tasks and incremental learning stages, summarizing sustained competence.
Formally, let $c_{i,j,e}$ denote the success rate on task $j$ after the agent has trained on task $i$ for $e$ epochs, and let $c_{i,j}$ denote the corresponding best success rate over that training. For $K$ tasks:
$$\begin{split} \text{FWT}_k &= \frac{1}{11}\sum_{e \in \{0, 5, \dots, 50\}} c_{k,k,e} \\ \text{NBT}_k &= \frac{1}{K-k} \sum_{\tau = k+1}^{K} \left(c_{k,k} - c_{\tau,k}\right) \\ \text{AUC}_k &= \frac{1}{K-k+1} \left(\text{FWT}_k + \sum_{\tau=k+1}^{K} c_{\tau,k}\right) \end{split}$$
These per-task quantities are averaged across all $k$ for comprehensive reporting.
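Given a matrix of recorded success rates, the three metrics follow directly from these definitions. A minimal sketch, assuming `c[i, j]` holds the best success rate on task `j` after learning task `i` and `fwt_curves[k]` holds the 11 checkpoint evaluations (epochs 0, 5, ..., 50) gathered while learning task `k`:

```python
import numpy as np

def lifelong_metrics(c: np.ndarray, fwt_curves: np.ndarray):
    """Compute per-task FWT, NBT, and AUC (tasks 0-indexed internally).

    c[i, j]       : best success rate on task j after learning task i
    fwt_curves[k] : success rates on task k at the 11 evaluation
                    checkpoints (epochs 0, 5, ..., 50) while learning k
    """
    K = c.shape[0]
    fwt = fwt_curves.mean(axis=1)          # (1/11) * sum over 11 checkpoints
    nbt = np.zeros(K)
    auc = np.zeros(K)
    for k in range(K):
        later = c[k + 1:, k]               # c[tau, k] for tau > k
        if later.size:                     # NBT is undefined for the last task
            nbt[k] = np.mean(c[k, k] - later)
        # denominator K - k here equals K - k + 1 in the 1-indexed formula
        auc[k] = (fwt[k] + later.sum()) / (K - k)
    return fwt, nbt, auc
```

Aggregated scores are then simple averages of these per-task arrays, as reported in the LIBERO evaluations.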
4. Comparative Assessment of Lifelong Algorithms
Extensive benchmarking on LIBERO-OBJECT has revealed several empirical insights:
- Sequential fine-tuning can outperform established lifelong learning algorithms (EWC, Experience Replay, PackNet) in forward transfer, suggesting that aggressive regularization may unduly restrict beneficial plasticity (Liu et al., 2023).
- No single visual encoder is universally optimal. Vision Transformer policies yield better object-centric transfer, while ResNet-based encoders excel on procedural or motion-centric tasks.
- The choice of language encoder (BERT, CLIP, task-ID) exerts minimal influence—embeddings primarily serve as task differentiators under current paradigms.
- Task ordering has a pronounced effect, especially for dynamic architecture methods and experience replay schemes.
- Naive supervised pretraining might impair lifelong performance, contrary to its typical advantages in other domains.
A summary table of LIBERO-OBJECT's properties:

| Feature | Description |
|---|---|
| Purpose | Object-centric lifelong knowledge transfer |
| Tasks | 10, each with a unique object (train/test split possible) |
| Focus | Declarative object knowledge across manipulations |
| Data Provided | 50 human demonstrations per task, meshes, instructions |
| Metrics | FWT, NBT, AUC |
| Extensibility | Programmatic addition of new objects/tasks |
5. Advances from Multi-Modal and Policy Distillation
Recent methods such as M2Distill have demonstrated state-of-the-art performance on LIBERO-OBJECT by explicitly regulating distribution shifts in latent features and policy outputs across vision, language, and action modalities (Roy et al., 2024). The method augments the imitation learning objective with auxiliary loss terms:
- Latent feature distillation: Constrains the squared Euclidean drift of features for vision, language, joint, and gripper states between current and prior models, maintaining representation consistency.
- Policy distribution alignment: Enforces KL divergence minimization between old and new Gaussian Mixture Model (GMM) policy outputs, reducing behavioral drift. Combined, these objectives preserve task-relevant features and behaviors as new objects and skills are encountered.
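The two auxiliary losses above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the KL between GMM policies has no closed form, so it is approximated here by Monte Carlo sampling, and all function names are hypothetical.

```python
import numpy as np

def feature_distill_loss(feats_new: dict, feats_old: dict) -> float:
    """Squared-Euclidean drift of latent features (e.g. vision, language,
    joint, gripper) between the current and the previous model."""
    return sum(float(np.sum((feats_new[m] - feats_old[m]) ** 2))
               for m in feats_new)

def gmm_log_prob(x, weights, means, stds):
    """Log density of a diagonal-covariance GMM at points x of shape (N, D)."""
    comps = []
    for w, mu, s in zip(weights, means, stds):
        ll = -0.5 * np.sum(((x - mu) / s) ** 2 + np.log(2 * np.pi * s ** 2),
                           axis=1)
        comps.append(np.log(w) + ll)
    return np.logaddexp.reduce(np.stack(comps), axis=0)

def policy_kl_mc(new_gmm, old_gmm, samples) -> float:
    """Monte Carlo estimate of KL(new || old) over action samples
    drawn from the new policy."""
    return float(np.mean(gmm_log_prob(samples, *new_gmm)
                         - gmm_log_prob(samples, *old_gmm)))
```

Both terms would be added, suitably weighted, to the base imitation loss; the feature term anchors representations while the KL term anchors the action distribution itself.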
Empirical comparison on LIBERO-OBJECT yielded the following results:
| Method | FWT (↑) | NBT (↓) | AUC (↑) |
|---|---|---|---|
| Sequential | 0.62 (±.00) | 0.63 (±.02) | 0.30 (±.00) |
| EWC | 0.56 (±.03) | 0.69 (±.02) | 0.16 (±.02) |
| ER | 0.56 (±.01) | 0.24 (±.00) | 0.49 (±.01) |
| BUDS | 0.52 (±.02) | 0.21 (±.01) | 0.47 (±.01) |
| LOTUS | 0.74 (±.03) | 0.11 (±.01) | 0.65 (±.03) |
| M2Distill | 0.75 (±.03) | 0.08 (±.05) | 0.69 (±.04) |
These results indicate that multi-modal distillation significantly reduces forgetting (lowest NBT) and improves overall performance consistency (highest AUC).
6. Research Implications and Extensions
The LIBERO-OBJECT benchmark has become a reference for evaluating object-centric lifelong robotic learning algorithms. Its structure enables isolation of object knowledge transfer, offering granular analysis of how representations persist, transform, or degrade with incremental exposure to novel objects.
The procedural task generation pipeline ensures ongoing extensibility—new objects, manipulation variants, and sensor modalities can be integrated, supporting future exploration of compositional and generalization challenges.
The benchmark’s metrics and outcomes also highlight ongoing open issues: regularization and experience replay methods developed for standard continual learning may need further adaptation for the high-dimensional, multi-modal, and sequentially compositional tasks characteristic of real-world robotics.
7. Summary Table
| Aspect | Libero-Object Specification | Key Insights |
|---|---|---|
| Number of Tasks | 10, with possibility for expansion | Enables object-centric generalization studies |
| Evaluation Metrics | FWT, NBT, AUC | Quantifies transfer, forgetting, overall success |
| Data | Human demonstrations, meshes, language | Supports sample-efficient policy learning |
| Notable Methods | M2Distill, LOTUS, ER, EWC, PackNet | Multi-modal distillation yields best retention |
| Observed Challenges | Task-order sensitivity, harmful naive pretraining, encoder choice | No single architecture dominates |
| Extensibility | Procedurally generated; indefinitely expandable | Ongoing research utility |
8. Conclusion
The Libero-Object Benchmark provides a standardized, extensible framework for evaluating robotic agents’ ability to learn, generalize, and retain object-centered manipulation skills in a lifelong learning context. Its adoption has advanced the field’s understanding of knowledge transfer, memory, and the role of architecture and training strategy in sequential robot learning. The benchmark continues to serve as a foundational testbed for both incremental algorithmic progress and the identification of persistent challenges in embodied lifelong intelligence.