MetaWorld ML10 Benchmark
- MetaWorld ML10 Benchmark is a standardized meta-RL suite featuring ten diverse robotic manipulation tasks for testing rapid skill adaptation.
- It employs a unified 7-DoF Sawyer arm platform with consistent observation/action spaces and configurable dense reward formulations for reproducibility.
- The benchmark facilitates evaluation of few-shot adaptation, multi-task generalization, and hierarchical policy architectures with detailed performance metrics.
The MetaWorld ML10 benchmark is a standardized, multi-task and meta-reinforcement learning (meta-RL) suite tailored to empirical evaluation of rapid skill adaptation in robotic manipulation. ML10 encompasses a carefully curated set of ten diverse manipulation environments, sharing a common 7-DoF Sawyer arm platform, unified action/observation spaces, and configurable dense reward formulations. It is widely adopted in the study of few-shot adaptation, multi-task generalization, and hierarchical policy architectures. The recent Meta-World+ v3 release has established reproducibility and technical clarity across the evolving research landscape (McLean et al., 16 May 2025).
1. Task Suite Definition and Design Rationale
The ML10 set comprises the following ten manipulation tasks (all “-v3” variants):
- reach: move the end-effector to a target position
- push: push a puck to a goal
- pick-place: grasp and relocate an object
- door-open: pull open a hinged door
- door-close: close a hinged door
- drawer-open: pull open a drawer
- drawer-close: push closed a drawer
- button-press: depress a button
- lever-pull: pull a lever
- window-open: push a sliding window open
Each ML10 task is formulated as a Gymnasium-compliant Markov Decision Process (MDP) with the following properties:
- Observation space: a 39-dimensional vector comprising end-effector XYZ (3), gripper aperture (1), object 1 pose (XYZ + quaternion: 7), object 2 pose (if present, else zeros: 7), the previous step's copy of these 18 values (18), and the goal position (3); a slicing sketch appears at the end of this subsection.
- Action space: a 4-dimensional vector combining a 3D end-effector displacement with a gripper open/close command.
- Success criterion: dictated by task-specific thresholds (e.g., the object lying within a distance tolerance of the goal, a joint exceeding a threshold angle, or binary limit switches).
The selection criteria for ML10 emphasized (1) diversity of primitive skills (reaching, pushing, grasping, opening, closing), (2) non-parametric variation across task definitions, (3) compositional complexity, and (4) computational tractability (adaptation runs complete within hours on a single A100 GPU). This yields a testbed that balances algorithmic breadth with reproducible evaluation (McLean et al., 16 May 2025, Yu et al., 2019).
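The observation layout above can be made concrete with a small unpacking helper. The following is a minimal sketch assuming the current-state (18) + previous-state (18) + goal (3) ordering described in the list; the exact index order should be verified against the installed metaworld release, and the helper name is illustrative.

```python
import numpy as np

def split_ml10_observation(obs: np.ndarray) -> dict:
    """Unpack a flat 39-dimensional ML10 observation into named components.

    Assumes the current (18) + previous (18) + goal (3) ordering described above;
    verify the index order against your metaworld version.
    """
    assert obs.shape[-1] == 39, "expected a 39-dimensional ML10 observation"
    return {
        "hand_xyz":  obs[..., 0:3],    # end-effector position
        "gripper":   obs[..., 3:4],    # gripper aperture
        "obj1_pose": obs[..., 4:11],   # object 1 XYZ + quaternion
        "obj2_pose": obs[..., 11:18],  # object 2 XYZ + quaternion (zeros if absent)
        "prev_obs":  obs[..., 18:36],  # previous step's copy of the 18 values above
        "goal_xyz":  obs[..., 36:39],  # goal position
    }
```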
2. Environment Specification and Standardization
With the advent of Meta-World+ v3, the ML10 environments are standardized as follows:
- Reward Functionality:
Users select between the legacy V1 (hand-crafted sequential waypoint) and V2 (fuzzy-logic constraints aggregated via the Hamacher product) reward formulations. V2 is recommended for its consistent, task-agnostic scaling (rewards bounded in [0, 10], with task success yielding the maximum reward). A minimal sketch of the Hamacher aggregation appears after this list.
- Gymnasium API Compliance:
All environments utilize canonical reset/step methods; custom wrappers from earlier MetaWorld versions are deprecated.
- Episode Structure:
Each episode is capped at a fixed number of time-steps, with task-specific success flags provided through the info["success"] entry returned by the step API.
- Instantiation:
Environments are accessible via Gymnasium IDs “Meta-World/ML10-train” and “Meta-World/ML10-test”.
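As a concrete illustration of the V2 aggregation referenced in the reward bullet above, the Hamacher product is a t-norm on [0, 1] that smoothly combines fuzzy constraint-satisfaction terms. The sketch below mirrors that definition; the component names and values are purely illustrative and are not the benchmark's actual reward terms.

```python
def hamacher_product(a: float, b: float) -> float:
    """Hamacher t-norm: combines two fuzzy truth values a, b in [0, 1]."""
    assert 0.0 <= a <= 1.0 and 0.0 <= b <= 1.0
    denom = a + b - a * b
    return 0.0 if denom == 0.0 else (a * b) / denom

# Illustrative composition of two fuzzy constraints (names and values are hypothetical):
near_object = 0.8        # gripper-to-object proximity mapped into [0, 1]
object_near_goal = 0.5   # object-to-goal proximity mapped into [0, 1]
reward = 10.0 * hamacher_product(near_object, object_near_goal)  # scaled toward [0, 10]
```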
No observation normalization or reward clipping is applied by default. The reward version and random seed must be explicitly specified for reproducibility (McLean et al., 16 May 2025).
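A minimal end-to-end sketch of the instantiation and episode conventions above is shown below, assuming the Gymnasium IDs listed in this section; the keyword arguments (e.g., seed) and whether the ID returns a single or vectorized environment depend on the installed Meta-World+ release, so treat this as a sketch rather than a definitive recipe.

```python
import gymnasium as gym
import metaworld  # noqa: F401  -- importing registers the "Meta-World/..." Gymnasium IDs

# Keyword names below are assumptions; consult the Meta-World+ docs for exact signatures.
env = gym.make("Meta-World/ML10-train", seed=0)

obs, info = env.reset(seed=0)
for _ in range(5):
    action = env.action_space.sample()                 # random 4-dimensional action (Section 1)
    obs, reward, terminated, truncated, info = env.step(action)
    print(reward, info["success"])                     # reward and per-step success flag
    if terminated or truncated:
        obs, info = env.reset()
```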
3. Evaluation Protocols and Baseline Performance
The standard experimental protocol for ML10 includes:
- Seeding: Ten distinct seeds per full training run (Python, NumPy, JAX, Gymnasium).
- Episode Horizon: the benchmark's default per-episode cap (Section 2), overridable via configuration.
- Reporting Frequency: evaluation is performed at a fixed interval of environment steps.
- Metrics:
- Task-level success: the fraction of evaluation episodes for a given task that register info["success"] == 1.
- Benchmark-wide success: the mean of task-level success rates across all evaluation tasks.
- Aggregate: interquartile mean (IQM) of success rates with 95% confidence intervals.
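The aggregate metric can be computed with a short helper; libraries such as rliable provide the canonical stratified-bootstrap implementation, so the following is only a simplified sketch, and the success-rate values are placeholders rather than reported results.

```python
import numpy as np

def iqm(values: np.ndarray) -> float:
    """Interquartile mean: average of the values between the 25th and 75th percentiles."""
    q25, q75 = np.percentile(values, [25, 75])
    mid = values[(values >= q25) & (values <= q75)]
    return float(mid.mean())

def bootstrap_ci(values: np.ndarray, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the IQM (simplified, non-stratified)."""
    rng = np.random.default_rng(seed)
    stats = [iqm(rng.choice(values, size=len(values), replace=True)) for _ in range(n_boot)]
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Placeholder per-seed benchmark-wide success rates (one entry per seed).
success_rates = np.array([0.28, 0.31, 0.25, 0.33, 0.29, 0.30, 0.27, 0.35, 0.26, 0.32])
print(iqm(success_rates), bootstrap_ci(success_rates))
```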
Baseline results (after 1M environment steps, averaged over multiple seeds) for canonical meta-RL algorithms are:
| Algorithm | V1 Reward | V2 Reward |
|---|---|---|
| MAML | 27.8 ± 4.3% | 29.6 ± 3.9% |
| RL² | 14.2 ± 5.1% | 31.1 ± 4.7% |
Paired t-tests reveal no statistically significant difference between reward versions for MAML, nor between MAML-V2 and RL²-V2 (McLean et al., 16 May 2025). Other meta-learning configurations report lower meta-test performance and larger task-level variance (Atamuradov, 15 Nov 2025).
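The reward-version comparison corresponds to a paired t-test over per-seed success rates, as sketched below; the arrays are hypothetical placeholders, not the published per-seed results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed benchmark-wide success rates, paired by seed.
maml_v1 = np.array([0.25, 0.30, 0.27, 0.31, 0.24, 0.29, 0.28, 0.26, 0.32, 0.27])
maml_v2 = np.array([0.28, 0.31, 0.29, 0.33, 0.26, 0.30, 0.30, 0.27, 0.34, 0.28])

t_stat, p_value = stats.ttest_rel(maml_v1, maml_v2)  # paired t-test across seeds
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```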
4. Reproducibility, API Usage, and Reference Implementations
Meta-World+ provides a reproducible evaluation framework:
- Codebase: Versioned at tag v3.0.0 (Meta-World+ GitHub)
- Baselines: Complete algorithm implementations (commit 4a8e3f2), supporting MAML, RL², and other standard agents.
- Dependencies: mujoco-python-bindings v2.3.3, Gymnasium 0.30.0
- Evaluator Utilities:
```python
from metaworld.evaluation import metalearning_evaluation, MetaLearningAgent

agent = MyMetaAgent()  # user-defined agent, expected to follow the MetaLearningAgent interface
mean_succ, mean_ret, per_task = metalearning_evaluation(
    agent, test_envs, adaptation_episodes=10, eval_episodes=3)
```
- Reference Hardware: AMD EPYC 7402 (24 cores), NVIDIA A100 PCIe (40 GB), 512 GB RAM
Full scripts covering meta-training and evaluation are included in the official repository. Strict random seed control is enforced across JAX, Gymnasium, NumPy, and Python to ensure result fidelity (McLean et al., 16 May 2025).
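The seed-control discipline described above can be implemented with a small helper along the following lines; this is a sketch using the standard interfaces of each RNG source, and Gymnasium environments are seeded separately at reset.

```python
import random

import numpy as np
import jax

def seed_everything(seed: int):
    """Seed the Python, NumPy, and JAX RNG sources; Gymnasium envs are seeded via env.reset(seed=seed)."""
    random.seed(seed)
    np.random.seed(seed)
    return jax.random.PRNGKey(seed)  # JAX randomness is threaded explicitly through keys

key = seed_everything(0)
```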
5. Benchmark Limitations and Practical Considerations
Despite its widespread adoption, the ML10 benchmark exhibits notable challenges and pitfalls:
- Reward Version Sensitivity: The V1 vs V2 reward scaling can notably affect convergence and optimization dynamics. Proper documentation and reporting of reward versions are mandatory.
- Low Meta-RL Success Rates: Many published RL and meta-RL approaches plateau at around 30% success, reflecting the composite difficulty arising from task diversity and compositional requirements.
- Fixed Episode Horizon: The fixed per-episode cap may prematurely truncate complex sub-skill sequences, reducing success rates on longer-horizon tasks.
- Held-Out Generalization: Only 5 held-out test tasks are used for evaluation, which may not capture the full generalization landscape.
- Architectural Capacity Constraints: Models with inadequate network capacity cannot adapt across all ten tasks, particularly architectures tuned for narrow domains.
Researchers are advised to tune hyperparameters extensively—beyond common RL defaults—when approaching ML10, and reconfirm results with consistent random seeds and explicit reward reporting. Substantial variance across tasks and algorithms persists, underscoring the importance of multi-task robustness and careful methodological documentation (McLean et al., 16 May 2025, Atamuradov, 15 Nov 2025).
6. Impact and Evolution within the Meta-RL Community
The ML10 benchmark has catalyzed method development in meta-RL, preference-based learning, diffusion-model policies, and language-conditioned control. It serves as a proving ground for emergent techniques such as:
- Reward learning from preference feedback (as in VARP, which reports significant gains via trajectory-augmented vision-LLM supervision) (Singh et al., 18 Mar 2025).
- Hierarchical, diffusion-based policies (e.g., DAWN, which employs pixel-motion diffusion intermediates for interpretable, language-to-action control pipelines) (Nguyen et al., 26 Sep 2025).
- Dual-phase, prompt-conditioned generalist agents (e.g., DualMind, which demonstrates near-saturated zero-shot and few-shot transfer via large-scale self-supervised pretraining and prompt-conditioned imitation) (Wei et al., 2023).
The ML10 task suite’s design—non-redundant, compositionally challenging, and compute-efficient—continues to provide a stringent standard for few-shot adaptation and robust policy transfer in robot learning, setting empirically grounded baselines for state-of-the-art method validation (McLean et al., 16 May 2025).