- The paper demonstrates that inconsistent reward functions critically impact multi-task and meta-RL performance in robotic manipulation.
- It re-engineers Meta-World to expose both V1 and V2 reward functions behind a standardized interface and integrates the benchmark with the modern Gymnasium API, ensuring reproducibility.
- The benchmark supports custom task sets and flexible evaluation, promoting fair and robust comparisons in reinforcement learning research.
The paper "Meta-World+: An Improved, Standardized, RL Benchmark" (2505.11289) addresses critical issues of reproducibility and standardization in the Meta-World benchmark, a widely used platform for evaluating multi-task and meta-reinforcement learning (RL) algorithms in robotic manipulation. The core problem identified is that undocumented changes, particularly to the reward functions, have made it difficult to reliably compare results across different research papers that used varying versions of the benchmark.
The authors re-engineer Meta-World to provide a standardized, technically ergonomic, and reproducible version that integrates with the modern Gymnasium API. Key contributions include:
- Demonstrating Reproducibility Issues: Through empirical experiments, the paper shows that different reward function versions (referred to as V1 and V2) significantly impact algorithm performance, especially for multi-task RL methods. This highlights why comparing results from studies using different Meta-World versions is problematic.
- Standardized and Reproducible Benchmark: The new Meta-World+ explicitly includes both V1 and V2 reward functions via a standardized API, allowing researchers to select the version needed for comparison with specific prior works.
- Gymnasium Integration: The benchmark is updated to align with the latest Gymnasium API (Towers et al., 24 Jul 2024) and MuJoCo Python bindings [todorov2012mujoco], removing dependencies on older, unsupported packages. This streamlines environment creation and allows leveraging the broader Gymnasium ecosystem (Figure 1).
- Flexible Task Sets: In addition to the original MT10/50 and ML10/45 task sets, two new sets (MT25/ML25) are introduced to provide intermediate challenge levels and computational costs. Crucially, the benchmark now allows users to create custom task sets of arbitrary size and composition.
Understanding Meta-World Tasks and Setup
Meta-World consists of 50 distinct robotic manipulation tasks for a Sawyer robot arm. Tasks range from pushing and opening/closing objects to grasping and assembly. All tasks share a common observation and action space, although the relevant objects and goals vary.
- Action Space: A 4-tuple representing desired end-effector displacement (x, y, z) and gripper position.
- Observation Space: A 39-dimensional vector, consisting of the current observation concatenated with the previous observation and the goal location. The observation includes end-effector coordinates, gripper state, and the coordinates/orientations of up to two objects. In multi-task settings, the goal is visible in the observation, while in meta-RL it is zeroed out during meta-training (a decomposition sketch follows this list).
- Task Sets:
- Multi-Task (MT): Agents are trained and evaluated on the same set of tasks (e.g., MT10, MT50, new MT25). The objective is to maximize average success rate across the set.
- Meta-Learning (ML): Tasks are split into training and testing sets (e.g., ML10 train/test, ML45 train/test, new ML25 train/test). Agents train on the training tasks and are evaluated on their ability to quickly adapt to unseen test tasks. The objective is the mean success rate on the test tasks after adaptation.
- Evaluation Metric: Mean success rate across all evaluation tasks.
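To make the observation layout concrete, the sketch below splits a 39-dimensional observation into its nominal components. The index boundaries (an 18-dimensional current frame, an 18-dimensional previous frame, and a 3-dimensional goal) are assumptions inferred from the description above, not taken from the library's documentation.

```python
import numpy as np

def split_observation(obs: np.ndarray) -> dict:
    """Split a 39-dim Meta-World observation into nominal components.

    Index boundaries are assumptions based on the benchmark description:
    18-dim current frame + 18-dim previous frame + 3-dim goal.
    """
    assert obs.shape == (39,)

    def frame(x: np.ndarray) -> dict:
        return {
            "end_effector_xyz": x[0:3],   # gripper position in the workspace
            "gripper_state": x[3:4],      # open/close measure
            "object_1_pose": x[4:11],     # xyz position + quaternion
            "object_2_pose": x[11:18],    # xyz position + quaternion (zeros if the task has one object)
        }

    return {
        "current": frame(obs[0:18]),
        "previous": frame(obs[18:36]),
        "goal_xyz": obs[36:39],           # zeroed out during meta-training in ML settings
    }
```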
Impact of Reward Functions (V1 vs V2)
The paper empirically evaluates several multi-task (MTMHSAC, SM, MOORE, PaCo, PCGrad) and meta-RL (MAML, RL2) algorithms on MT10/50 and ML10/45 using both V1 and V2 rewards.
- V1 Rewards: The original dense rewards, derived from a pick-place reward structure; they are often poorly tuned for the other tasks and have widely varying scales.
- V2 Rewards: Newer dense rewards designed with "fuzzy logic" to have more consistent scales across tasks (rewards typically between 0 and 10) and to be easier to optimize with PPO.
The empirical results demonstrate:
- Multi-task RL: Algorithms like MTMHSAC, SM, PaCo, and MOORE achieve significantly higher success rates on MT10 and MT50 when trained with V2 rewards compared to V1 (Figure 2). The authors attribute this to the V2 rewards' more consistent scaling, which improves the Q-function's ability to model state-action values across different tasks, echoing findings on the importance of normalized returns in multi-task learning [popart].
- Meta-RL: MAML and RL2, which use policy gradient methods, show less dramatic differences between V1 and V2; the exception is RL2 with V1, which performs poorly, potentially because the raw, unnormalized V1 rewards are included in its observation (Figure 3). The modest overall performance suggests that the diverse, non-parametric task distribution of ML10/45 remains challenging for current meta-RL methods, highlighting a need for benchmarks that bridge the gap between parametric variation (ML1) and task diversity (ML10/45).
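To make the reward-scaling point from the multi-task results concrete, the sketch below shows one common way to keep per-task value targets on comparable scales: normalizing returns with running per-task statistics, in the spirit of PopArt [popart]. It is an illustrative sketch of the general technique, with assumed class and method names, not the authors' implementation.

```python
import numpy as np

class PerTaskReturnNormalizer:
    """Illustrative running-statistics normalizer with one set of stats per task.

    Keeps value-learning targets on comparable scales across tasks, in the spirit
    of PopArt-style return normalization; a simplified sketch, not the exact
    PopArt update (which also rescales the value head's weights).
    """
    def __init__(self, num_tasks: int, eps: float = 1e-8):
        self.count = np.zeros(num_tasks)
        self.mean = np.zeros(num_tasks)
        self.m2 = np.zeros(num_tasks)  # sum of squared deviations (Welford's algorithm)
        self.eps = eps

    def update(self, task_id: int, ret: float) -> None:
        self.count[task_id] += 1
        delta = ret - self.mean[task_id]
        self.mean[task_id] += delta / self.count[task_id]
        self.m2[task_id] += delta * (ret - self.mean[task_id])

    def normalize(self, task_id: int, ret: float) -> float:
        var = self.m2[task_id] / max(self.count[task_id], 1)
        return (ret - self.mean[task_id]) / np.sqrt(var + self.eps)
```

Under V1 rewards, whose scales vary widely across tasks, this kind of normalization matters most; V2's more consistent scaling addresses the same issue at the reward-design level.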
Practical Implications and Implementation
The Meta-World+ release is designed for practical use in research:
- Easy Environment Creation: Using the Gymnasium `gym.make` interface, users can easily instantiate standard or custom Meta-World environments.
```python
import gymnasium as gym
import metaworld  # Registers Meta-World envs with Gymnasium

# Create a MT10 environment using the default V2 rewards
envs_mt10 = gym.make('Meta-World/MT10-v2', vector_strategy='sync', seed=42)

# Create a MT50 environment using V1 rewards
envs_mt50_v1 = gym.make('Meta-World/MT50-v1', vector_strategy='sync', seed=42)

# Create a custom MT benchmark with specific tasks and V2 rewards
custom_tasks = ['pick-place-v2', 'door-open-v2', 'drawer-close-v2']
envs_custom_mt = gym.make('Meta-World/MT-custom-v2', env_names=custom_tasks, vector_strategy='sync', seed=42)

# Create ML10 train/test environments using V2 rewards
train_envs_ml10 = gym.make('Meta-World/ML10-train-v2', vector_strategy='sync', seed=42)
test_envs_ml10 = gym.make('Meta-World/ML10-test-v2', vector_strategy='sync', seed=42)
```
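Once created, these vectorized environments can be stepped through the standard Gymnasium vector API. The snippet below is a minimal interaction sketch under that assumption (random actions, synchronous vectorization); it is not taken from the Meta-World+ documentation.

```python
# Minimal interaction sketch, assuming the standard Gymnasium vector API
obs, infos = envs_mt10.reset(seed=42)
for _ in range(200):
    # Sample one random action per parallel environment
    actions = envs_mt10.action_space.sample()
    obs, rewards, terminations, truncations, infos = envs_mt10.step(actions)
    # Gymnasium vector environments auto-reset finished sub-environments
```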
- Custom Task Sets: The `MT-custom` and `ML-custom` interfaces allow researchers to define specific subsets of tasks. This is invaluable for controlled experiments studying task similarity, transfer, or specific failure modes.
```python
# Example of creating a custom ML benchmark train/test
custom_ml_tasks = ['reach-v2', 'push-v2', 'pick-place-v2']
train_envs_custom_ml = gym.make('Meta-World/ML-custom-train-v2', env_names=custom_ml_tasks, vector_strategy='sync', seed=42)
test_envs_custom_ml = gym.make('Meta-World/ML-custom-test-v2', env_names=custom_ml_tasks, vector_strategy='sync', seed=42)  # Note: same tasks for custom ML
```
- Evaluation Utilities: The library provides helper functions for standardized evaluation in both multi-task and meta-learning settings (Figure 4).
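The library's own helpers are the recommended path; as a rough illustration of what standardized multi-task evaluation computes, the sketch below rolls out a policy for a fixed number of episodes per task and reports the fraction of episodes that ever report success. The function name, the `policy.act` method, and the `"success"` info key are hypothetical stand-ins for illustration, not the Meta-World+ helper API.

```python
def evaluate_success_rate(env, policy, num_episodes: int = 50) -> float:
    """Illustrative evaluation loop: fraction of episodes that ever report success.

    Assumes a single (non-vectorized) Gymnasium env whose info dict exposes a
    'success' flag, and a policy object with an `act(obs)` method; both are
    hypothetical stand-ins, not the Meta-World+ helper API.
    """
    successes = 0
    for _ in range(num_episodes):
        obs, info = env.reset()
        done, solved = False, False
        while not done:
            action = policy.act(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            solved = solved or bool(info.get("success", 0))
            done = terminated or truncated
        successes += int(solved)
    return successes / num_episodes

# The benchmark's evaluation metric would then be the average of
# evaluate_success_rate over the per-task evaluation environments.
```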
- Computational Efficiency: The introduction of MT25/ML25 provides a middle ground in computational cost (e.g., MT25 training ∼12 hours vs. MT50 ∼25 hours on specified hardware), enabling faster experimentation cycles before scaling to the full benchmarks.
- Need for Re-running Baselines: The paper strongly recommends researchers re-implement and run baseline algorithms on the standardized benchmark version they are using, rather than quoting potentially incomparable results from older publications. This ensures fair comparisons.
Limitations and Future Directions
While Meta-World+ standardizes the benchmark and improves usability, the empirical results reveal ongoing challenges:
- Scaling multi-task performance to more tasks (MT50) remains difficult, potentially due to network capacity issues as suggested by the MTMHSAC performance drop on MT25 vs MT10 (Figure 8a). This aligns with findings that increasing parameters can help mitigate plasticity loss in multi-task settings [mclean2025multitask].
- Meta-RL performance on task distributions like ML10/45 is modest, indicating significant room for improvement in generalization to truly novel tasks.
Future benchmark development should focus on:
- Greater task diversity and compositional complexity (e.g., tasks requiring sequences of skills).
- Benchmarks that interpolate between parametric variations (like ML1) and larger task distributions (ML10/45).
- Exploring transfer across different robot morphologies [bohlinger2024one] or leveraging structured representations and world models [DBLP:journals/corr/abs-2301-04104, hansen2024tdmpc].
In summary, Meta-World+ is a crucial update that enhances reproducibility and usability of a key RL benchmark. It provides a stable foundation for future research in multi-task and meta-RL while empirically highlighting the importance of benchmark standardization, consistent reward design, and the need for further algorithmic advances to tackle complex, diverse task distributions.