ManiSkill2: Benchmark for Robotic Manipulation

Updated 20 April 2026

ManiSkill2 is a unified benchmark and simulation platform that enables research on generalizable manipulation skills by offering diverse, physics-rich tasks.
It integrates GPU-accelerated rigid and soft-body simulations with over 2,000 unique object models and millions of demonstration frames for robust policy evaluation.
The platform supports multi-task evaluation, rapid motor adaptation, and sim2real transfer, making it a pivotal tool for advancing embodied AI research.

ManiSkill2 is a unified benchmark, simulation platform, and large-scale dataset designed for the development and evaluation of generalizable manipulation skills in embodied AI. By systematically addressing limitations in prior simulation benchmarks (inadequate object variation, insufficient soft-body and contact-rich dynamics, and weak multi-task coverage), ManiSkill2 has become a central testbed for rigid- and soft-body manipulation, complex control, transfer learning, and policy generalization research. The benchmark couples GPU-accelerated, fully dynamic simulation with an open, extensible suite of environments and competitive baselines, supporting algorithmic work across reinforcement learning (RL), imitation learning (IL), task and motion planning (TAMP) and Sense–Plan–Act pipelines, and multi-modal policy paradigms (Gu et al., 2023).

1. System Overview and Motivation

ManiSkill2 was developed to overcome three core obstacles in generalizable manipulation: (i) narrow topological/geometric object coverage, (ii) lack of physically accurate and contact-rich simulation for both rigid and soft objects, and (iii) single-task, limited-paradigm evaluation. Its key features are:

20 manipulation task families spanning rigid/soft bodies, stationary/mobile bases, and single/dual robot arms.
Over 2,000 unique object models sourced from YCB, EGAD, and custom procedural generation; more than 4 million demonstration frames.
Full-physics simulation via SAPIEN for rigid bodies and Warp-based MLS-MPM for soft materials, supporting two-way coupling.
OpenAI Gym-compatible API, including privileged state/RGBD/point cloud observation modes and flexible controllers (joint and SE(3) end-effector, both delta and absolute).
High-throughput, decoupled rendering supporting up to 2,000 FPS for visual RL on standard workstation hardware.
Open-source codebase and cloud evaluation server for reproducibility and competition (Gu et al., 2023).

These design decisions let ManiSkill2 serve as a general-purpose substrate for embodied-AI research, with realistic variation and domain diversity not present in earlier simulators.
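To make the delta versus absolute controller distinction concrete, here is a minimal Python sketch. It is not ManiSkill2's own conversion utility: the function names are hypothetical, and it assumes each delta is applied to the previous target rather than to the measured joint state.

```python
from typing import List, Sequence


def delta_to_absolute(
    start_qpos: Sequence[float], delta_actions: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Accumulate per-step joint deltas into absolute joint targets.

    Assumption: every delta is relative to the previous *target*.
    """
    current = list(start_qpos)
    targets = []
    for delta in delta_actions:
        if len(delta) != len(current):
            raise ValueError("delta and joint state must have the same length")
        current = [q + d for q, d in zip(current, delta)]
        targets.append(list(current))
    return targets


def absolute_to_delta(
    start_qpos: Sequence[float], absolute_targets: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Inverse conversion: difference consecutive absolute targets."""
    previous = list(start_qpos)
    deltas = []
    for target in absolute_targets:
        if len(target) != len(previous):
            raise ValueError("target and joint state must have the same length")
        deltas.append([t - p for t, p in zip(target, previous)])
        previous = list(target)
    return deltas
```

Under this assumption the two functions are inverses of each other, which is the property that allows a demonstration recorded in one action space to be replayed in the other.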

2. Task Taxonomy and Environment Design

ManiSkill2 tasks are structured along axes of object type (rigid/soft), robot base (fixed/mobile), and robot morphology (single/dual-arm). Examples include:

Soft-Body Manipulation: Fill (clay ↦ beaker), Hang (noodle ↦ rod), Excavate (scoop clay), Pour (liquid ↦ beaker), Pinch (deform plasticine to target shape), Write (draw character on clay).
Precise Assembly: PegInsertionSide (peg-in-hole), PlugCharger, AssemblingKits (shape-slot insertion).
Pick-and-Place/Stack: PickCube, PickSingleYCB/EGAD, PickClutterYCB, StackCube.
Articulated/Object Manipulation: OpenCabinetDoor/Drawer, TurnFaucet, PushChair, MoveBucket, AvoidObstacles.

Success criteria are defined per task via explicit geometric, dynamic, and/or semantic predicates. For example, PickCube requires $\|p_{\text{cube}}-p_{\text{goal}}\| < 2.5$ cm while the arm is static; the soft-body Fill task requires $m_{\text{in\_beaker}} > 0.9\,m_{\text{full}}$ and near-zero velocity (Gu et al., 2023).
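The two predicates quoted above can be written as small boolean checks. The following Python sketch is illustrative only: the function names, argument layout, and the velocity thresholds (`joint_vel_thresh`, `speed_thresh`) are assumptions, while the 2.5 cm distance and the 0.9 mass fraction come from the text.

```python
import math
from typing import Sequence


def pick_cube_success(
    cube_pos: Sequence[float],
    goal_pos: Sequence[float],
    joint_velocities: Sequence[float],
    dist_thresh_m: float = 0.025,   # 2.5 cm, as stated in the text
    joint_vel_thresh: float = 0.2,  # assumed threshold for "arm static"
) -> bool:
    """True when the cube is within 2.5 cm of the goal and the arm is static."""
    close_enough = math.dist(cube_pos, goal_pos) < dist_thresh_m
    arm_static = all(abs(v) <= joint_vel_thresh for v in joint_velocities)
    return close_enough and arm_static


def fill_success(
    mass_in_beaker: float,
    mass_full: float,
    max_particle_speed: float,
    speed_thresh: float = 0.05,     # assumed threshold for "near-zero velocity"
) -> bool:
    """True when over 90% of the material is in the beaker and has settled."""
    return mass_in_beaker > 0.9 * mass_full and max_particle_speed < speed_thresh
```

Keeping each criterion as a conjunction of named conditions makes it straightforward to report which condition failed in an unsuccessful episode.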

Each environment exposes a range of observation types (state, RGBD, fused point cloud), simulates sensor noise, and randomizes object geometry, color/texture, dynamics, and scene layout to foster robust policy generalization.
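As a rough illustration of per-episode randomization and sensor noise, the sketch below samples an episode configuration from a seeded generator and perturbs depth readings with Gaussian noise. All names, fields, and ranges here are hypothetical and are not taken from ManiSkill2; the point is only that seeding makes each randomized episode reproducible.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EpisodeConfig:
    object_id: str                          # which object model to load
    friction: float                         # dynamics randomization
    color_rgb: Tuple[float, float, float]   # visual randomization
    object_xy: Tuple[float, float]          # layout randomization (metres)
    depth_noise_std_m: float                # sensor-noise level (metres)


def sample_episode_config(object_ids: Sequence[str], seed: int) -> EpisodeConfig:
    """Draw one randomized episode configuration; ranges are illustrative."""
    rng = random.Random(seed)
    return EpisodeConfig(
        object_id=rng.choice(list(object_ids)),
        friction=rng.uniform(0.5, 1.0),
        color_rgb=(rng.random(), rng.random(), rng.random()),
        object_xy=(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
        depth_noise_std_m=rng.uniform(0.0, 0.005),
    )


def add_depth_noise(depth_m: Sequence[float], noise_std_m: float, seed: int) -> List[float]:
    """Add zero-mean Gaussian noise to depth values, clamping at zero."""
    rng = random.Random(seed)
    return [max(0.0, d + rng.gauss(0.0, noise_std_m)) for d in depth_m]
```

Passing only training-split object identifiers during training and held-out identifiers during evaluation is one simple way to keep the two sets disjoint.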

3. Dataset, Demonstrations, and Benchmarks

The demonstration dataset includes:

Over 0.5 million expert trajectories generated via Task & Motion Planning (TAMP), model-predictive control (MPC), or RL, with automatic action-space conversion for flexible policy/classifier reuse.
Substantial object, pose, and visual domain randomization, with systematic train/test splits on “seen” and “held-out” objects and textures (Gu et al., 2023).

Baseline metrics such as SuccessRate, Return, and TimeToCompletion are computed over hundreds of evaluation episodes per task, strictly separating train/test objects to evaluate out-of-distribution (OOD) performance.
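A minimal sketch of how such metrics might be aggregated per split is shown below. The episode record layout and the function name are assumptions, and time to completion is interpreted here as the mean step count over successful episodes only, which may differ from the benchmark's exact definition.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_episodes(episodes: List[dict]) -> Dict[str, Dict[str, Optional[float]]]:
    """Aggregate success rate, mean return, and steps-to-success per split.

    Each episode is a dict with keys "split", "success", "return", "steps"
    (a hypothetical record layout used only for this sketch).
    """
    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for split in sorted({ep["split"] for ep in episodes}):
        group = [ep for ep in episodes if ep["split"] == split]
        successes = [ep for ep in group if ep["success"]]
        summary[split] = {
            "success_rate": len(successes) / len(group),
            "mean_return": mean([ep["return"] for ep in group]),
            # Undefined when no episode in the split succeeded.
            "mean_steps_to_success": (
                mean([ep["steps"] for ep in successes]) if successes else None
            ),
        }
    return summary
```

Reporting the seen and held-out splits side by side makes the OOD gap visible directly, rather than hiding it in a pooled average.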

Benchmarking covers several families:

Sense–Plan–Act (e.g., Contact-GraspNet + OMPL): 43.2% success on PickSingleYCB.
Imitation Learning (BC): Fails on precise rigid-body tasks but achieves 62% (Fill) and 35% (Hang) on select soft-body tasks.
RL with Demonstrations (DAPG+PPO): Point cloud input substantially outperforms RGBD for pick-place; controller and frame selection are critical for policy success (Gu et al., 2023).

4. Algorithmic Advances and Research Enabled

ManiSkill2 has stimulated progress in generalizable RL/IL and fast adaptation through methods such as:

Rapid Motor Adaptation (RMA²): During training, the policy is conditioned on an environment embedding that summarizes latent dynamics and object properties (mass, friction, shape, etc.); at deployment, an adapter network estimates this embedding from proprioceptive/action histories and wrist-mounted depth. This two-phase protocol yields superior sample efficiency and generalization to OOD object variation, as shown across PickPlace (YCB/EGAD), Peg Insertion, and Faucet Turning (Liang et al., 2023).
3DGS-based Scene Representation in RL: Query-based Generalizable 3D Gaussian Splatting pipelines (QGFS), combined with hierarchical semantic encoding, compute compact latent codes rich in geometry and semantics. These are used as scene representations for DAPG policies, showing improved learning speed and reliability in tasks such as OpenDrawer, PlugCharger, StackCube, and others (Wang et al., 2024).
Diffusion-based Policies: Diffusion Transformer Policy (DiT) for continuous action chunk prediction enables improved multi-step planning and robust closed-loop execution. DiT models achieve 65.8% average success rate over five ManiSkill2 pick-place tasks, outperforming discretized/MLP heads by 35 points (Hou et al., 2024).
Guided Self-Attention Behavior Cloning (GP2E): In soft-body manipulation (e.g., Fill, Hang, Excavate, Pour), GP2E’s integration of point cloud encoding with spatially guided attention and a two-stage noisy fine-tuning regime led to state-of-the-art success rates, outperforming all baseline competitors in the ManiSkill2 CVPR challenge (Li et al., 2024).
Two-Stage Fine-Tuning for Generalization: Across all tracks (rigid and soft), resuming fine-tuning from the best checkpoint with reduced batch/sample scales can inject additional gradient noise, preventing overfitting and boosting OOD performance by 3–8 points on held-out objects (Gao et al., 2023).
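The gradient-noise argument behind the two-stage fine-tuning item can be made explicit with a standard identity for minibatch estimates. This is a textbook result under an i.i.d. sampling assumption, not a derivation taken from the cited work. Let the per-sample gradients $g_i$ have mean $\bar g$ and covariance $\Sigma$; the minibatch gradient over $B$ samples then satisfies

$$
\hat g_B = \frac{1}{B}\sum_{i=1}^{B} g_i, \qquad
\mathbb{E}[\hat g_B] = \bar g, \qquad
\mathrm{Cov}(\hat g_B) = \frac{\Sigma}{B}.
$$

Halving the batch size therefore doubles the covariance of each update around the same expected direction, which is the sense in which a smaller batch adds noise during the second stage.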

5. Infrastructure, Evaluation Protocols, and Performance

The ManiSkill2 infrastructure is centered on:

Multi-process parallel simulation in SAPIEN, driven from Python with rendering decoupled onto a shared render server (gRPC), achieving >2,500 FPS for rigid tasks (64 envs, 16 CPU, 1 GPU).
MLS-MPM GPU-based soft-body simulation with fully coupled particle–rigid/robot interaction (80+ FPS across 16 envs).
Flexible controller/action space conversion utilities, detailed cloud evaluation protocols, automatic demonstrator pipelines, and per-task variation/randomization for robust model selection (Gu et al., 2023).

Open-source APIs, baseline policies, and Docker-based challenge evaluation ensure reproducibility and ease of adoption.

6. Extensions and Downstream Research

Recent extensions built atop ManiSkill2:

ClevrSkills: A curriculum and dataset for compositional visual-language reasoning, derived from ManiSkill2's environments and API, with oracle solvers, multi-camera vision, predicate-based dense rewards, and 330,000 trajectories spanning skill composition, sequencing, memory, and logic (Haresh et al., 2024).
Multi-modal, language-driven robot instruction, leveraging ManiSkill2’s textured/generalized scene layouts and observation modalities.
Sim2real transfer studies via systematic domain randomization, facilitated by ManiSkill2's configurable observation noise and rendering.

A shared feature is strict OOD evaluation (unseen objects/textures), supporting research on meta-learning, adaptation, transfer, and compositionality.

7. Notable Limitations and Future Prospects

Despite its coverage, ManiSkill2's most challenging tasks, such as high-precision assembly (PlugCharger), soft-body shape manipulation (Pinch, Write), and long-horizon composition, remain unsolved by pure IL or large-scale RL. Vision-language models and diffusion policies, even when pre-trained on massive data, have not achieved reliable compositional generalization without substantial fine-tuning (Haresh et al., 2024; Hou et al., 2024; Li et al., 2024).

A plausible implication is that continued progress will require integrating search/planning, richer multi-modal feedback (vision, force, language), and scalable cross-task learning methods. ManiSkill2’s structure and diversity make it a persistent and evolving testbed for these frontiers.