Robosuite Dataset: Robotic Manipulation Benchmarks
- The robosuite dataset is a multimodal collection built around nine standardized robotic manipulation tasks simulated with the MuJoCo physics engine.
- It provides reproducible, configurable streams of sensor, state, and action data for both scripted demonstrations and reinforcement learning experiments.
- It spans multiple data modalities, including low-dimensional proprioception and high-dimensional vision, enabling consistent performance benchmarking.
The robosuite dataset is a multimodal, on-demand collection protocol built around a standardized suite of nine robotic manipulation benchmarks. It is tightly integrated with the robosuite simulation and benchmarking framework, which operates atop the MuJoCo physics engine. Rather than shipping static demonstration files, robosuite establishes a reproducible, extensible environment specification—the “dataset” consists of (1) parametrizable benchmark tasks; (2) configurable streams of sensor, state, and action data from simulated environments; and (3) tools for episodic data recording, serialization, and benchmarking within a rigorous evaluation protocol (Zhu et al., 2020).
1. Benchmark Environments and Task Scope
robosuite v1.0 provides nine benchmark environments (“tasks”), divided into single-arm and two-arm settings. Each environment exposes a Gym-style API—reset(), step()—with environment initializations sampled stochastically by a placement_initializer, ensuring varied, collision-free configurations on each episode reset.
Single-Arm Tasks
- Block Lifting: 7-DoF arm (default Panda), single 0.05 m cube on tabletop; (x, y) cube position randomized in 0.1 m disk in front of gripper; success when the cube is lifted above a height threshold over the table surface.
- Block Stacking: Single arm, two identical cubes; randomized, non-colliding starting positions; success when cube A is stably placed on cube B.
- Pick-and-Place: Single arm, up to four objects and four receptacles; randomized object subsets and object-to-receptacle assignments; single-object variants provided.
- Nut Assembly: Single arm, two pegs and two nuts; randomized nut locations; goal is correct insertion on each peg; single-nut variants supported.
- Door Opening: Single arm, hinged door with cylindrical handle; random pose (translation plus yaw) of door; success when the door's opening angle exceeds a threshold.
- Table Wiping: Arm with eraser end-effector; whiteboard tabletop with randomized smear regions; must clear the marked area for success.
Two-Arm Tasks
- Two-Arm Lifting: Bimanual (Panda×2 or Sawyer×2); pot with two handles; random pot (x, y); arms start co-located or opposite; lift the pot above a height threshold while keeping its pitch and roll within a small tolerance.
- Two-Arm Peg-In-Hole: Bimanual; square-holed board (arm1) and peg (arm2); random end-effector poses; insert peg through board.
- Two-Arm Handover: Bimanual; hammer object, randomized size and location; arm1 grasps and passes hammer to arm2; success if hammer ends in gripper2.
All environments utilize per-task placement samplers for stochasticity and reproducibility.
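The effect of these placement samplers can be seen by resetting an environment repeatedly and inspecting the object state. The following is a minimal sketch, assuming the default Lift configuration and the standard object-state observation key; the leading entries of that vector encode the re-sampled cube position.

```python
import numpy as np
import robosuite as suite

# Minimal sketch: each reset() draws a fresh, collision-free object placement
# from the task's placement sampler (default Lift configuration assumed).
env = suite.make(
    "Lift",
    robots="Panda",
    use_camera_obs=False,
    has_renderer=False,
    has_offscreen_renderer=False,
)

for episode in range(3):
    obs = env.reset()
    # The object-state vector changes across resets because the cube's
    # (x, y) position is re-sampled by the placement initializer.
    print(f"episode {episode}: object-state head = {np.round(obs['object-state'][:3], 3)}")

env.close()
```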
| Task | Default Robot(s) | Object(s) | Start-State Variations |
|---|---|---|---|
| Block Lifting | Panda (7-DoF arm) | 0.05 m cube | (x, y) in 0.1 m disk |
| Block Stacking | Single arm | Two cubes | Non-colliding, randomized (x, y) |
| Pick-and-Place | Single arm | Up to 4 objects, 4 bins | Object/container assignment, subsets |
| Nut Assembly | Single arm | 2 pegs, 2 nuts | Nut positions randomized |
| Door Opening | Single arm | Hinged door | Door pose (translation+yaw) randomized |
| Table Wiping | Single arm, eraser | Surface “whiteboard” | Smear regions randomized |
| Two-Arm Lifting | Two arms | Pot with handles | Pot (x, y), arm starting pose |
| Two-Arm Peg-In-Hole | Two arms | Board w/ hole, peg | Random EE poses |
| Two-Arm Handover | Two arms | Hammer | Hammer size, pose randomized |
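As an illustration of the task scope summarized above, the sketch below instantiates one single-arm and one two-arm benchmark through the same suite.make() entry point; the particular task and robot choices are examples, and two-arm tasks accept a list of robot models (one per arm).

```python
import robosuite as suite

# Single-arm task: a single robot model string.
single_arm_env = suite.make(
    "NutAssembly",
    robots="Sawyer",
    use_camera_obs=False,
    has_renderer=False,
    has_offscreen_renderer=False,
)

# Two-arm task: one robot model per arm.
two_arm_env = suite.make(
    "TwoArmLift",
    robots=["Panda", "Panda"],
    use_camera_obs=False,
    has_renderer=False,
    has_offscreen_renderer=False,
)

# The two-arm environment exposes a correspondingly larger action space.
print(single_arm_env.action_dim, two_arm_env.action_dim)
```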
2. Data Modalities and Observational Structure
robosuite’s observation and sensor suite is modular and configurable at environment instantiation, supporting low-dimensional proprioception, rich vision, and extensible sensor channels.
Proprioceptive (Low-Dim) Modalities
- Joint positions
- Joint velocities
- Joint torques
- End-effector position $p_{\mathrm{ee}} \in \mathbb{R}^3$ and orientation as a quaternion $q_{\mathrm{ee}} \in \mathbb{R}^4$ or Euler angles
- Force/torque at wrist
- Pose of each object $i$: position $p_i \in \mathbb{R}^3$, orientation quaternion $q_i \in \mathbb{R}^4$
Vision
- RGB images: $H \times W \times 3$ arrays (one or more cameras)
- Depth: $H \times W$ depth maps aligned with each RGB camera
Other Channels
- Finger/contact pressures (where implemented)
- Custom MuJoCo sensor outputs
Observations in the low-dim setting (use_object_obs=True, use_camera_obs=False) are concatenated vectors of the proprioceptive and object-state features, e.g. $o_t = [\,q,\ \dot{q},\ p_{\mathrm{ee}},\ q_{\mathrm{ee}},\ p_{\mathrm{obj}},\ q_{\mathrm{obj}}\,]$.
Activating camera-based modes appends per-camera image entries, e.g. obs['agentview_image'] and obs['agentview_depth'] for the default agentview camera, to the observation dictionary.
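A short sketch of inspecting this observation dictionary follows; the Lift task and the robot0_proprio-state / object-state key names are typical of robosuite v1.x but should be verified against the installed version.

```python
import numpy as np
import robosuite as suite

# Minimal sketch: inspect the modalities returned in the observation dictionary
# for a low-dim configuration (no camera observations).
env = suite.make(
    "Lift",
    robots="Panda",
    use_object_obs=True,       # low-dim object pose features
    use_camera_obs=False,      # no image channels in this configuration
    has_renderer=False,
    has_offscreen_renderer=False,
)

obs = env.reset()
for key, value in obs.items():
    print(key, np.shape(value))
# Typical entries: robot0_proprio-state (proprioception) and object-state
# (object poses); with use_camera_obs=True, per-camera entries such as
# agentview_image / agentview_depth are appended.

# Concatenated low-dim vector as described above.
o_t = np.concatenate([obs["robot0_proprio-state"], obs["object-state"]])
print(o_t.shape)
```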
3. Data Generation and Collection Protocols
Data is generated dynamically—no static demonstration archives are provided. Two canonical sources are supported:
Scripted / Demonstration Data
- Human teleoperation using SpaceMouse or keyboard provides action commands.
- Data streams are recorded at control frequency (default 20 Hz).
- Data are saved by user code as `.npz` (NumPy) or HDF5 files.
Reinforcement Learning Data
- Off-policy methods (e.g., SAC) generate trajectory rollouts (e.g., 500-step episodes over 500 training epochs).
- Evaluation episodes are interleaved at set intervals, matching task horizons.
Reproducibility
- Setting random seeds (e.g., NumPy's global RNG, which the placement initializers draw from) before env.reset() yields consistent stochasticity.
- Benchmarks report performance as mean ± std over 5 seeds; all code and hyperparameters are versioned with the benchmark suite.
4. Data Storage, Access, and Serialization
robosuite emphasizes online data instantiation and flexible serialization:
- Environments are specified in `/robosuite/envs/`, assets in `/robosuite/models/`.
- Demonstration scripts (e.g., `/examples/record_demonstrations.py`) support recording human demonstrations to `.npz`.
- Benchmarking and data-logging scripts are found under `/benchmark/`.
- Data output formats: `np.savez()` (NumPy archives) or HDF5 (via `h5py`); an HDF5 sketch follows the example workflow below.
Example Workflow
```python
import numpy as np
import robosuite as suite
from robosuite.controllers import load_controller_config

# Operational-space pose controller (OSC_POSE), as recommended in Section 6.
controller_cfg = load_controller_config(default_controller="OSC_POSE")

env = suite.make(
    "PickPlace",
    robots="Panda",
    use_camera_obs=False,        # low-dim observations only
    has_renderer=False,
    has_offscreen_renderer=False,
    reward_shaping=True,         # dense shaped reward
    controller_configs=controller_cfg,
    control_freq=20,
    horizon=500,
)

np.random.seed(0)                # placement initializers draw from NumPy's RNG
obs = env.reset()

episode = {"obs": [], "act": [], "rew": []}
low, high = env.action_spec     # per-dimension action bounds
done = False
while not done:
    # Stand-in for a user policy: sample a random action within bounds.
    action = np.random.uniform(low, high)
    next_obs, reward, done, info = env.step(action)
    # Store the flattened low-dim observation (proprioception + object state).
    episode["obs"].append(
        np.concatenate([obs["robot0_proprio-state"], obs["object-state"]])
    )
    episode["act"].append(action)
    episode["rew"].append(reward)
    obs = next_obs

np.savez(
    "traj_seed0.npz",
    obs=np.array(episode["obs"]),
    act=np.array(episode["act"]),
    rew=np.array(episode["rew"]),
)
env.close()
```
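The episode dictionary assembled above could equally be serialized to HDF5, as noted under data output formats; the group and attribute names in this sketch are illustrative conventions, not a robosuite-defined schema.

```python
import h5py
import numpy as np

# Illustrative HDF5 serialization of the `episode` dictionary from the workflow
# above; dataset/group names are arbitrary conventions assumed here.
with h5py.File("traj_seed0.hdf5", "w") as f:
    grp = f.create_group("episode_0")
    grp.attrs["seed"] = 0
    grp.attrs["control_freq"] = 20
    grp.create_dataset("obs", data=np.array(episode["obs"]), compression="gzip")
    grp.create_dataset("act", data=np.array(episode["act"]), compression="gzip")
    grp.create_dataset("rew", data=np.array(episode["rew"]), compression="gzip")
```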
5. Evaluation Metrics and Benchmarking Protocol
Task-specific success criteria, diverse reward shaping, and standardized reporting ensure consistent benchmarking:
Success Criteria
- Boolean task-specific flags in `info['success']` and/or signaled via `done=True`
- Examples:
  - Block Lifting: the cube's height exceeds a threshold above the table surface
  - Block Stacking: cube A rests on cube B within position and orientation-error tolerances
  - Table Wiping: the marked (smeared) regions have been fully cleared
Reward Functions
robosuite supports both sparse and dense reward schemes, e.g., a sparse success indicator $r_t = \mathbb{1}[\text{success}]$ or a dense reaching term such as $r_t = 1 - \tanh(10\,d_t)$, where $d_t$ is the gripper-to-object distance.
Orientation penalties can use quaternion geodesics, e.g., $\theta = 2\arccos\!\left(\lvert \langle q_1, q_2 \rangle \rvert\right)$ between the current and target orientations.
Composite shaping may weight terms for position, orientation, and binary grasp success.
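A minimal sketch of such shaping terms is given below; the weights, the tanh scale, and the function names are illustrative choices, not robosuite's exact per-task reward implementations.

```python
import numpy as np

def reaching_reward(gripper_pos, object_pos):
    """Dense distance term 1 - tanh(10 * d), bounded in [0, 1]."""
    d = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(object_pos))
    return 1.0 - np.tanh(10.0 * d)

def quat_geodesic(q1, q2):
    """Geodesic angle between two unit quaternions."""
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def shaped_reward(gripper_pos, object_pos, q_ee, q_goal, grasped, success,
                  w_pos=0.5, w_ori=0.2, w_grasp=0.3):
    """Composite shaping: weighted position, orientation, and binary grasp
    terms, plus the sparse success bonus on top."""
    r = w_pos * reaching_reward(gripper_pos, object_pos)
    r += w_ori * (1.0 - quat_geodesic(q_ee, q_goal) / np.pi)
    r += w_grasp * float(grasped)
    return r + float(success)   # sparse indicator 1[success]
```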
Evaluation Protocols
- 20 evaluation episodes per checkpoint
- Task horizon: 500 steps (25 s at 20 Hz)
- Reported metrics: mean return ± std (5 seeds), success rate (% of successful episodes)
- Visualization: learning curves, bar charts for success rates
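The sketch below mirrors this protocol for the Lift task with a random stand-in policy; the use of the environment's internal _check_success() helper (rather than logged info flags) and the specific task choice are assumptions made for illustration.

```python
import numpy as np
import robosuite as suite

def evaluate(env, n_episodes=20):
    """Run evaluation episodes and return (mean return, success rate)."""
    returns, successes = [], []
    low, high = env.action_spec
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = np.random.uniform(low, high)   # replace with a trained policy
            obs, reward, done, info = env.step(action)
            ep_return += reward
        returns.append(ep_return)
        successes.append(float(env._check_success()))
    return np.mean(returns), np.mean(successes)

results = []
for seed in range(5):
    np.random.seed(seed)                            # one environment per seed
    env = suite.make("Lift", robots="Panda", use_camera_obs=False,
                     has_renderer=False, has_offscreen_renderer=False,
                     horizon=500, control_freq=20, reward_shaping=True)
    results.append(evaluate(env))
    env.close()

mean_returns = [r for r, _ in results]
success_rates = [s for _, s in results]
print(f"return: {np.mean(mean_returns):.1f} ± {np.std(mean_returns):.1f} (5 seeds)")
print(f"success rate: {100 * np.mean(success_rates):.0f}%")
```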
6. Usage Guidance and Reproducibility Best Practices
robosuite prescribes methodological practices for robust data collection and benchmarking:
- Use impedance-based OSC_POSE controllers for sample-efficient learning.
- Set random seeds (e.g., NumPy's global RNG, used by the placement initializers) before env.reset(); record these in logs and metadata.
- Prefer low-dim state for sim-to-real studies; enable use_camera_obs and has_offscreen_renderer for vision.
- Collect sparse rewards initially for exploration; introduce dense shaping incrementally.
- Align experiment horizons with baseline (≥500 steps).
- Record episodic data in compressed `.npz` or HDF5; maintain code/hyperparameter provenance (a provenance sketch follows this list).
- Typical workload: 20 Hz control, 12 GB RAM, 2 days for 5 seeds × 9 tasks × 500 epochs × 500 steps (no GPU required unless using vision-based policies).
- Community collaboration and updates are facilitated via the robosuite repository and robosuite.ai.
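A minimal provenance sketch matching the seeding and logging recommendations above is shown here; the metadata schema and file names are conventions assumed for illustration rather than part of robosuite.

```python
import json
import subprocess

# Illustrative provenance record stored alongside each trajectory archive.
try:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
except Exception:
    commit = "unknown"          # not running inside a git checkout

metadata = {
    "task": "PickPlace",
    "robots": "Panda",
    "controller": "OSC_POSE",
    "control_freq": 20,
    "horizon": 500,
    "seed": 0,
    "git_commit": commit,       # code version for reproducibility
}

with open("traj_seed0.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```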
The robosuite dataset and its collection protocol operationalize a reproducible, extensible standard for robotic learning research, allowing researchers to instantiate, instrument, and analyze a wide array of manipulation tasks within a unified framework (Zhu et al., 2020).