ManipulaTHOR: Visual Manipulation Simulator

Updated 26 February 2026

ManipulaTHOR is a simulation framework for embodied visual object manipulation that integrates a high-DOF arm and photorealistic kitchen environments.
It provides rich multimodal sensing and robust physics, enabling research on 3D obstacle avoidance, grasp planning, and disturbance minimization.
The framework formalizes the ArmPointNav task as an MDP/POMDP, offering a scalable testbed for end-to-end reinforcement learning and sim-to-real transfer.

ManipulaTHOR is a simulation framework for embodied visual object manipulation that extends the AI2-THOR platform with a high-DOF manipulatory arm and robust physics for complex, long-horizon mobile manipulation tasks in photorealistic 3D kitchen environments. It serves as a testbed for addressing the substantial challenges of end-to-end learning in mobile manipulation, including 3D obstacle avoidance, grasp planning, manipulation under occlusion, and generalization to unseen objects and scenes (Ehsani et al., 2021, Ni et al., 2021).

1. Environment and System Architecture

ManipulaTHOR augments the AI2-THOR platform by integrating a mobile agent equipped with a three-joint arm (shoulder, elbow, wrist) and a 6-DOF spherical grasper, operating in a hemisphere of 0.6335 m radius resembling Kinova JACO robotic arms. Physics and collision detection are powered by Unity's PhysX, allowing full scene dynamics with rigid-body interactions, arm–object contacts, and cascading object forces. The simulator achieves rates of ~300 fps on multi-GPU systems, given the additional cost of contact-rich simulation relative to navigation-focused environments (Ehsani et al., 2021, Ni et al., 2021).

The platform exposes rich multimodal sensing: egocentric RGB and/or depth images, proprioceptive arm state, grasper status, and global navigation signals (GPS, compass). Object grasping is abstracted as a collision sphere intersection, eliminating low-level hand kinematics and emphasizing arm-space planning. Manipulation tasks are set in 29–30 kitchen scenes, each populated with ≳150 object instances and articulated receptacles, yielding high scene complexity (Ehsani et al., 2021, Ni et al., 2021).

2. Formal Task Definition: ArmPointNav

ManipulaTHOR formalizes visual mobile manipulation through the ArmPointNav task, cast as a Markov Decision Process (MDP) or, under partial observability, a POMDP. Each episode specifies an initial agent state, a target object category and location, and a goal pose for object placement. The hidden system state $s_t$ encompasses the mobile base's 3D pose, arm configuration, target and obstacle object poses, and full observable sensory data. Observations $o_t$ include 224×224 depth (or RGB) images and a 3-vector for relative target goal coordinates; object pose ground truth is never directly provided to the agent (Ni et al., 2021).

The canonical discrete action space $\mathcal{A}$ comprises mobile base controls (MoveAheadContinuous, RotateLeft/Right), arm joint increments in $x$ , $y$ , and height, wrist joint rotation (pitch, yaw), camera tilt (LookUp/Down), gripper operations (PickUp/PutDown), and explicit episode termination (Done), summing to 21 actions. The "large" action set, which incorporates camera and wrist rotations omitted from initial baselines, is found critical for disturbance minimization (Ni et al., 2021).

The shaped ArmPointNav reward is:

$r_t = R_\text{success} \cdot \mathbb{1}\{\text{object at goal} \wedge \text{DONE}\} + R_\text{pickup} \cdot \mathbb{1}\{\text{pickup}\} + \Delta d_\text{arm}^o + \Delta d_o^\text{goal}$

where $d_\text{arm}^o = \|p_\text{wrist} - p_\text{object}\|_2$ and $d_o^\text{goal} = \|p_\text{object} - p_\text{goal}\|_2$ . The success criterion is satisfied if the target object, at Done, lies within $\epsilon\approx 5$ cm of goal and no unacceptable collisions have occurred (Ehsani et al., 2021).

3. Disturbance Penalization and Curriculum

Addressing the critical real-world constraint of minimizing disturbance to non-target objects, ManipulaTHOR introduces a disturbance-penalized reward:

$r_t' = r_t - \lambda_\text{disturb} \cdot (D_t - D_{t-1})$

with the disturbance term $D_t = \sum_{i\neq\text{target}} \|pos_i^t - pos_i^0\|_2$ , penalizing cumulative displacement of non-target objects (Ni et al., 2021). Direct optimization under this reward can lead to degenerate optima—e.g., agents that immediately terminate after grasping, to avoid incurring any penalty.

To overcome such local minima, a two-stage curriculum is used:

Stage I: Exploration (20M frames), optimizing the vanilla reward ( $\lambda=0$ ) to bootstrap visuomotor competence in navigation and manipulation.
Stage II: Refinement (+25M frames), activating the disturbance penalty ( $\lambda\approx 15$ ) and fine-tuning from the Stage I policy to learn disturbance-avoiding behaviors (Ni et al., 2021).

This staged approach avoids premature convergence to degenerate policies that exploit the disturbance penalty while never completing the primary task.

4. Agent Architecture and Auxiliary Tasks

The learning agent employs a depth-based ResNet-18 encoder (group norm) generating a 512-dimensional latent $z_t$ , concatenated with polar goal coordinates and previous action, and followed by a GRU(512). Actor–critic heads, conditioned on the recurrent state, output the action distribution $\pi(b_t)$ and the value function $V(b_t)$ . Policies are optimized with PPO ( $\gamma=0.99$ , $T=200$ steps per episode) (Ni et al., 2021).

To accelerate convergence and encode penalty structure, an auxiliary disturbance-prediction head is trained to predict, given current belief $b_t$ and proposed action $a_t$ , the likelihood of causing disturbance ( $\geq \tau=1$ mm) in the next step:

$\hat{c}_{t+1} = \text{DisturbMLP}(b_t, a_t) \in [0,1]$

Ground truth is a binary indicator of disturbance. As the class is heavily imbalanced ( $\sim$ 90% of actions are non-disturbing), Focal Loss is used, yielding a total per-step loss:

$\mathcal{L} = \mathcal{L}_\text{PPO} + \beta \mathcal{L}_\text{focal}, \quad (\beta=0.1)$

The auxiliary head significantly improves data efficiency and final performance under the disturbance-penalized metric (Ni et al., 2021).

5. Evaluation Protocol and Results

Evaluation uses 5 held-out test scenes with novel objects, reporting:

SR (success rate): proportion of episodes with successful pick, transport, and place of the target
SR $_\text{woD}$ : fraction of episodes with both success and $D_T \leq 0.01$ m (disturbance-free)
PuSR (pick-up success rate)
Episode length statistics (Ehsani et al., 2021, Ni et al., 2021)

Empirical results demonstrate:

Improved baselines with ResNet-18 and enhanced action space yield SR $\approx$ 74%, SR $_\text{woD}\approx$ 18%. Adding the disturbance auxiliary head increases SR to 78% at 20M frames.
Directly training with disturbance penalty from scratch leads to poor local optima (SR $\approx$ 18%, SR $_\text{woD}\approx$ 10%).
The two-stage curriculum with disturbance-prediction head achieves SR $\approx$ 81.3%, SR $_\text{woD}\approx$ 47.1%, improving disturbance-free success over the best $\lambda=0$ baseline by $\approx$ 10 percentage points and outperforming PPO-Lagrangian constrained RL by $\approx$ 30 percentage points.
Enlarging the action space (including LookDown, wrist rotation) provides a further $\approx$ 4 percentage point gain under the disturbance-penalized objective.
Varying $\lambda$ shows monotonic SR $_\text{woD}$ improvement, with $\lambda\approx 15$ best balancing task completion and disturbance minimization (Ni et al., 2021).

6. Empirical Analysis, Limitations, and Future Directions

ManipulaTHOR enables observation of complex agent behaviors such as sweeping arm trajectories around occluders and delicate object placements. Disjoint two-stage policies (pick + place separate) fail to match end-to-end recurrent policies, and depth-only sensing outperforms RGB or RGBD configurations in disturbance-free manipulation (Ehsani et al., 2021).

Limitations include challenges in fine-grained 3D collision avoidance, failures due to visual occlusion during arm motion, and over-rotation or granularity-induced placement errors. The substantial performance gap between seen and novel objects (SR $_\text{woD}$ : 39.4% vs 32.7% in (Ehsani et al., 2021)) highlights open problems in generalization.

Prospective improvements suggested include continuous-action control, learned point-cloud perception, grasper models with articulated jaws, hierarchical RL or curricula for multi-object rearrangement, and sim-to-real transfer via domain adaptation or real robot benchmarking (Ehsani et al., 2021).

7. Significance and Broader Context

ManipulaTHOR uncovers core embodied AI challenges absent in classical navigation: embodied 3D manipulation, rich multiphysics interaction, occlusion management, and complex reward shaping for disturbance minimization. The ArmPointNav paradigm forms a bridge between navigation and dexterous robotic manipulation, offering a scalable experimental platform for reinforcement learning, representation learning, and safe RL advances in high-fidelity environments (Ehsani et al., 2021, Ni et al., 2021).

Markdown Report Issue Upgrade to Chat

References (2)

ManipulaTHOR: A Framework for Visual Object Manipulation (2021)

Towards Disturbance-Free Visual Mobile Manipulation (2021)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ManipulaTHOR.