
DualTHOR: Dual-Arm Robotics Platform

Updated 26 February 2026
  • DualTHOR is a high-fidelity simulation platform that enables physically realistic dual-arm manipulation for embodied AI research.
  • It integrates Unity-based physics with a closed-loop Python–Unity control system to support continuous motion and stochastic contingency modeling.
  • The platform benchmarks language and vision-language planners in household tasks using realistic robot models like Unitree H1 and Agibot X1.

DualTHOR is a physics-based simulation platform for dual-arm humanoid robots, designed to facilitate research in embodied AI, high-level multimodal planning, and contingency-aware control. Built as an extension of AI2-THOR within the Unity engine, DualTHOR introduces physically realistic bimanual interaction, stochastic action outcomes, and a holistic pipeline for benchmarking language and vision-language planners in long-horizon, household-scale scenarios. The platform supports integration of real-world robot morphologies, systematic uncertainty modeling, and stateful API access, offering a reproducible, extensible, and high-fidelity environment for dual-arm reasoning and robust task execution (Li et al., 9 Oct 2025, Li et al., 19 Jun 2025).

1. System Architecture and Platform Design

DualTHOR extends the AI2-THOR scene graph and Unity’s rigid-body physics to model dual-arm humanoids, replacing the original single-arm, wheeled agent with physically accurate robot models such as Unitree H1 and Agibot X1. The system architecture consists of three key layers:

  • Unity-Based Physics and Sensing: Incorporation of CAD-derived robot meshes, URDF-specified joint constraints, local collision primitives, and articulated kinematic chains. ArticulationBody components and mesh colliders allow continuous, physically plausible joint transitions and real-world contact interactions.
  • Closed-loop Python–Unity Control: High-level semantic actions (e.g., MoveTo, Pick, Place, Open) are issued over a Python JSON/TCP client interface. Each command triggers Unity to call an external IK solver service (OmniManip via gRPC or HTTP) to compute continuous joint trajectories, which are then smoothly interpolated and simulated in physics (Li et al., 19 Jun 2025).
  • Contingency and History Modules: The contingency mechanism samples discrete stochastic outcomes for each object-centric manipulation action, and the system logs complete multi-modal state–action–observation trajectories, supporting evaluation, re-planning, and data collection for both online and offline learning (Li et al., 9 Oct 2025).
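
The closed-loop Python–Unity layer above exchanges JSON commands over TCP. A minimal sketch of one plausible message framing is shown below; the length-prefixed wire format and the `encode_action`/`decode_action` names are assumptions for illustration, not the platform's documented protocol.

```python
import json
import struct

def encode_action(action: dict) -> bytes:
    """Frame a high-level action (e.g., Pick) as a length-prefixed JSON
    message. Hypothetical wire format: 4-byte big-endian length + payload."""
    payload = json.dumps(action).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def decode_action(data: bytes) -> dict:
    """Inverse of encode_action: strip the 4-byte header and parse JSON."""
    (n,) = struct.unpack(">I", data[:4])
    return json.loads(data[4:4 + n].decode("utf-8"))
```

A real client would write such frames to a TCP socket connected to the Unity server and read framed observations back.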

Humanoid Models and Sensing

  • Unitree H1: Whole-body IK across both arms and minimal torso, with 4×4 homogeneous transforms for end-effectors, dynamic balancing, and multi-finger dexterous actuation.
  • Agibot X1: Decoupled 6-DOF per-arm IK, 3×3 rotation and translation per limb, simple parallel-jaw grippers.
  • Visual Sensing: Multi-camera egocentric RGB (optionally depth/point-map) configurations with minimal self-occlusion for robust perception (Li et al., 9 Oct 2025).

Design choices target support for continuous transitions, stochastic outcomes, and benchmarking across 10 room layouts, 68 objects, and 356+ tasks varying from single- to dual-arm essential and optional manipulation (Li et al., 19 Jun 2025).

2. Motion Generation and Contingency Modeling

Continuous Transition Mechanism

DualTHOR replaces all “flash” (discrete) transitions, as previously used in AI2-THOR, with continuous, trajectory-based updates in joint space. For each action, the platform solves the damped least-squares IK problem

$\min_{\Delta\theta} \|J(\theta)\,\Delta\theta - \Delta x\|^2 + \lambda\,\|\Delta\theta\|^2$

subject to joint limits and linearized self-collision avoidance

$\theta_{\min} \leq \theta + \Delta\theta \leq \theta_{\max}, \quad C_{\text{coll}}(\theta + \Delta\theta) \geq 0$

where $J(\theta)$ is the spatial Jacobian, $\Delta x$ the end-effector error, and $\lambda$ the damping parameter (Li et al., 19 Jun 2025).
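
The damped least-squares objective above has the closed-form normal-equation solution $\Delta\theta = (J^\top J + \lambda I)^{-1} J^\top \Delta x$. A minimal NumPy sketch of one such IK step, with joint limits enforced by clipping (a simplification of the platform's constrained solve):

```python
import numpy as np

def dls_ik_step(J, dx, lam=0.05, theta=None, theta_min=None, theta_max=None):
    """One damped-least-squares IK update:
    delta_theta = (J^T J + lam I)^{-1} J^T dx,
    minimizing ||J dtheta - dx||^2 + lam ||dtheta||^2.
    Joint limits, if given, are enforced by clipping (a sketch; the
    platform's solver also handles self-collision constraints)."""
    n = J.shape[1]
    dtheta = np.linalg.solve(J.T @ J + lam * np.eye(n), J.T @ dx)
    if theta is not None and theta_min is not None and theta_max is not None:
        dtheta = np.clip(theta + dtheta, theta_min, theta_max) - theta
    return dtheta
```

With $\lambda = 0$ and an invertible Jacobian this reduces to the exact Newton step; increasing $\lambda$ trades tracking accuracy for smaller, better-conditioned joint updates near singularities.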

  • H1: Jointly optimizes both arm chains plus torso for bimanual actions and dynamic stability.
  • X1: Executes decoupled per-arm IK in local frames.

State updates are performed at $60$ Hz in Unity physics with joint velocities $v$, emulating the physically plausible evolution $x(t+\Delta t) = x(t) + \int_0^{\Delta t} f_{\mathrm{IK}}(x(t+s), a)\,ds$ (Li et al., 9 Oct 2025).

Stochastic Contingency Mechanism

Every physical manipulation (e.g., “pick up cup”) is associated with a categorical outcome probability distribution, e.g.

$P(\text{success} \mid s,a) = 0.80, \quad P(\text{mug broken} \mid s,a) = 0.10, \quad P(\text{contents spilled} \mid s,a) = 0.10$

For each action, after the end-effector reaches proximity, an outcome $o \sim \mathrm{Categorical}(p_{a,s,1}, \ldots, p_{a,s,k})$ is sampled, and both visual/feedback observations and discrete events are returned to the planner. Contingency probabilities are user-configurable, and the mechanism is extensible to temporally correlated (non-i.i.d.) failure models (Li et al., 19 Jun 2025, Li et al., 9 Oct 2025).
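
Sampling from such a categorical outcome distribution is straightforward; the sketch below uses a cumulative-probability draw (the dict-based configuration format is an assumption for illustration).

```python
import random

def sample_outcome(probs: dict, rng=random):
    """Sample one contingency outcome from a user-configurable categorical
    distribution, e.g. {"success": 0.8, "mug_broken": 0.1,
    "contents_spilled": 0.1}. Probabilities must sum to 1."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9
    r, acc = rng.random(), 0.0
    for outcome, p in probs.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding at the boundary
```

Temporally correlated failures could be modeled by making `probs` depend on the action history rather than being fixed per action.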

3. Interfaces, State Representations, and Data Management

The DualTHOR interface is exposed via a standard Python client API communicating with Unity’s server:

  • API Methods:
    • reset_scene(scene_id)
    • step(action_dict) → returns {rgb, feedback, proprioception, done, info}
    • get_proprioception(), get_visual()
  • State Representations:
    • Joint states $s = [\theta_1, \ldots, \theta_n, v_1, \ldots, v_n]$
    • Base pose, height, room coordinates
    • Action space: high-level commands with discrete parameters (e.g., Pick(object_id, arm))
  • Observation Space:
    • Egocentric RGB $I_t$, optional depth/3D.
    • Proprioceptive input vectors at each step.
  • Execution Monitoring and Data Logging:
    • Step-level feedback (success/failure/contingency).
    • Extensive logging in JSON/HDF5; full replayable trajectories for offline learning or evaluation.
    • Undo/redo (state branching) support for rapid experimentation (Li et al., 9 Oct 2025).
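
A typical control loop against the API above alternates `reset_scene` and `step` calls while logging feedback. The sketch below uses a stub environment in place of the real Unity-backed client; the method names follow the API list above, but the observation field values are placeholders.

```python
class StubEnv:
    """Minimal stand-in for the DualTHOR Python client (illustration only;
    the real client talks to a Unity server over JSON/TCP)."""
    def reset_scene(self, scene_id):
        self.t = 0
        return {"rgb": None, "feedback": "reset", "done": False}

    def step(self, action_dict):
        self.t += 1
        return {"rgb": None, "feedback": "success",
                "proprioception": [0.0] * 4, "done": self.t >= 2, "info": {}}

def run_episode(env, actions, scene_id="FloorPlan1"):
    """Reset the scene, issue actions until done, and return the
    step-level feedback log (as used for replay and offline learning)."""
    env.reset_scene(scene_id)
    log = []
    for a in actions:
        obs = env.step(a)
        log.append(obs["feedback"])
        if obs["done"]:
            break
    return log
```

Swapping `StubEnv` for the real client would leave `run_episode` unchanged, which is the point of the stateful API design.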

4. Task Suite, Benchmarking, and Evaluation Protocol

DualTHOR provides a standardized suite of tasks divided into three categories with comprehensive coverage of bimanual (dual-arm) versus single-arm requirements across realistic room layouts:

| Task Category | Description | Example Tasks |
|---|---|---|
| Dual-Arm Essential | Impossible with one arm; requires collaboration | Lift heavy container, hold & pour, two-hand open |
| Dual-Arm Optional | Bimanual is more efficient but not strictly required | Carry two objects, parallel pick & place |
| Single-Arm | Adapted from AI2-THOR tasks | Basic pick, open, cook, fill, slice |

Benchmarks are specified over 359 (Li et al., 9 Oct 2025) or 356 (Li et al., 19 Jun 2025) tasks with 10 randomized room layouts and 68 object types. Each task × robot pairing is evaluated across 50 trials, with per-action budget of 50 steps. Feedback includes outcome, observation, and contingency event for each attempt (Li et al., 9 Oct 2025, Li et al., 19 Jun 2025).
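
Under this protocol, a task's score is the fraction of its 50 trials that succeed within the step budget. A small sketch of that aggregation; the `success`/`steps` record fields are assumptions for illustration.

```python
def success_rate(trial_outcomes, budget=50):
    """Per-task success rate under the 50-trial, 50-step-budget protocol:
    a trial counts only if it succeeds within the step budget."""
    ok = sum(1 for t in trial_outcomes
             if t["success"] and t["steps"] <= budget)
    return ok / len(trial_outcomes)
```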

Baseline Planners:

  • Proprietary MLLMs: GPT-4o, Gemini-1.5-Pro
  • Open-source: Qwen2.5-VL-7B, InternVL2.5-8B
  • Prompt-enhanced: LLM-Planner, RAP, DAG-Plan (dependency-graph bimanual planner)

Performance:

  • Dual-Arm Essential (X1/H1):
    • DAG-Plan: 36%/42%
    • Proprio-MLLM: 59%/63%
  • Dual-Arm Optional (X1/H1):
    • DAG-Plan: 51%/52%
    • Proprio-MLLM: 71%/70%
  • Single-Arm (X1/H1):
    • DAG-Plan: 55%/58%
    • Proprio-MLLM: 73%/75%

Proprio-MLLM achieves a mean improvement of 19.75% over the strongest baselines on these metrics (Li et al., 9 Oct 2025, Li et al., 19 Jun 2025).

A key result is the marked drop in baseline success rates under increased contingency probability (e.g., GPT-4o on dual-arm essential, X1: 23.6% nominal success falling to 11.5% at highest failure rate) (Li et al., 19 Jun 2025).

5. Proprio-MLLM: Embodiment-Aware Planning Model

Proprio-MLLM is a proprioception-augmented, multimodal LLM planner tightly coupled to DualTHOR. Its core components are:

  • Proprioceptive Encoding:
    • At each timestep $t$, the proprioceptive state $s_t$ is encoded via an MLP into “motion token” embeddings.
    • Sequences $m = \{s_{t-k}, \ldots, s_t\}$ are tokenized via a VQ-VAE:

    $z(m) = E(m), \quad p = \mathrm{argmin}_k \|z(m) - c_k\|_2, \quad e = c_p$

    $\mathcal{L} = \|D(z(m)) - m\|^2 + \alpha \|\mathrm{sg}[z(m)] - e\|^2 + \beta \|z(m) - \mathrm{sg}[e]\|^2$
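
The nearest-codebook lookup at the heart of this VQ-VAE tokenization can be sketched in a few lines of NumPy; the straight-through gradient and the commitment losses are omitted here, and `quantize` is an illustrative name, not the paper's API.

```python
import numpy as np

def quantize(z, codebook):
    """VQ-VAE codebook lookup: p = argmin_k ||z - c_k||_2, e = c_p.
    z is a latent vector, codebook is a (K, d) array of codes c_k."""
    d = np.linalg.norm(codebook - z, axis=1)  # distance to each code
    p = int(np.argmin(d))
    return p, codebook[p]
```

The discrete index `p` is what gets fed to the LLM as a motion token; `e` replaces `z(m)` on the decoder path.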

  • Motion-Based Position Embedding (MPE):

    • For visual tokens at $(x, y)$ relative to the robot centroid $(x_r, y_r)$,

    $\mathrm{MPE}(t, x, y) = [t,\, \mathrm{sign}(y - y_r),\, \mathrm{sign}(x - x_r)]$

  • Cross-Spatial Encoder (CSE):

    • Fuses 2D ($F_{2D} = E_{\mathrm{Qwen}}(I)$) and 3D ($F_{3D} = E_{\mathrm{CUT3R}}(I)$) point-map features via

    $F_{\mathrm{fused}} = \mathrm{MLP}(F_{2D} + \Phi(F_{3D}))$
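
Both the MPE and the cross-spatial fusion above are simple operations. The sketch below takes $\Phi$ as the identity and the MLP as a single linear layer with ReLU, purely for illustration; the paper's actual projection and MLP architecture may differ.

```python
import numpy as np

def mpe(t, x, y, xr, yr):
    """Motion-based position embedding: [t, sign(y - y_r), sign(x - x_r)]."""
    return np.array([t, np.sign(y - yr), np.sign(x - xr)])

def fuse(f2d, f3d, W, b):
    """Cross-spatial fusion sketch: MLP(F_2D + Phi(F_3D)), with Phi taken as
    identity and the MLP as one linear layer + ReLU (an assumption)."""
    return np.maximum(W @ (f2d + f3d) + b, 0.0)
```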

  • LLM Backbone:

  • Training:

This architecture is crucial for incorporating fine-grained embodiment awareness and spatial-motor reasoning, as evidenced by the performance improvement in dual-arm tasks.

6. Implementation, Assets, and Reproducibility

The platform builds on Unity 2023.2.3 for simulation, with a Python 3.10 control suite and C++ (gRPC or HTTP) IK back-end. Key performance and reproducibility features include:

  • Real-Time Performance: $40$–$60$ ms step latency on a single GPU (RTX 3090), $60$ Hz Unity simulation, asynchronous Python loop at $30$ Hz.
  • Repository and Codebase:
  • Reproducibility Pipeline:

1. Clone the repository and install the Unity assets.
2. Build and run the headless Unity server.
3. Launch the Python API server and connect to the Unity instance.
4. Execute the standardized evaluation script: examples/run_evaluate.py --planner proprio_mllm --tasks all

7. Research Impact and Open Challenges

DualTHOR highlights several central findings in contemporary embodied AI:

  • Dual-arm bimanual manipulation remains a significant obstacle for state-of-the-art LLM/VLM planning architectures; performance on truly essential bimanual tasks lags single-arm counterparts by >20 percentage points for all tested baselines (Li et al., 19 Jun 2025).
  • Stochastic contingency modeling surfaces planner brittleness; high-level agents experience a 30–50% success drop with more realistic per-action failure rates, unless explicit recovery and dependency-graph strategies (DAG-Plan) are used (Li et al., 19 Jun 2025).
  • Proprioception and embodiment awareness are pivotal for solving physically realistic, long-horizon tasks, as evidenced by the quantitative gains of the Proprio-MLLM planner (Li et al., 9 Oct 2025).

The platform underscores the need for further integration of perception, robust low-level control, and high-level symbolic reasoning for reliable real-world embodied intelligence. It establishes a common ground for benchmarking, systematic data collection, and cross-methodological comparison in dual-arm humanoid robotics (Li et al., 9 Oct 2025, Li et al., 19 Jun 2025).
