AI2-THOR: Interactive Simulation for Embodied AI

Updated 28 December 2025
  • AI2-THOR is a photorealistic interactive simulation environment for embodied AI featuring high-fidelity 3D scenes, realistic physics, and diverse indoor layouts.
  • It supports detailed agent interactions through vision, manipulation, and navigation, enabling standardized evaluation of advanced reinforcement learning methods.
  • Recent work built on the platform integrates cost-aware action modeling and HC-GRPO optimization to balance task success with operational efficiency under real-world constraints.

AI2-THOR is a photorealistic interactive simulation environment developed to facilitate research in embodied AI, particularly for training, evaluating, and benchmarking agents capable of vision, manipulation, navigation, and multi-modal reasoning. It provides high-fidelity 3D scenes with support for physical interactions, enabling the development and assessment of agents, increasingly powered by multimodal large language models (MLLMs), for complex embodied tasks. AI2-THOR is extensively used as the standard testbed in recent work on cost-aware multimodal agents, including the implementation and validation of advanced reinforcement learning methods such as Heterogeneous Cost-Aware Group Relative Policy Optimization (HC-GRPO) (Zhou et al., 21 Dec 2025).

1. Core Features and Capabilities

AI2-THOR provides a suite of simulated indoor environments with diverse layouts (kitchens, living rooms, bedrooms, bathrooms), which model realistic objects, lighting, and physics. Agents can perceive the environment through RGB (and optionally depth) sensors, manipulate a variety of objects (e.g., opening, closing, picking up, placing), and physically navigate through dynamic scenes. The environment allows explicit control over agent viewpoints and enables randomized scene configurations for robust evaluation.

The platform exposes an API that supports both low-level (physical actions) and high-level (semantic queries, object masks) interactions, with deterministic and stochastic execution modes. This design supports experimentation with both traditional and neural agent architectures, including those involving chain-of-thought planning and multi-step reasoning.
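As a concrete illustration of this API surface, the sketch below uses the open-source ai2thor Python package to reset a scene, take a low-level navigation action, and read back high-level object metadata. It is a minimal sketch rather than a prescribed workflow: exact parameter and action names vary slightly across ai2thor versions, and the scene names and object filter are illustrative.

```python
# Minimal AI2-THOR interaction sketch (parameter/action names may differ across versions).
from ai2thor.controller import Controller

# Start a simulated kitchen scene with depth rendering enabled.
controller = Controller(scene="FloorPlan1", renderDepthImage=True, gridSize=0.25)

# Low-level physical action: move the agent forward one grid step.
event = controller.step(action="MoveAhead")
rgb = event.frame                       # egocentric RGB observation (H x W x 3)
depth = event.depth_frame               # optional depth observation
print("MoveAhead succeeded:", event.metadata["lastActionSuccess"])

# High-level semantic query: list currently visible, openable objects.
visible_openable = [
    obj["objectId"]
    for obj in event.metadata["objects"]
    if obj["visible"] and obj["openable"]
]

# Low-level manipulation: open the first such object, if any.
if visible_openable:
    event = controller.step(action="OpenObject", objectId=visible_openable[0])

# Randomize the scene configuration for a fresh evaluation episode.
controller.reset(scene="FloorPlan2")
controller.step(action="InitialRandomSpawn", randomSeed=0)
```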

2. Applications in Embodied AI Research

AI2-THOR has become the canonical benchmark for embodied decision-making and instruction-following tasks. Recent research utilizes the platform for:

  • Cost-Aware Embodied Search: ESearch-R1 demonstrates the integration of complex cost models in interactive tasks—balancing the costs of physical navigation, human attention (via interactive questioning), and memory recall—to achieve efficient task completion (Zhou et al., 21 Dec 2025).
  • Benchmarking of Multimodal Reasoning Agents: The environment enables systematic comparison of agents trained with supervised fine-tuning (SFT), standard reinforcement learning (e.g., PPO), and novel policy gradient methods (HC-GRPO). It supports measurements such as task success rate, total operational cost, and action-type distributions.
  • Sim2Real Research: By offering domain-randomized physics and diverse object categories, AI2-THOR supports research into generalization and transfer from simulation to real-world robotics.

3. Integration with Reinforcement Learning Frameworks

AI2-THOR is designed for seamless integration with contemporary RL pipelines. Recent work employs the platform to instantiate partially observable Markov decision processes (POMDPs) where the agent state consists of both egocentric visual observations and structured episodic memories. The reward function can be finely customized to reflect heterogeneous, real-world operational costs, such as:

$$R(\tau) = R_{task} - \lambda \sum_{t=0}^{T} C(a_t)$$

where $C(a_t)$ quantifies the cost of each action, incorporating navigation distance, question-based attention cost with a fatigue penalty, and memory retrievals (Zhou et al., 21 Dec 2025).

Table: Example Action Costs in AI2-THOR Experiments (Zhou et al., 21 Dec 2025)

| Action    | Cost Formula                         | Relative Magnitude                  |
|-----------|--------------------------------------|-------------------------------------|
| Navigate  | $c_{nav} \cdot d(p_t, p_{t+1})$      | Largest                             |
| Ask       | $c_{ask} \cdot (1 + \alpha N_{ask})$ | Moderate (includes fatigue penalty) |
| GetMemory | $c_{mem}$                            | Smallest                            |
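The heterogeneous cost structure above can be made concrete with a short sketch that accumulates per-action costs and the resulting cost-adjusted return. The cost coefficients ($c_{nav}$, $c_{ask}$, $\alpha$, $c_{mem}$, $\lambda$) and the episode log format are illustrative placeholders, not the values used by Zhou et al.

```python
import math

# Illustrative cost coefficients (placeholders, not the paper's values).
C_NAV, C_ASK, ALPHA, C_MEM, LAMBDA = 1.0, 0.5, 0.3, 0.05, 1.0

def action_cost(action, state):
    """Heterogeneous per-action cost C(a_t), following the table above."""
    if action["type"] == "Navigate":
        # Cost scales with the distance between consecutive agent positions.
        (x0, z0), (x1, z1) = action["from"], action["to"]
        return C_NAV * math.hypot(x1 - x0, z1 - z0)
    if action["type"] == "Ask":
        # Each question grows more expensive as the fatigue counter N_ask rises.
        cost = C_ASK * (1.0 + ALPHA * state["n_ask"])
        state["n_ask"] += 1
        return cost
    if action["type"] == "GetMemory":
        return C_MEM
    return 0.0

def cost_adjusted_return(task_reward, actions):
    """R(tau) = R_task - lambda * sum_t C(a_t)."""
    state = {"n_ask": 0}
    total_cost = sum(action_cost(a, state) for a in actions)
    return task_reward - LAMBDA * total_cost, total_cost

# Example episode: one navigation step, one clarifying question, two memory recalls.
trajectory = [
    {"type": "Navigate", "from": (0.0, 0.0), "to": (2.0, 1.5)},
    {"type": "Ask"},
    {"type": "GetMemory"},
    {"type": "GetMemory"},
]
ret, ttc = cost_adjusted_return(task_reward=1.0, actions=trajectory)
print(f"return={ret:.2f}, total cost={ttc:.2f}")
```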

Policy optimization protocols, such as HC-GRPO, are directly instantiated in this context, with agents’ returns empirically evaluated by executing reasoning traces and action sequences, followed by in-environment cost measurement.

4. HC-GRPO-based Embodied Agents in AI2-THOR

HC-GRPO departs from conventional actor-critic architectures by computing per-instruction, group-relative “advantages” based on sampled rollouts, without learning a separate critic network. For each user instruction, $G$ chain-of-thought reasoning traces and associated action sequences are generated; all are executed within AI2-THOR to yield cost-adjusted returns.

Given the group mean $\mu_R$ and standard deviation $\sigma_R$ across the $G$ samples:

$$A_i = \frac{r_i - \mu_R}{\sigma_R + \epsilon}$$

The surrogate loss used in training is:

$$L_q(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \min\left\{ \rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \right\} + \beta\, D_{KL}\left[\pi_\theta(\cdot \mid q)\,\|\,\pi_{ref}(\cdot \mid q)\right]$$

where $\rho_i$ is the importance-sampling ratio, $D_{KL}$ is a KL divergence penalty against the frozen reference policy, and $A_i$ assigns positive or negative advantage to more or less efficient trajectories (Zhou et al., 21 Dec 2025).
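A minimal PyTorch-style sketch of this objective for a single instruction's group of $G$ rollouts is given below. It is not the paper's implementation: it assumes the caller supplies per-rollout cost-adjusted returns and summed token log-probabilities under the current, old, and frozen reference policies, and it approximates the KL term from sampled log-probabilities rather than computing it in closed form.

```python
import torch

def hc_grpo_loss(returns, logp_new, logp_old, logp_ref,
                 clip_eps=0.2, beta=0.01, adv_eps=1e-8):
    """
    Group-relative surrogate loss for one instruction q.

    returns : (G,) cost-adjusted returns r_i from executing each rollout in AI2-THOR
    logp_*  : (G,) summed log-probabilities of each sampled trace under the current
              policy (new), the behavior policy (old), and the frozen SFT reference (ref).
    """
    # Group-relative advantage A_i = (r_i - mu_R) / (sigma_R + eps); no critic needed.
    advantages = (returns - returns.mean()) / (returns.std() + adv_eps)

    # Importance-sampling ratio rho_i = pi_theta / pi_old, clipped PPO-style.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))

    # Sample-based estimate of KL[pi_theta || pi_ref], keeping the policy close
    # to the supervised reference to prevent drift.
    kl_estimate = torch.mean(logp_new - logp_ref)

    return policy_loss + beta * kl_estimate

# Toy usage with G = 4 rollouts.
G = 4
loss = hc_grpo_loss(
    returns=torch.tensor([0.8, -0.2, 0.5, -1.0]),
    logp_new=torch.randn(G, requires_grad=True),
    logp_old=torch.randn(G),
    logp_ref=torch.randn(G),
)
loss.backward()
```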

Implementation in AI2-THOR supports batching over multiple instructions and parallel rollout execution, with performance and memory scaling enabled by multi-GPU setups (e.g., 8×NVIDIA H20 GPUs, with a batch of $M=8$ instructions × $G=4$ rollouts each).

5. Empirical Evaluations and Metrics

HC-GRPO-trained agents in AI2-THOR are evaluated on ESearch-Bench, which measures:

  • Task Success Rate (SR): Fraction of successful target object retrievals.
  • Total Task Cost (TTC): Aggregated execution cost across all action types.
  • Composite Score (SwC): Joint metric of efficiency and accuracy.

Under ambiguous instructions, the cost-sensitive ESearch-R1 agent achieves a mean SR of 61.5% (vs. 60% for the best ReAct baseline) while roughly halving TTC from 3.3 (PPO ReAct) to 1.6. It does so by sharply reducing expensive Navigate actions and making greater use of low-cost GetMemory steps, with regularization (KL and entropy bonuses) preserving policy robustness and preventing catastrophic forgetting of the initial SFT reasoning behavior (Zhou et al., 21 Dec 2025).
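For reference, SR and TTC can be aggregated from per-episode evaluation logs roughly as in the sketch below. The episode record format is hypothetical, TTC is computed here as a per-episode mean (an assumption about the aggregation), and the SwC combination rule is defined by ESearch-Bench and not reproduced here.

```python
def aggregate_metrics(episodes):
    """
    episodes: list of dicts, each with
        "success": bool  - whether the target object was retrieved
        "cost":    float - summed heterogeneous action cost for the episode
    Returns Task Success Rate (SR) and mean Total Task Cost (TTC).
    The composite SwC metric is benchmark-defined and omitted here.
    """
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    ttc = sum(e["cost"] for e in episodes) / n
    return sr, ttc

# Hypothetical log of four evaluation episodes.
log = [
    {"success": True,  "cost": 1.4},
    {"success": False, "cost": 2.1},
    {"success": True,  "cost": 1.2},
    {"success": True,  "cost": 1.8},
]
print(aggregate_metrics(log))  # e.g. (0.75, 1.625)
```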

6. Implementation Considerations and Stability

Empirical findings indicate that the stability of RL training in AI2-THOR hinges on several implementation protocols:

  • Policy Regularization: Maintaining a frozen supervised pre-training reference $\pi_{ref}$ and explicit KL penalties is critical for the stability of large MLLMs (a minimal sketch of freezing the reference follows this list).
  • Learning Rates and Exploration Bonuses: Small learning rates and entropy coefficients are required to preserve chain-of-thought capabilities during RL fine-tuning.
  • Cost Modeling: Realistic and heterogeneous cost assignments (e.g., navigation, interaction, memory) are necessary to elicit strategic action distributions.
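The first point can be sketched as follows, assuming a PyTorch policy model; the reference copy is frozen once after SFT and used only to score sampled traces for the KL penalty.

```python
import copy
import torch

def make_frozen_reference(policy_model):
    """Clone the SFT-initialized policy and freeze it as pi_ref for KL regularization."""
    ref_model = copy.deepcopy(policy_model)
    ref_model.eval()                      # disable dropout and similar stochastic layers
    for param in ref_model.parameters():
        param.requires_grad_(False)       # no gradients ever flow into the reference
    return ref_model

# During each update, pi_ref scores the sampled traces without building a graph:
# with torch.no_grad():
#     logp_ref = score_rollouts(ref_model, rollouts)   # hypothetical helper returning
#                                                      # summed log-probs per rollout
```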

These choices jointly enable the development of robust agents aligned with both accuracy and operational efficiency under real-world constraints, validating AI2-THOR’s role as a research platform for next-generation embodied AI (Zhou et al., 21 Dec 2025).
