ProcTHOR: Indoor Simulation for Embodied AI

Updated 8 May 2026

ProcTHOR is a large-scale simulation platform that procedurally generates diverse, photo-realistic indoor scenes for embodied AI research.
Its pipeline integrates scene specification, generative models, and a Unity backend to create varied layouts with detailed physics and annotated object affordances.
ProcTHOR underpins multiple benchmarks for navigation, manipulation, and multi-agent collaboration, enabling advanced model generalization and domain adaptation studies.

ProcTHOR is a large-scale platform for sampling diverse, interactive, and customizable simulated indoor environments for embodied AI research. Built on top of the AI2-THOR Unity simulator, ProcTHOR enables researchers to procedurally generate an unbounded number of 3D homes, supporting a wide spectrum of embodied tasks, modalities, and agent types. ProcTHOR environments, datasets, and their derivative benchmarks have become foundational in evaluating navigation, manipulation, multi-agent collaboration, floor plan synthesis, and domain adaptation in embodied AI.

1. Procedural Generation Pipeline and Scene Specification

ProcTHOR produces high-diversity, fully interactive indoor scenes by sampling from parametric generative models defined over house layouts, room types, object placements, materials, lighting, and interaction affordances (Deitke et al., 2022). The core pipeline comprises three tightly coupled modules:

Scene Specification: Each environment is encoded as a JSON string specifying the geometry, room adjacency tree, walls, doors, floor/wall materials, objects, lighting, and stateful object properties.
Procedural Generator: Given a high-level “room spec,” the generator samples the number of rooms $R$, extrudes a polygonal floor plan, partitions it into rooms via binary space partitioning, and connects rooms with doors/arches of sampled types. Room content is then populated by sampling object classes, receptive surfaces, and spatial arrangements based on learned co-occurrence statistics and annotated placement priors.
Unity Backend: Reads the JSON, instantiates all physical objects and agents, and exposes an API for agent visual and interaction sensing. Rendering is photorealistic, with per-episode randomization of textures, colors, and lighting.
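The partitioning step of the generator can be illustrated with a minimal sketch. It assumes a rectangular footprint and a simple "split the largest room" rule; the function name `bsp_partition`, the `min_side` parameter, and the rectangle encoding are hypothetical and are not taken from the ProcTHOR codebase, which works on polygonal floor plans and a room adjacency tree.

```python
import random

def bsp_partition(width, depth, num_rooms, rng, min_side=2.0):
    """Split a width x depth rectangle into axis-aligned rooms.

    Rooms are (x, z, w, d) tuples. Illustrative only: real floor plans
    are polygonal and follow a room spec rather than a largest-first rule.
    """
    rooms = [(0.0, 0.0, width, depth)]
    while len(rooms) < num_rooms:
        # Pick the room with the largest area as the next one to split.
        rooms.sort(key=lambda r: r[2] * r[3])
        x, z, w, d = rooms.pop()
        if max(w, d) < 2 * min_side:
            # Stop early: the largest room is too small to split.
            rooms.append((x, z, w, d))
            break
        if w >= d:
            # Cut across the longer side so both halves keep min_side.
            cut = rng.uniform(min_side, w - min_side)
            rooms += [(x, z, cut, d), (x + cut, z, w - cut, d)]
        else:
            cut = rng.uniform(min_side, d - min_side)
            rooms += [(x, z, w, cut), (x, z + cut, w, d - cut)]
    return rooms

rooms = bsp_partition(10.0, 8.0, num_rooms=5, rng=random.Random(0))
for room in rooms:
    print(room)
```

Each split preserves total floor area, so the rooms tile the original footprint exactly; doors would then be placed on walls shared by adjacent rectangles.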

Layouts range from single-room apartments (<25 m²) to complex 10-room houses (>75 m²). Each scene draws on up to 95 object categories, and every object is annotated with affordance metadata (pickupable, receptacle, openable, etc.) that supports manipulation and per-object state variation. The complete Unity project, procedural scripts, and pre-baked 10 000-house dataset are distributed under Apache 2.0 (Deitke et al., 2022).
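As a rough illustration of how affordance metadata can be consumed, the sketch below filters a list of object records by a boolean flag. The records are made-up examples and the flat dictionary layout is an assumption for illustration; the field names mirror the affordance labels mentioned above, but the exact schema of a ProcTHOR scene file may differ.

```python
# Hypothetical object records; not copied from any real scene file.
objects = [
    {"objectType": "Mug", "pickupable": True, "receptacle": True, "openable": False},
    {"objectType": "Fridge", "pickupable": False, "receptacle": True, "openable": True},
    {"objectType": "Painting", "pickupable": False, "receptacle": False, "openable": False},
]

def with_affordance(records, affordance):
    """Return the object types whose record sets the given affordance flag."""
    return [r["objectType"] for r in records if r.get(affordance, False)]

print(with_affordance(objects, "pickupable"))
print(with_affordance(objects, "receptacle"))
```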

2. Dataset Structure, Statistics, and Benchmarks

The canonical release, ProcTHOR-10k, provides 10 000 sampled houses for training, 1 000 for validation, and 1 000 for testing, spanning a wide range of layouts, object compositions, and surface/lighting configurations (Deitke et al., 2022, Luo et al., 2024). Salient statistics include:

Room count: 1–10 per house (mean ≈5), with multi-room connectivity graphs.
Objects: Up to 95 asset types per room, drawn from a library of 1 633 3D mesh instances across 108 classes.
Semantic Asset Groups: 18 groups enable combinatorial composition (>20M group instantiations).
Rendering performance: Up to 8.6k FPS (navigation) or 6.5k FPS (interaction) on 8 GPUs.

ProcTHOR supports standardized embodied AI benchmarks, e.g.:

ObjectNav (object-goal navigation): SR, SPL, SEL, episode length, curvature (Eftekhar et al., 2023).
ArmPointNav (mobile-manipulation): PickUp success, object displacement tasks.
MultiON (multi-object navigation): Success Rate, SPL, Kendall Tau order (Rajvanshi et al., 2023).
Heterogeneous Multi-Agent Collaboration: success rate, partial success, find-misplaced-object rate, communication efficiency (Liu et al., 2023).
Domain Adaptation Datasets: ProcTHOR-OD supports large-scale point cloud simulation for 3D detection studies (Zhao et al., 24 Aug 2025).
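Two of the navigation metrics listed above, success rate (SR) and success weighted by path length (SPL), can be computed as in the sketch below. It uses the standard SPL definition, the mean over episodes of S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the length of the path the agent took. The episode tuple layout is an assumption made for this example, and shortest-path lengths are assumed to be positive.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    episodes = list(episodes)
    return sum(1 for ok, _, _ in episodes if ok) / len(episodes)

def spl(episodes):
    """Success weighted by path length.

    Each episode is (success, shortest_path_length, agent_path_length),
    with shortest_path_length > 0.
    """
    episodes = list(episodes)
    total = 0.0
    for ok, shortest, taken in episodes:
        if ok:
            # A successful episode scores 1.0 only if the path was optimal.
            total += shortest / max(taken, shortest)
    return total / len(episodes)

episodes = [
    (True, 5.0, 10.0),   # success, path twice as long as optimal
    (True, 4.0, 4.0),    # success along the shortest path
    (False, 3.0, 9.0),   # failure contributes zero
]
print(success_rate(episodes), spl(episodes))
```

Because failed episodes contribute zero and inefficient successes are discounted, SPL is never larger than SR on the same set of episodes.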

Ablations in Deitke et al. (2022) indicate that out-of-distribution navigation performance improves monotonically as the number of training environments grows.

3. Embodied Tasks and Agent Interaction Modalities

ProcTHOR environments support a broad agent interface:

State space: Rich observations via RGB, depth, semantic segmentation, normal maps, and per-object instance IDs.
Action space: Navigation (MoveAhead, Rotate, Look), manipulation (Pickup, Put, Open, Close), arm control for mobile manipulation.
Agent pose: Continuous 3D coordinates and orientation (sampled over the nav-mesh, constrained by physics and obstacles).
Interaction: Realistic rigid-body physics, collision constraints, object state mutability (openable, movable, pickable).
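To make the discrete navigation actions concrete, the following toy kinematic model updates an agent pose in response to action names. It is not the simulator: there is no collision checking or nav-mesh, and the 0.25 m step, the 90° turn, the split of "Rotate" into `RotateRight`/`RotateLeft`, and the convention that yaw 0 faces +z with positive yaw turning toward +x are all assumptions chosen for this sketch.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Pose:
    x: float
    z: float
    yaw_deg: float  # 0 faces +z; positive yaw turns toward +x (assumed)

def apply_action(pose, action, move_step=0.25, turn_deg=90.0):
    """Return the pose after one discrete action, ignoring obstacles."""
    if action == "MoveAhead":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + move_step * math.sin(rad),
                    pose.z + move_step * math.cos(rad),
                    pose.yaw_deg)
    if action == "RotateRight":
        return Pose(pose.x, pose.z, (pose.yaw_deg + turn_deg) % 360.0)
    if action == "RotateLeft":
        return Pose(pose.x, pose.z, (pose.yaw_deg - turn_deg) % 360.0)
    raise ValueError(f"unsupported action: {action}")

pose = Pose(0.0, 0.0, 0.0)
for action in ["MoveAhead", "RotateRight", "MoveAhead"]:
    pose = apply_action(pose, action)
print(pose)
```

In the actual environment a move can fail when it would collide with geometry, so an agent should check the outcome of each action rather than assume the pose changed.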

Tasks formalized include pure navigation, navigation-plus-manipulation (e.g., pick-and-place), tidying-up, demand-driven navigation, and domain transfer in 3D detection. Benchmark-specific conventions (e.g., target object visibility, success radius, shortest-path reward shaping) are strictly defined for objective comparison (Eftekhar et al., 2023, Wang et al., 2023, Liu et al., 2023).
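A success convention of the kind mentioned above can be written as a small predicate. The sketch assumes a common ObjectNav-style rule (the agent has ended the episode, the target is visible, and the target lies within a success radius); the 1.0 m default and the function signature are illustrative assumptions, and each benchmark defines its own exact values.

```python
import math

def objectnav_success(agent_xz, target_xz, target_visible, episode_ended,
                      success_radius=1.0):
    """Check an assumed ObjectNav-style success rule on the ground plane."""
    distance = math.dist(agent_xz, target_xz)
    return episode_ended and target_visible and distance <= success_radius

print(objectnav_success((0.0, 0.0), (0.6, 0.6), True, True))
print(objectnav_success((0.0, 0.0), (0.6, 0.6), False, True))
```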

4. Enhancements, Derived Datasets, and Downstream Adaptation

Several works have extended ProcTHOR with new datasets and benchmarking protocols:

ProcTHOR-OD (Zhao et al., 24 Aug 2025): A 10k-scene point cloud dataset generated by mesh export of statically sampled single-room scenes with high-fidelity geometry; it supports domain adaptation research in 3D object detection. Annotations are 3D bounding boxes (center, dimensions, yaw) in room-centric coordinates, and evaluation uses IoU-based mAP.
Floorplan to Data-Structure Conversion (Luo et al., 2024): Raw ProcTHOR scenes are converted to a JSON schema that captures room polygons, areas, dimensions, and adjacency (bubble graphs), supporting constraint-driven floorplan generation and evaluation by LLMs.
MultiON Benchmark (Rajvanshi et al., 2023): ProcTHOR scenes form the substrate for evaluating multi-object navigation with LLM-based dynamic planning; episodes are specified with full object, room, and shortest-path metadata.
Heterogeneous Multi-Agent Collaboration (Liu et al., 2023): ProcTHOR-10k is used for sampling tidying-up tasks with variable agents, room counts, object configurations, and behavior cloning demo trajectories.
Physics-Enabled Interactive Navigation (Vashisth et al., 23 Feb 2026): ProcTHOR’s physics simulation supports lifelong, cluttered navigation with movable obstacles and sequential object placement.
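The IoU-based evaluation used for 3D detection rests on the overlap between a predicted and a ground-truth box. The sketch below computes 3D IoU for the simplified case where both boxes have zero yaw, so they are axis-aligned; handling arbitrary yaw requires intersecting rotated footprints and is omitted. The `(cx, cy, cz, dx, dy, dz)` tuple layout is an assumption for this example rather than the dataset's file format.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two boxes given as (cx, cy, cz, dx, dy, dz), yaw assumed 0."""
    intersection = 1.0
    for axis in range(3):
        half_a = box_a[axis + 3] / 2.0
        half_b = box_b[axis + 3] / 2.0
        low = max(box_a[axis] - half_a, box_b[axis] - half_b)
        high = min(box_a[axis] + half_a, box_b[axis] + half_b)
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    volume_a = box_a[3] * box_a[4] * box_a[5]
    volume_b = box_b[3] * box_b[4] * box_b[5]
    return intersection / (volume_a + volume_b - intersection)

ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
prediction = (1.0, 0.0, 0.0, 2.0, 2.0, 2.0)
print(iou_3d_axis_aligned(ground_truth, prediction))
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a chosen threshold, and mAP averages the resulting precision over classes.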

These derivatives expand application domains from embodied navigation to generative design, multi-agent systems, and domain adaptation, while exposing scale-induced gaps and transfer challenges.

5. Model Architectures and Algorithmic Advances Leveraging ProcTHOR

The scale and diversity of ProcTHOR enable training and evaluation of a wide spectrum of architectures. Notable developments include:

Task-Selective Bottleneck Representations: EmbCLIP-Codebook (Eftekhar et al., 2023) introduces a learnable, task-conditioned codebook bottleneck atop a frozen CLIP backbone, yielding a 256×10-dim intermediate representation. This reduces overfitting to noise and accelerates convergence (+6 pp success rate, 46 fewer steps on average, 4× lower curvature), with robust generalization on ProcTHOR and downstream benchmarks.
Contrastive Language-Image Attribute Alignment: Demand-driven navigation (Wang et al., 2023) uses ProcTHOR for extracting demand–object mappings, leveraging LLM- and CLIP-based contrastive embedding alignment. Robust attribute conditioning yields substantial navigation and selection success gains versus baselines.
LLM-driven Dynamic Planning and Constraint Reasoning: SayNav (Rajvanshi et al., 2023) and interactive navigation (Vashisth et al., 23 Feb 2026) serialize internal 3D scene graphs derived from ProcTHOR environments and employ LLMs for generating, refining, and executing dynamic plans under partial observability, active perception, and lifelong cluttered manipulation constraints.
Hierarchical Multi-Agent Decision Models: Liu et al. (2023) deploy graph-based reasoning and handshake-based communication atop ProcTHOR-10k, using commonsense graphs and CNN/transformer modules for scene understanding and collaborative planning.
Floorplan Generation LLMs: Luo et al. (2024) adapt Llama3 with LoRA, conditioned on JSON-like partial constraint strings derived from ProcTHOR, enforcing numerical and adjacency constraints for generative design.
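The codebook bottleneck idea from the first item in this list can be sketched in a few lines. The sketch assumes that "256×10" means 256 codes of 10 dimensions each and that the compact representation is a softmax-weighted (convex) combination of those codes, with weights produced by scoring the input embedding. The random weights stand in for learned parameters, the 16-dim input is a placeholder, and all names are hypothetical; this is not the published architecture.

```python
import math
import random

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def codebook_bottleneck(embedding, scorer, codebook):
    """Compress an embedding into a convex combination of codebook entries."""
    # One logit per code: dot product of a scoring row with the embedding.
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in scorer]
    weights = softmax(logits)
    code_dim = len(codebook[0])
    return [sum(p * code[j] for p, code in zip(weights, codebook))
            for j in range(code_dim)]

rng = random.Random(0)
NUM_CODES, CODE_DIM, EMBED_DIM = 256, 10, 16  # EMBED_DIM is a placeholder
codebook = [[rng.gauss(0, 1) for _ in range(CODE_DIM)] for _ in range(NUM_CODES)]
scorer = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(NUM_CODES)]
embedding = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]

compact = codebook_bottleneck(embedding, scorer, codebook)
print(len(compact))
```

The bottleneck forces downstream layers to read the scene through a small set of codes, which is one plausible reading of why such a module would filter task-irrelevant detail.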

6. Evaluation, Zero-Shot Generalization, and Domain Gap Analyses

Comprehensive benchmarking on ProcTHOR demonstrates:

Strong Zero-Shot Transfer: Agents pretrained on ProcTHOR often outperform state-of-the-art baselines on six downstream benchmarks (among them RoboTHOR, Habitat, AI2-iTHOR, Rearrangement, and ManipulaTHOR), even before fine-tuning (Deitke et al., 2022).
Scale Ablation: Performance on out-of-domain benchmarks improves monotonically with the number of ProcTHOR training environments, confirming the inductive gains from procedural scale (Deitke et al., 2022).
Domain Adaptation Limits: ProcTHOR-OD to ScanNet 3D detection transfer yields a residual domain gap (oracle mAP 46.3%, ProcTHOR-trained models 18.9%), with few-shot and DA baselines recovering only modest improvement (Zhao et al., 24 Aug 2025).
Physics and Multi-Agent Scaling: Highly cluttered, physics-enabled scenes require constraint-aware planning; naive baselines over- or under-interact, while scene-graph + LLM strategies achieve higher long-term efficiency (Vashisth et al., 23 Feb 2026). Multi-agent communication models outperform single-agent and centralized baselines in tidying-up (Liu et al., 2023).
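The domain-gap figures quoted in the list above can be restated in relative terms with a short calculation. The two mAP values are taken from the text; the variable names are illustrative.

```python
oracle_map = 46.3     # mAP (%) of the oracle, as quoted above
procthor_map = 18.9   # mAP (%) of the ProcTHOR-trained model

absolute_gap = oracle_map - procthor_map         # in percentage points
fraction_of_oracle = procthor_map / oracle_map   # share of oracle performance

print(f"gap: {absolute_gap:.1f} pp")
print(f"fraction of oracle: {fraction_of_oracle:.1%}")
```

The simulation-trained detector therefore reaches roughly 41% of oracle performance, leaving a gap of 27.4 percentage points.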

7. Limitations, Extensions, and Prospects

Limitations include single-floor-only layouts in v1.0 (multi-story and exteriors planned), a persistent lighting domain gap versus real scans, and sim-to-real generalization gaps in physics-enabled or 3D-scan transfer tasks (Deitke et al., 2022, Zhao et al., 24 Aug 2025). Future extensions focus on:

Scaling to multi-floor, yard, and exterior scenes.
Procedural generation in domains beyond residential, e.g., retail, factories, outdoor.
Curriculum generation and data-driven adaptation of generation specs based on agent weaknesses.
Enhanced physics (articulated objects, tactile, and audio simulation).
Integration with online, simulator-in-the-loop robotic agents and adaptive feedback loops.

ProcTHOR’s impact derives from its scalable procedural generator, broad modality and task support, alignment with open standards, and demonstrably strong empirical generalization—making it foundational in embodied AI research and benchmarking (Deitke et al., 2022, Eftekhar et al., 2023, Wang et al., 2023, Rajvanshi et al., 2023, Liu et al., 2023, Luo et al., 2024, Zhao et al., 24 Aug 2025, Vashisth et al., 23 Feb 2026).