AI2-THOR: An Interactive 3D Environment for Visual AI (1712.05474v4)

Published 14 Dec 2017 in cs.CV, cs.AI, and cs.LG

Abstract: We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

Authors (13)

Eric Kolve (13 papers)
Roozbeh Mottaghi (66 papers)
Winson Han (11 papers)
Eli VanderBilt (10 papers)
Luca Weihs (46 papers)
Alvaro Herrasti (11 papers)
Matt Deitke (11 papers)
Kiana Ehsani (31 papers)
Daniel Gordon (14 papers)
Yuke Zhu (134 papers)
Aniruddha Kembhavi (79 papers)
Abhinav Gupta (178 papers)
Ali Farhadi (138 papers)

Citations (975)

Summary

Overview

AI2-THOR offers an advanced framework for visual AI research through near photo-realistic 3D indoor environments. This platform aims to bridge the gap between the visual understanding demonstrated by humans and the capabilities of current AI models. While most computer vision models rely on static images or video, AI2-THOR enables interactive learning akin to human real-world experience. Notably, the platform supports a broad spectrum of research domains, including deep reinforcement learning, imitation learning, planning, visual question answering, and object detection and segmentation, among others.

Distinguishing Features

AI2-THOR stands out due to several key factors:

Interactions: The platform supports various interactions, such as object state changes, arm-based manipulation, and causal interactions, allowing for complex task executions like filling a mug with water from a faucet.
Scenes: By leveraging procedural generation and manually designed scenes, AI2-THOR offers a diverse and expansive set of interactive environments. For example, it includes 120 rooms in iTHOR, 89 scenes in RoboTHOR, and 10 houses in ArchitecTHOR.
Quality: AI2-THOR's near photo-realistic objects and scenes are intended to improve the transfer of learned models to real-world applications, and they offer far greater visual complexity than simpler environments such as Atari games.
API: The platform features a robust Python API interfaced with the Unity 3D game engine, enabling functionalities such as navigation, force application, object interaction, and physics modeling.

Practical Significance

The practical implications of AI2-THOR are extensive. It provides a scalable and cost-effective proxy for real-world scenarios, overcoming the limitations of real-world robotic experiments that are often expensive, unsafe, or constrained. The ability to simulate thousands of iterations efficiently facilitates the training of generalized models capable of operating in diverse environments.

Key Components

API

AI2-THOR's agent-simulator loop pairs a front-end Python API with a back-end Unity engine: the agent issues an action through the Python API, Unity executes it, and the resulting observations and metadata are returned to the agent. The agent-simulator lifecycle diagram in the paper depicts this loop.
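
The loop can be sketched in Python roughly as follows. This is a minimal sketch, assuming the `ai2thor` package is installed; `make_controller` and `rollout` are hypothetical helper names introduced here, and the scene name `FloorPlan1` is only an example. The controller is passed in as an argument so the loop can also be exercised with a stand-in object.

```python
from typing import Any, Dict, List


def make_controller(scene: str = "FloorPlan1") -> Any:
    """Start the Unity back end through the Python front end.

    Assumes the `ai2thor` package is installed. The import is lazy so the
    rest of this sketch does not depend on it.
    """
    from ai2thor.controller import Controller

    return Controller(scene=scene)


def rollout(controller: Any, actions: List[str]) -> List[Dict[str, Any]]:
    """Send each action to the simulator and record what came back."""
    log: List[Dict[str, Any]] = []
    for action in actions:
        # One turn of the loop: Python sends the action, Unity executes it,
        # and an event carrying observations and metadata is returned.
        event = controller.step(action=action)
        log.append(
            {
                "action": action,
                "success": event.metadata["lastActionSuccess"],
                "position": event.metadata["agent"]["position"],
            }
        )
    return log
```

A typical call would be `rollout(make_controller(), ["MoveAhead", "RotateRight", "MoveAhead"])`, which returns one log entry per action.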

Scene Datasets

AI2-THOR encompasses various scene datasets, including:

iTHOR: Features 120 modelled rooms from different living spaces.
RoboTHOR: Contains 89 maze-styled scenes designed for sim2real transfer studies.
ProcTHOR: Utilizes procedural generation to create 10,000 diverse houses, improving generalization.
ArchitecTHOR: Provides 10 manually designed evaluation houses to test models' real-world applicability.
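
As a worked example of the iTHOR count, the 120 rooms split into four room categories (kitchens, living rooms, bedrooms, bathrooms) of 30 scenes each. The sketch below enumerates scene names under the `FloorPlanN` naming convention; the exact names and numeric offsets are an assumption that should be checked against the installed version, and `ithor_scene_names` is a hypothetical helper.

```python
from typing import Dict, List

# Numeric offset per room category (assumed naming convention).
# Each category is assumed to hold 30 scenes, giving 4 * 30 = 120 rooms.
ROOM_OFFSETS: Dict[str, int] = {
    "kitchen": 0,
    "living_room": 200,
    "bedroom": 300,
    "bathroom": 400,
}


def ithor_scene_names(scenes_per_category: int = 30) -> Dict[str, List[str]]:
    """Return the assumed iTHOR scene names, grouped by room category."""
    return {
        room: [f"FloorPlan{offset + i}" for i in range(1, scenes_per_category + 1)]
        for room, offset in ROOM_OFFSETS.items()
    }
```

Grouping by category makes it easy to hold out a few scenes of each room type for evaluation.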

Agents and Actions

The platform supports multiple agent embodiments and actions, ranging from navigation to arm-based interaction. Agents can open objects, manipulate items with an arm, and query the environment, among other actions, allowing researchers to investigate complex behavior and control problems.
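
An interaction such as opening an object is typically expressed as an action name plus the ID of the target object. The sketch below picks the nearest visible, closed, openable object from a list of object records and builds the corresponding action. It assumes each record carries `visible`, `openable`, `isOpen`, `distance`, and `objectId` fields, as in the object metadata AI2-THOR returns; `open_nearest_action` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict, List, Optional


def open_nearest_action(objects: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Build an OpenObject action for the closest eligible object, if any."""
    candidates = [
        obj
        for obj in objects
        if obj.get("visible") and obj.get("openable") and not obj.get("isOpen")
    ]
    if not candidates:
        return None  # nothing in view can be opened
    target = min(candidates, key=lambda obj: obj["distance"])
    return {"action": "OpenObject", "objectId": target["objectId"]}
```

With a live controller, the result could be passed on as `controller.step(**action)` after checking that it is not `None`.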

Image Modalities and Objects

AI2-THOR supports various image modalities, such as RGB, depth, semantic segmentation, and normals. It also features an extensive object database with over 3,500 interactive items, enabling rich experimental setups and comprehensive training regimes.
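
Extra modalities are usually requested when the controller is created. The sketch below maps modality names to keyword flags; the flag names (`renderDepthImage`, `renderSemanticSegmentation`, `renderInstanceSegmentation`, `renderNormalsImage`) are assumed from the `ai2thor` controller options and should be verified against the installed version. `render_kwargs` is a hypothetical helper.

```python
from typing import Dict, Iterable

# Modality name -> assumed Controller keyword flag.
RENDER_FLAGS: Dict[str, str] = {
    "depth": "renderDepthImage",
    "semantic_segmentation": "renderSemanticSegmentation",
    "instance_segmentation": "renderInstanceSegmentation",
    "normals": "renderNormalsImage",
}


def render_kwargs(modalities: Iterable[str]) -> Dict[str, bool]:
    """Translate requested modalities into keyword arguments."""
    kwargs: Dict[str, bool] = {}
    for name in modalities:
        if name == "rgb":
            continue  # RGB frames are assumed to be returned by default
        if name not in RENDER_FLAGS:
            raise ValueError(f"unknown modality: {name}")
        kwargs[RENDER_FLAGS[name]] = True
    return kwargs
```

The resulting dictionary would be unpacked into the constructor, for example `Controller(scene="FloorPlan1", **render_kwargs(["rgb", "depth"]))`.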

Environment Metadata

Comprehensive environment metadata is available, providing detailed information on objects, scenes, and agent state. This makes it straightforward to define reward functions and other training signals.
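
To illustrate how metadata can drive a reward function, the sketch below scores a single step of an object-search task: a small penalty per step, a further penalty for a failed action, and a bonus once an object of the target type is visible. It assumes the metadata dictionary has a `lastActionSuccess` flag and an `objects` list whose entries carry `objectType` and `visible`; the reward values and the name `step_reward` are illustrative choices, not taken from the paper.

```python
from typing import Any, Dict, Tuple


def step_reward(
    metadata: Dict[str, Any],
    target_type: str,
    step_penalty: float = -0.01,
    failure_penalty: float = -0.05,
    success_reward: float = 1.0,
) -> Tuple[float, bool]:
    """Return (reward, done) for one step of an object-search episode."""
    reward = step_penalty
    if not metadata["lastActionSuccess"]:
        # Discourage actions the simulator rejected, e.g. walking into a wall.
        reward += failure_penalty
    found = any(
        obj["objectType"] == target_type and obj["visible"]
        for obj in metadata["objects"]
    )
    if found:
        reward += success_reward
    return reward, found
```

In a training loop this would be called as `step_reward(event.metadata, "Mug")` after each `controller.step`, ending the episode when `done` is true.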

Research Applications

AI2-THOR has been used in over 150 publications, supporting research in areas such as:

Visual Navigation: Enhancements in navigation efficiency through semantic priors and procedural generation.
Audio-Visual Navigation: Tasks combining audio and visual inputs to locate sound sources.
Vision-and-Language: Embodied instruction-following and interactive question answering.
Sim2Real Transfer: Training in simulation and deploying in real-world settings.

Competitive Analysis

Compared to other simulators like iGibson 2.0, Habitat, and SAPIEN, AI2-THOR distinguishes itself with its scalability, interaction capabilities, and rich Unity integration. Performance benchmarking indicates competitive training speeds, making it a viable option for large-scale embodied AI research.

Conclusion

AI2-THOR is a comprehensive and versatile interactive simulation platform that has contributed significantly to embodied AI and visual intelligence research. By supporting a diverse array of scenes, interactions, and agents, it helps advance AI capabilities towards human-like visual understanding and interaction. For the latest updates and resources, researchers can visit the AI2-THOR website at http://ai2thor.allenai.org.
