VirtualHome Simulation Engine
- VirtualHome Simulation Engine is a platform that represents household tasks as sequences of atomic actions, enabling rigorous evaluation of instruction-following models.
- It employs an LSTM-based parser-generator with reinforcement learning optimization to translate natural language or video input into executable stepwise programs.
- The engine integrates Unity3D simulation, pathfinding, and inverse kinematics, offering a large-scale, photorealistic dataset for multimodal research.
VirtualHome is a simulation engine for modeling and executing complex household activities by representing them as structured programs—finite sequences of atomic (inter)actions—and driving 3D agent avatars through these programs inside photorealistic Unity3D environments. The engine serves both as a research platform for instruction-following, program induction, and video semantic understanding, and as a large-scale generator of activity-labeled video datasets with full ground-truth (Puig et al., 2018).
1. System Architecture
The VirtualHome engine comprises four principal components: (a) Activity program databases, (b) program parser/generator, (c) program optimizer/interpreter, and (d) the Unity3D simulation backend.
- Activity Program Databases: Two collections are central. “ActivityPrograms” comprises 1,814 crowd-sourced, natural-language household task descriptions (average: 3.2 sentences, 21.9 words), collectively paired, after manual verification, with 2,821 executable stepwise programs (average: 11.6 steps, σ ≈ 6.5). The “VirtualHome Activity” set contains 5,193 programs synthesized from a probabilistic task grammar, each paired with human-written descriptions.
- Program Parser/Generator: Free-form instructions, given either as natural-language sentences or as temporally segmented video, are encoded with LSTM networks (text tokens via word2vec embeddings; video via per-step probability features). An LSTM decoder with attention generates the corresponding program steps sequentially. Reinforcement learning (self-critical policy gradient) fine-tunes generation to maximize both step overlap with ground truth, measured as Longest Common Subsequence (LCS) IoU, and executability within Unity3D.
- Optimizer/Interpreter: This module assigns concrete scene object instances to the program’s abstract object arguments via backtracking search, computes geometric interaction poses, and produces navigation waypoints for agent traversal.
- Unity3D Backend: Six fully furnished 3D houses (~357 objects per home) and four humanoid avatar rigs are implemented. The simulator supports 12 implemented atomic (inter)actions (selected from an ontology of 75), NavMesh pathfinding, FinalIK-based inverse kinematics, procedural animation mapping, and a multi-camera recording API for RGB, depth, segmentation, pose, and flow signals.
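The optimizer's instance-assignment step can be illustrated with a minimal backtracking search. This is a sketch, not VirtualHome's actual code; the function names and the `compatible` predicate are assumptions for illustration:

```python
# Minimal sketch of backtracking assignment of scene object instances to
# abstract program arguments (illustrative only; names are hypothetical).

def assign_instances(abstract_args, scene, compatible):
    """Map each abstract argument (class_name, arg_id) to a distinct instance.

    abstract_args: list of (class_name, arg_id) pairs from the program.
    scene: dict mapping class_name -> list of concrete instance labels.
    compatible: predicate(partial_assignment, arg_id, instance) standing in
    for geometric/affordance constraints (e.g., the milk must be inside the
    chosen fridge).
    """
    order = list(dict.fromkeys(abstract_args))  # unique args, stable order

    def backtrack(i, assignment, used):
        if i == len(order):
            return dict(assignment)
        class_name, arg_id = order[i]
        for inst in scene.get(class_name, []):
            if inst in used or not compatible(assignment, arg_id, inst):
                continue
            assignment[(class_name, arg_id)] = inst
            used.add(inst)
            result = backtrack(i + 1, assignment, used)
            if result is not None:
                return result
            used.remove(inst)  # dead end: undo and try the next candidate
            del assignment[(class_name, arg_id)]
        return None

    return backtrack(0, {}, set())
```

With duplicate fridges in the scene, the search commits to one candidate, recurses, and undoes the choice if a downstream constraint fails.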
Table 1: Key Components of the VirtualHome Engine
| Component | Main Role | Key Technologies/Details |
|---|---|---|
| Program DB | Task storage and diversity | 1,814 crowd-sourced + 5,193 synthetic programs; 75 actions, 308 object classes |
| Parser/Generator | Language/video → program induction | LSTM encoder/decoder, RL fine-tuning |
| Optimizer | Object/pose assignment, navigation | Backtracking; path/IK calculation |
| Simulator | 3D activity execution & sensor output | Unity3D, NavMesh, FinalIK, custom object effects |
2. Formalism: Programmatic Activity Representation
Each household activity is modeled as a sequence of steps, where each step is an atomic action parameterized by up to two objects. The abstract syntax (BNF) is:
```
Program    ::= ε | Step Program
Step       ::= '[' Action ']' ObjectList
ObjectList ::= '<' ObjectName '>' '(' ID ')' { '<' ObjectName '>' '(' ID ')' }
```
In LaTeX notation, a program is $P = \langle s_1, s_2, \ldots, s_n \rangle$, where each step $s_i = a_i(o_{i,1}, o_{i,2})$ applies an atomic action $a_i$ to at most two object arguments.
Example (pouring milk into a glass):
- [Walk] ⟨FRIDGE⟩(1)
- [Open] ⟨FRIDGE⟩(1)
- [Grab] ⟨MILK⟩(2)
- [Walk] ⟨TABLE⟩(3)
- [Pour] ⟨MILK⟩(2) ⟨GLASS⟩(4)
This explicit symbolic representation enables both deterministic agent execution and unambiguous evaluation of semantic task coverage.
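The grammar above is simple enough to parse with regular expressions. The following sketch (illustrative, not VirtualHome's actual parser) turns a script line such as `[Pour] <MILK>(2) <GLASS>(4)` into a structured step:

```python
import re

# Sketch of a parser for the step grammar above (illustrative only).
STEP_RE = re.compile(r"\[(?P<action>\w+)\]\s*(?P<args>(?:<\w+>\s*\(\d+\)\s*)*)")
ARG_RE = re.compile(r"<(?P<name>\w+)>\s*\((?P<id>\d+)\)")

def parse_step(line):
    """Parse '[Action] <Obj>(id) ...' into (action, [(obj, id), ...])."""
    m = STEP_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"malformed step: {line!r}")
    args = [(a.group("name"), int(a.group("id")))
            for a in ARG_RE.finditer(m.group("args"))]
    if len(args) > 2:
        raise ValueError("steps take at most two object arguments")
    return m.group("action"), args

def parse_program(text):
    """Parse a newline-separated script into a list of steps."""
    return [parse_step(l) for l in text.splitlines() if l.strip()]
```

Because object IDs survive parsing, repeated mentions of the same instance (e.g., `<FRIDGE>(1)` in both Walk and Open) remain linked across steps.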
3. Atomic (Inter)actions: Semantics and Implementation
VirtualHome operationalizes activities by decomposing them into atomic (inter)actions, each precisely defined by a signature, parameters, preconditions, and effects. Of the ~75 verbs identified, 12 were fully implemented.
Selected examples:
- Grab(O):
  - Parameters: O (movable object)
  - Preconditions: dist(agent, O) ≤ d_interact; agent.hand_empty = true
  - Effects: agent.holding = O; O attached to agent.hand
- SwitchOn(E)/SwitchOff(E):
  - Parameters: E (toggleable appliance)
  - Preconditions: proximity (dist(agent, E) ≤ d_interact)
  - Effects: E.state := ON/OFF; animation/sound triggered
- Open(C)/Close(C):
  - Parameters: C (container)
  - Preconditions: proximity (dist(agent, C) ≤ d_interact)
  - Effects: C.state := OPEN/CLOSED; hinge/door animation
- Walk(T):
  - Parameters: T (target)
  - Preconditions: none
  - Effects: trajectory to T computed via NavMesh; agent traverses it at walking speed
These atomic actions compose into visually plausible, temporally ordered activity chains in the simulated scene; interaction effects, agent states, and physical constraints are encoded as precondition and effect rules.
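A minimal symbolic interpreter conveys how precondition and effect rules gate execution. This is a sketch under an assumed state model (field names like `hand_empty` and `holding` are illustrative); the real engine additionally checks geometry and animation feasibility:

```python
# Sketch of precondition/effect checking for a few atomic actions
# (illustrative state model; field names are assumptions).

def exec_step(state, action, obj):
    """Apply one step to a symbolic world state, or raise if a
    precondition is violated. state: dict with 'hand_empty', 'holding',
    'open' (set of open containers), and 'at' (current target)."""
    if action == "Grab":
        if not state["hand_empty"]:
            raise RuntimeError("Grab precondition violated: hand not empty")
        state["hand_empty"] = False
        state["holding"] = obj
    elif action == "Open":
        if obj in state["open"]:
            raise RuntimeError(f"Open precondition violated: {obj} already open")
        state["open"].add(obj)
    elif action == "Walk":
        state["at"] = obj  # pathing handled by NavMesh in the real engine
    else:
        raise ValueError(f"unimplemented action: {action}")
    return state
```

Running `[Walk fridge] [Open fridge] [Grab milk]` in order succeeds, while a second Grab with a full hand raises, which is exactly the signal the executability metric counts.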
4. Simulator Configuration and Dataset Generation
The 3D environment is realized in Unity3D with several technical augmentations:
- Scenes: Six hand-modeled homes with static geometry, textured surfaces, and roughly 357 interactive objects each. Four humanoid rigs support multiple embodiments.
- Navigation: Unity NavMesh provides obstacle-aware pathfinding. Dynamic object attachment points and an object-support knowledge base enable context-aware placement and affordance realization.
- Animation: Per-action mapping to native or procedural animation clips (e.g., “reach_and_grasp”), coordinated with FinalIK-derived inverse kinematics. Handedness, attach-point, and relative offsets are synchronized to object geometry.
- Physics and Collision: RigidBody and Collider components on active objects; continuous collision detection for agent-object interaction stability.
- Randomization: Scene, avatar, camera placement (6–9 static per room, each jittered), object placements, textures, and action velocity are randomized to enrich data diversity.
During program execution, all sensor streams—RGB (30 fps), depth, segmentation masks (semantic/instance), 2D/3D pose, and dense optical flow—are recorded with precise step-wise alignment.
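The randomization scheme can be pictured as sampling one recording configuration per clip. The home and avatar counts come from the text above; the jitter and speed ranges are assumptions, not the engine's actual values:

```python
import random

# Sketch: sample one randomized recording configuration per clip
# (home/avatar counts from the paper; jitter and speed ranges assumed).

def sample_config(rng, cameras_per_room=(6, 9), jitter=0.15):
    home = rng.randrange(6)            # six hand-modeled homes
    avatar = rng.randrange(4)          # four humanoid rigs
    n_cams = rng.randint(*cameras_per_room)
    cams = [(c, rng.uniform(-jitter, jitter), rng.uniform(-jitter, jitter))
            for c in range(n_cams)]    # per-camera positional jitter
    speed = rng.uniform(0.8, 1.2)      # action-velocity randomization
    return {"home": home, "avatar": avatar, "cameras": cams, "speed": speed}
```

Seeding the generator makes each clip's configuration reproducible, which keeps sensor streams and ground-truth labels aligned across re-renders.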
5. Dataset Statistics and Specifications
VirtualHome offers both crowd-sourced and synthetic program datasets.
- ActivityPrograms (crowd-sourced):
- 1,814 instructions, 2,821 manually-verified programs
- 75 atomic actions, 308 object classes, 2,709 unique (action,objects) step types
- Program lengths: mean 11.6 steps (σ ≈ 6.5)
- Human-judged completeness: 64% perfect, 28% minor omissions, 8% major omissions
- VirtualHome Activity (synthetic):
- 5,193 programs generated from formal grammar
- Program length: mean 9.6 steps
- Each paired with human-generated descriptions
- Animated with 6 homes, 4 avatars, randomized cameras (yielding ~25,000 2-second video clips)
This system establishes a uniquely diverse and fully labeled corpus for evaluation and supervised learning in video/language domains.
6. Evaluation Metrics and Protocols
VirtualHome provides rigorous metrics for evaluating program parsing and execution:
- Program Overlap (LCS-IoU): $\mathrm{IoU}_{\mathrm{LCS}}(P, \hat{P}) = \frac{|\mathrm{LCS}(P, \hat{P})|}{|P| + |\hat{P}| - |\mathrm{LCS}(P, \hat{P})|}$, the longest common subsequence of steps normalized by the union of both programs.
- Executability Rate: the fraction of generated programs that run to completion in the simulator without violating any action precondition.
- Element-wise Accuracy: action label, object class, and full step match rates.
- RL Reward: a combination of the LCS overlap and executability terms, used as the self-critical policy-gradient reward.
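The LCS-IoU overlap can be computed with standard dynamic programming. In this sketch, each step is compared as a whole `(action, objects)` item, so partial step matches score zero:

```python
# Sketch: LCS-based IoU between a ground-truth and a generated program,
# where each step is compared as an exact (action, objects) item.

def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_iou(gt, pred):
    """|LCS| normalized by the union of both programs' steps."""
    l = lcs_length(gt, pred)
    denom = len(gt) + len(pred) - l
    return l / denom if denom else 1.0
```

For example, a prediction that drops one of four ground-truth steps but keeps the order scores 3 / (4 + 3 − 3) = 0.75.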
Reported benchmark values (text→program, VirtualHome Activity): MLE baseline step-IoU ~0.729, executability ~38.6%; RL optimization (full reward) achieves IoU ~0.774, exec ~39.8%.
7. End-to-End Example: Instruction to Execution
An illustrative transformation workflow:
- User Input: "Take an empty glass, open the fridge, pour milk into the glass, and then put the glass on the table."
- Program Induction (Encoder-Decoder):
- Encoder LSTM processes token embeddings.
- Decoder LSTM with attention emits the step sequence:
- [Walk] ⟨FRIDGE⟩(1)
- [Open] ⟨FRIDGE⟩(1)
- [Grab] ⟨MILK⟩(2)
- [Walk] ⟨TABLE⟩(3)
- [Pour] ⟨MILK⟩(2) ⟨GLASS⟩(4)
- [Put] ⟨GLASS⟩(4) ⟨TABLE⟩(3)
- Program Optimization:
- Assigns correct object instances (nearest fridge among duplicates).
- Computes hand/interaction mappings, navigation paths.
- Simulation:
- Agent navigates, interacts, and animates in 3D per program schedule.
- Event-specific effects (milk pouring shader, object parenting) produced in real time.
- Recording:
- All sensor data is captured with aligned timestamps for each program step.
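The workflow above can be summarized as a pipeline of stages. In this sketch each stage is a stand-in for the corresponding VirtualHome module; the function names are hypothetical:

```python
# Sketch of the end-to-end pipeline; each stage is a stand-in for the
# corresponding VirtualHome module (function names are hypothetical).

def run_pipeline(instruction, parse, optimize, simulate, record):
    """instruction -> program -> grounded plan -> executed, recorded clip."""
    program = parse(instruction)      # encoder-decoder program induction
    plan = optimize(program)          # instance assignment, paths, IK poses
    frames = simulate(plan)           # Unity3D execution of the plan
    return record(frames, program)    # step-aligned sensor streams
```

Keeping the stages as separate functions mirrors the engine's decoupling: the parser can be swapped (text vs. video input) without touching the optimizer or simulator.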
VirtualHome thereby enables mapping of natural language or demonstrations to visually grounded, stepwise executable 3D agent behavior, supporting research in program induction, robotics, and multimodal learning (Puig et al., 2018).