SIMA 2: Embodied Agents & Metasurfaces
- SIMA 2 refers to two distinct lines of work: a generalist embodied multimodal agent for diverse 3D virtual-world tasks, and an architecture of dual stacked intelligent metasurfaces for ISAC.
- The embodied agent utilizes large-scale foundation models, multimodal perception, and self-improvement strategies to execute long-horizon, interactive tasks.
- The metasurface architecture deploys min–max phase-shift optimization and compressed sensing to significantly boost radar estimation and communication performance.
SIMA 2 encompasses two distinct, advanced research domains: (1) the SIMA 2 agent, a generalist embodied multimodal agent for virtual environments; and (2) SIMA 2 as an architecture of dual stacked intelligent metasurfaces in bistatic integrated sensing and communication (ISAC) systems. Below, both usages are explicated in full, as reflected in the primary literature (Team et al., 4 Dec 2025, Ranasinghe et al., 29 Apr 2025).
1. SIMA 2: Generalist Embodied Agent for Virtual Worlds
SIMA 2 is a goal-directed, interactive agent architecture for active perception, multimodal reasoning, and long-horizon task execution situated within highly diverse 3D virtual worlds. It integrates large-scale foundation models—specifically, a variant of Gemini—into an agentic loop that spans perception, internal reasoning, dialogue, action planning, and skill self-improvement (Team et al., 4 Dec 2025).
1.1 Problem Scope and Motivations
SIMA 2 is constructed to unify instruction following, dialog-based collaboration, and fine-grained, embodied control across games, simulators, and photorealistic engine-based worlds. Unlike conventional LLMs and vision-language models (VLMs), SIMA 2 must perform precise motor actions (keyboard, mouse) based solely on visual observations and mixed-modal user queries. The agent’s objectives encompass:
- Goal-driven action in worlds with widely varying dynamics, visuals, and affordances.
- Reasoned planning, both implicit (internal “thought”) and explicit (user dialogue).
- Robust generalization and self-driven skill acquisition in previously unseen domains.
The transition from SIMA 1, which handled only short, fixed natural language tasks and output direct action sequences without dialogue, to SIMA 2 reflects a significant leap in agentic and reasoning fluency, powered by integration with Gemini foundation models.
1.2 Core Architecture and Modalities
The agent backbone employs a Gemini Flash-Lite checkpoint (tens of billions of parameters) for tractable inference, with the option of hierarchical “steering” via larger Gemini Pro models (>100B parameters).
Agent–Environment Interface
- Observations: 720p RGB video frames (tokenized), augmented by current user instructions (text/sketch/image).
- Action Space: 96 discrete keyboard keys, mouse-click tokens, discretized positional movements.
- Actuation: Actions are output as structured text “chunks” parsed into low-level commands for environment input.
Internal Stream and Output
Within each output token window, SIMA 2 generates flexible blocks:
- <Reason> ... </Reason>: Internal chain-of-thought for self-monitoring and planning.
- <Say> ... </Say>: Dialogue replies, clarifications, or status updates.
- <Act> ... </Act>: Embodied, parseable commands for execution.
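To make this output protocol concrete, here is a minimal parsing sketch: the tag names follow the description above, but the regex-based parser and the command syntax inside <Act> are illustrative assumptions rather than the paper's actual format.

```python
import re

# Hypothetical parser for SIMA 2-style output chunks; the <Reason>/<Say>/<Act>
# tags follow the description above, but the <Act> command syntax is assumed.
TAG_RE = re.compile(r"<(Reason|Say|Act)>(.*?)</\1>", re.DOTALL)

def parse_chunk(chunk: str) -> dict:
    """Split one model output chunk into reasoning, dialogue, and action strings."""
    blocks = {"Reason": [], "Say": [], "Act": []}
    for tag, body in TAG_RE.findall(chunk):
        blocks[tag].append(body.strip())
    return blocks

# Example usage with an assumed command syntax inside <Act>.
chunk = "<Reason>The door is ahead.</Reason><Say>Opening the door.</Say><Act>press W; click 0.52 0.48</Act>"
parsed = parse_chunk(chunk)
commands = [cmd.strip() for act in parsed["Act"] for cmd in act.split(";")]
print(parsed["Say"], commands)
```

In practice, the <Act> payload would be decoded into the 96 discrete keyboard keys, mouse-click tokens, and discretized positional movements listed in the agent–environment interface above.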
Modular Composition
- Perception: Multimodal visual/textual encoder for comprehensive embedding of observations/prompts.
- Reasoning/Planning: Transformer backbone for generating reasoning traces and subgoal hierarchies.
- Action Decoder: Mapping of planned actions to environment-controllable events.
- Dialogue Interface: Conditional generation of interactive conversational or clarificatory text.
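A minimal sketch of how these modules could compose into an agentic loop is shown below; it reuses the hypothetical parse_chunk helper from the previous sketch, and all other names (encoder, policy, env methods) are likewise illustrative assumptions rather than the paper's interfaces.

```python
# Schematic perception-reasoning-action loop for a SIMA 2-style agent.
# encoder, policy, and env are placeholder objects, not the paper's API.
def run_episode(env, encoder, policy, instruction, max_steps=500):
    history = []                                   # token history: observations + prior chunks
    obs = env.reset()
    for _ in range(max_steps):
        tokens = encoder.encode(obs, instruction)  # multimodal perception of frame + prompt
        history.append(tokens)
        chunk = policy.generate(history)           # emits <Reason>/<Say>/<Act> blocks
        parsed = parse_chunk(chunk)                # see the parser sketch above
        for reply in parsed["Say"]:
            env.send_dialogue(reply)               # dialogue interface
        for act in parsed["Act"]:
            obs, done = env.execute(act)           # action decoder -> keyboard/mouse events
            if done:
                return history
        history.append(chunk)
    return history
```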
1.3 Learning Objectives and Self-Improvement
SIMA 2’s formalization employs the following constructs:
- State Space: history of observed video tokens, textual context, and generated reasoning/dialogue blocks.
- Action Space: the discrete set of keyboard/mouse commands (text-encoded).
Learning Objectives
Training combines supervised fine-tuning (SFT) on human- and Gemini-annotated trajectories with reinforcement learning (RL) fine-tuning driven by reward-model scores (see the regime in Section 1.4).
Self-Improvement Procedures
- Task proposals and open-ended reward assignment are provided by Gemini Pro, enabling curriculum-free skill acquisition via:
- Autonomous task-setting
- Self-execution and trajectory scoring
- Replay-buffered fine-tuning with synthesized data
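A minimal sketch of this propose–execute–score–finetune cycle follows; the helper names (propose_task, rollout, score_trajectory, finetune) and the score threshold are assumptions, not the paper's interfaces.

```python
# Schematic self-improvement cycle: a larger model proposes tasks and scores
# rollouts (0-100), and high-scoring trajectories are replayed for fine-tuning.
# All names and the threshold are illustrative placeholders.
def self_improvement_round(agent, teacher, env, replay_buffer, score_threshold=70):
    task = teacher.propose_task(env.describe())         # autonomous task-setting
    trajectory = agent.rollout(env, task)               # self-execution in the environment
    score = teacher.score_trajectory(task, trajectory)  # open-ended reward assignment, 0-100
    if score >= score_threshold:
        replay_buffer.add(task, trajectory, score)      # keep only useful experience
    agent.finetune(replay_buffer.sample(batch_size=1024))  # replay-buffered fine-tuning
    return score
```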
1.4 Training, Data, and Implementation
- Datasets:
- Human-annotated trajectories (one-/two-person setups)
- Bridge data (10k Gemini Pro-annotated examples with interleaved Reason/Say tags)
- RL tasks with programmatic/ground truth verification
- Compute: Hundreds of A100-80GB GPUs, TPU pods for bridge data/reward scoring
- Training Regime: Typical batch size 1024, learning rate 1e−5, 100k SFT and 50k RL steps
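For reference, the reported hyperparameters can be gathered into a small configuration sketch; only the numeric values come from the summary above, while the field names are illustrative.

```python
from dataclasses import dataclass

# Training configuration as reported above; field names are illustrative.
@dataclass
class Sima2TrainConfig:
    batch_size: int = 1024
    learning_rate: float = 1e-5
    sft_steps: int = 100_000               # supervised fine-tuning steps
    rl_steps: int = 50_000                 # RL / self-improvement steps
    frame_resolution: tuple = (1280, 720)  # 720p RGB observations
    num_discrete_keys: int = 96            # discrete keyboard action tokens
```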
1.5 Evaluation and Results
Environments
- Training: Construction Lab, Playhouse, WorldLab, Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Space Engineers, Valheim, Wobbly Life
- Generalization: ASKA, MineDojo (50 tasks × 15 seeds), The Gunk, Genie 3
Metrics
- Success Rate (% of tasks completed within timeouts)
- Task completion time
- Reward-model score (0–100 for self-improvement)
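As a trivial reference, the first two metrics can be computed from per-episode records as below; the record format (succeeded, duration, timeout) is an assumption.

```python
# Success rate (% completed within timeout) and mean completion time.
# Each episode record is assumed to be (succeeded: bool, duration_s: float, timeout_s: float).
def evaluate(episodes):
    successes = [ep for ep in episodes if ep[0] and ep[1] <= ep[2]]
    success_rate = 100.0 * len(successes) / len(episodes)
    mean_time = sum(ep[1] for ep in successes) / max(len(successes), 1)
    return success_rate, mean_time
```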
Key Findings
| Aspect | Result (SIMA 2 vs. baselines) |
|---|---|
| Success Rate (train) | Doubles SIMA 1, within 5–10 pp of humans |
| Skill categories | Advances in all eight—esp. interaction, object management; lags in combat |
| Generalization | 10 pp gain over SIMA 1 on ASKA/MineDojo; multi-step tasks in The Gunk; zero-shot navigation in Genie 3 |
| Ablation | Gemini backbone alone: 3.2–7.0% success (vs. SIMA 2’s post-fine-tuning score) |
| Catastrophic Forgetting | <20% relative drop in coding, math, STEM-QA |
1.6 Case Studies and Limitations
Case Study Highlights:
- Embodied dialogue and internal chain-of-thought provide human-like transparency and iterative plan updates.
- Complex multi-modal instruction handling: correct sequencing and conditional action.
- Abstract task parsing (e.g., “do the opposite”) enabled by Gemini Pro guidance.
Limitations:
- Less efficacy on extremely long-horizon, stochastic, or high-precision tasks (e.g., split-second combat).
- Memory bottleneck at 4k tokens (no explicit episodic memory).
- Coarser ultra-low-level motor control compared to humans due to discretized action interface.
1.7 Conclusions and Outlook
- First demonstration of open-ended, self-improving embodied agents in open virtual domains.
- The multimodal transformer approach—melding perception, reasoning, planning, action, dialogue—enables generalist task coverage.
- Future work aims for transfer to real robots (“Gemini Robotics 1.5”), memory augmentation, scale-up to larger Gemini models, and Darwin-complete perpetual curricula in richer simulation worlds (Team et al., 4 Dec 2025).
2. SIMA 2: Dual-Stacked Intelligent Metasurfaces for ISAC
SIMA 2 also denotes a dual-metasurface architecture in bistatic ISAC, where both transmitter (TX) and receiver (RX) employ stacked intelligent metasurfaces (SIMs) for joint radar sensing and wireless communication enhancement (Ranasinghe et al., 29 Apr 2025).
2.1 Architecture and Parametrization
- TX-SIM: a stack of metasurface layers, each with a fixed number of meta-atoms, with uniform inter-element spacing and a parameter governing the spatial correlation among meta-atoms.
- RX-SIM: an analogous stack at the receiver, with its own layer and meta-atom counts, the same inter-element spacing, and its own spatial-correlation parameter.
The layer-wise response combines scattering (inter-layer propagation) and phase-shift terms. For the TX-SIM, the end-to-end response takes the usual SIM cascade form
$$\mathbf{B}_T = \boldsymbol{\Phi}_T^{(L_T)}\,\mathbf{W}_T^{(L_T)}\cdots\mathbf{W}_T^{(2)}\,\boldsymbol{\Phi}_T^{(1)},$$
where $\boldsymbol{\Phi}_T^{(\ell)}$ is the diagonal phase-shift matrix of layer $\ell$ and $\mathbf{W}_T^{(\ell)}$ the scattering matrix between layers $\ell-1$ and $\ell$; the RX-SIM response $\mathbf{B}_R$ is defined analogously. Overall, the SIM-induced effective channel is
$$\mathbf{H}_{\mathrm{eff}} = \mathbf{B}_R\,\mathbf{H}\,\mathbf{B}_T,$$
where $\mathbf{H}$ is the underlying doubly-dispersive channel matrix.
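Numerically, the effective channel above is just a product of per-layer phase-shift and propagation matrices. The sketch below assembles it with random placeholder matrices standing in for the paper's physical propagation and doubly-dispersive channel models; the layer and meta-atom counts are arbitrary.

```python
import numpy as np

def sim_response(phases, W):
    """Cascade response of one SIM: phases[l] are per-layer phase shifts,
    W[l] is the propagation (scattering) matrix feeding layer l (W[0] unused)."""
    B = np.diag(np.exp(1j * phases[0]))                 # first-layer phase-shift matrix
    for l in range(1, len(phases)):
        B = np.diag(np.exp(1j * phases[l])) @ W[l] @ B  # phase shift after each propagation hop
    return B

rng = np.random.default_rng(0)
L, N = 3, 16                                            # layers and meta-atoms per layer (placeholders)
phases_tx = rng.uniform(0, 2 * np.pi, (L, N))
phases_rx = rng.uniform(0, 2 * np.pi, (L, N))
W = [None] + [rng.standard_normal((N, N)) / np.sqrt(N) for _ in range(L - 1)]
H = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))   # doubly-dispersive stand-in
H_eff = sim_response(phases_rx, W) @ H @ sim_response(phases_tx, W)  # SIM-induced channel
```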
2.2 Min–Max Phase-Shift Optimization
A robust phase-shift assignment maximizes the minimum path gain among the propagation paths (a max–min, i.e., min–max fair, formulation):
$$\max_{\boldsymbol{\theta}_T,\,\boldsymbol{\theta}_R}\ \min_{p}\ \big|h_p(\boldsymbol{\theta}_T,\boldsymbol{\theta}_R)\big|^2$$
subject to $\theta \in [0, 2\pi)$ for all TX- and RX-SIM phase shifts $\theta$.
Gradient Update
Closed-form gradients of the weakest path's gain and a normalized steepest-ascent rule drive the optimization:
$$\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} + \frac{\mu_i}{\big\|\nabla_{\boldsymbol{\theta}}\,|h_{p^\star}|^2\big\|}\,\nabla_{\boldsymbol{\theta}}\,|h_{p^\star}|^2,$$
where $p^\star$ indexes the currently weakest path and $\mu_i$ is a decayed step size.
Pseudocode (Abbreviated)
```python
# Abbreviated min–max phase-shift optimization loop.
# weakest_path_index, path_gain_gradients, and grad_norms stand in for the
# paper's closed-form expressions.
for i in range(i_GD):
    p_star = weakest_path_index(theta_tx, theta_rx)               # path with minimum gain
    g_tx, g_rx = path_gain_gradients(theta_tx, theta_rx, p_star)  # closed-form gradients
    n_tx, n_rx = grad_norms(g_tx, g_rx)                           # normalization factors
    mu = mu_0 / (1 + i * decay)                                   # decayed step size
    theta_tx = theta_tx + mu * g_tx / n_tx                        # normalized steepest ascent
    theta_rx = theta_rx + mu * g_rx / n_rx
```
2.3 Radar Parameter Estimation: Compressed Sensing & PDA
Post demodulation (AFDM/OFDM/OTFS), the measurement model is cast in the sparse form
$$\mathbf{y} = \boldsymbol{\Omega}\,\mathbf{x} + \mathbf{n},$$
where $\boldsymbol{\Omega}$ is a dictionary of delay–Doppler-shifted pilot responses, $\mathbf{x}$ is a sparse vector whose support encodes the targets' delays and Doppler shifts, and $\mathbf{n}$ is noise.
Sparse recovery exploits a compressed sensing-based probabilistic data association (PDA) algorithm, estimating reflection delays and Dopplers.
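The estimator itself is the paper's compressed-sensing-based PDA algorithm; as a stand-in illustration of the generic sparse-recovery step on $\mathbf{y} = \boldsymbol{\Omega}\mathbf{x} + \mathbf{n}$, the sketch below uses plain orthogonal matching pursuit (OMP), which is not the PDA procedure.

```python
import numpy as np

def omp(Omega, y, k):
    """Generic orthogonal matching pursuit: recover a k-sparse x from y ≈ Omega @ x.
    The recovered support indexes a delay-Doppler grid, yielding delay/Doppler estimates."""
    residual, support = y.copy(), []
    for _ in range(k):
        corr = np.abs(Omega.conj().T @ residual)     # correlate residual with dictionary atoms
        corr[support] = 0                            # do not reselect chosen atoms
        support.append(int(np.argmax(corr)))
        A = Omega[:, support]
        x_s, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares refit on current support
        residual = y - A @ x_s
    x_hat = np.zeros(Omega.shape[1], dtype=complex)
    x_hat[support] = x_s
    return x_hat, support                            # support -> (delay, Doppler) bin indices
```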
2.4 ISAC-Enabling Waveform Design
The delay–Doppler (DD) channel model imposes a doubly-dispersive structure of the form
$$h(\tau,\nu) = \sum_{p=1}^{P} h_p\,\delta(\tau-\tau_p)\,\delta(\nu-\nu_p),$$
with per-path gains $h_p$, delays $\tau_p$, and Doppler shifts $\nu_p$.
The SIMA 2 modifications boost all path gains via SIM-induced harmonization, mitigating doubly-dispersive fading.
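For concreteness, a discretized P-path doubly-dispersive channel of this form can be constructed as follows; this is a generic textbook-style construction, independent of the specific waveform and of the paper's exact parametrization.

```python
import numpy as np

def dd_channel(N, delays, dopplers, gains):
    """Discrete doubly-dispersive channel H = sum_p h_p * Delta^{nu_p} * Pi^{tau_p},
    with Pi a one-sample cyclic delay and Delta a diagonal Doppler phase ramp."""
    Pi = np.roll(np.eye(N), 1, axis=0)                        # one-sample cyclic delay matrix
    n = np.arange(N)
    H = np.zeros((N, N), dtype=complex)
    for h_p, tau_p, nu_p in zip(gains, delays, dopplers):
        Delta = np.diag(np.exp(2j * np.pi * nu_p * n / N))    # Doppler phase ramp
        H += h_p * Delta @ np.linalg.matrix_power(Pi, tau_p)  # delayed + Doppler-shifted path
    return H

H = dd_channel(N=64, delays=[0, 3, 7], dopplers=[0, 1, -2], gains=[1.0, 0.6, 0.3])
```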
2.5 Numerical Performance
Sensing (radar parameter estimation, RPE) MSE (Table 1)
| Waveform | No SIM (baseline) | SIMA 2-Optimized | Resolution limit |
|---|---|---|---|
| OFDM | 1.0 m @ 0 dB | 0.10 m (+10 dB) | 0.01 m |
| OTFS | 0.5 m | 0.05 m (+10 dB) | 0.005 m |
| AFDM | 0.4 m | 0.04 m (+10 dB) | 0.004 m |
Communication BER (Table 2)
| Waveform | No SIM | SIMA 2-Optimized | Gain |
|---|---|---|---|
| OFDM |  | (+5 dB) | 0.7 dB |
| OTFS |  | (+7 dB) | 2.5 dB |
| AFDM |  | (+8 dB) | 3 dB |
SIMA 2 improvements persist even for waveforms optimized for sensing, and OTFS/AFDM see higher gains due to inherent DD-domain robustness.
2.6 Significance and Outlook
The dual-metasurface SIMA 2 setup enables up to 10 dB improvement in radar parameter estimation MSE and 1–3 dB better BER than no-SIM baselines, across a wide range of ISAC waveforms in highly doubly-dispersive bistatic links. This demonstrates the versatility of parametrized intelligent metasurfaces for joint radar and communications in next-generation wireless systems (Ranasinghe et al., 29 Apr 2025).
References:
- "SIMA 2: A Generalist Embodied Agent for Virtual Worlds" (Team et al., 4 Dec 2025)
- "Parametrized Stacked Intelligent Metasurfaces for Bistatic Integrated Sensing and Communications" (Ranasinghe et al., 29 Apr 2025)