SIMA 2: Embodied Agents & Metasurfaces

Updated 5 December 2025

SIMA 2 is a dual-purpose paradigm that integrates an embodied multimodal agent for diverse 3D virtual world tasks with advanced dual-stacked metasurfaces for ISAC.
The embodied agent utilizes large-scale foundation models, multimodal perception, and self-improvement strategies to execute long-horizon, interactive tasks.
The metasurface architecture deploys min–max phase-shift optimization and compressed sensing, improving radar parameter estimation MSE by up to 10 dB and BER by 1–3 dB over no-SIM baselines.

SIMA 2 encompasses two distinct, advanced research domains: (1) the SIMA 2 agent, a generalist embodied multimodal agent for virtual environments; and (2) SIMA 2 as an architecture of dual stacked intelligent metasurfaces in bistatic integrated sensing and communication (ISAC) systems. Below, both usages are explicated in full, as reflected in the primary literature (Team et al., 4 Dec 2025, Ranasinghe et al., 29 Apr 2025).

1. SIMA 2: Generalist Embodied Agent for Virtual Worlds

SIMA 2 is a goal-directed, interactive agent architecture for active perception, multimodal reasoning, and long-horizon task execution situated within highly diverse 3D virtual worlds. It integrates large-scale foundation models—specifically, a variant of Gemini—into an agentic loop that spans perception, internal reasoning, dialogue, action planning, and skill self-improvement (Team et al., 4 Dec 2025).

1.1 Problem Scope and Motivations

SIMA 2 is constructed to unify instruction following, dialog-based collaboration, and fine-grained, embodied control across games, simulators, and photorealistic engine-based worlds. Unlike conventional large language models (LLMs) and vision-language models (VLMs), SIMA 2 must perform precise motor actions (keyboard, mouse) based solely on visual observations and mixed-modal user queries. The agent’s objectives encompass:

Goal-driven action in worlds with widely varying dynamics, visuals, and affordances.
Reasoned planning, both implicit (internal “thought”) and explicit (user dialogue).
Robust generalization and self-driven skill acquisition in previously unseen domains.

The transition from SIMA 1, which handled only short, fixed natural language tasks and output direct action sequences without dialogue, to SIMA 2 reflects a significant leap in agentic and reasoning fluency, powered by integration with Gemini foundation models.

1.2 Core Architecture and Modalities

The agent backbone employs a Gemini Flash-Lite checkpoint (tens of billions of parameters) for tractable inference, with the option of hierarchical “steering” via larger Gemini Pro models (>100B parameters).

Agent–Environment Interface

Observations: 720p RGB video frames (tokenized), augmented by current user instructions (text/sketch/image).
Action Space: 96 discrete keyboard keys, mouse-click tokens, discretized $\Delta x, \Delta y$ positional movements.
Actuation: Actions are output as structured text “chunks” parsed into low-level commands for environment input.

Internal Stream and Output

Within each output token window, SIMA 2 generates flexible blocks:

<Reason> ... </Reason>: Internal chain-of-thought for self-monitoring and planning.
<Say> ... </Say>: Dialogue replies, clarifications, or status updates.
<Act> ... </Act>: Embodied, parseable commands for execution.
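The tagged output stream above can be illustrated with a small parser. This is a minimal sketch only: the source names the three block types but not their exact serialization, so the regular expression, the `;` separator inside `<Act>`, and the command tokens `KEY_W` and `MOUSE_DX_-3` are all hypothetical.

```python
import re

# Hypothetical tag grammar: the block names come from the text, the exact
# serialization is an assumption made for illustration.
BLOCK_RE = re.compile(r"<(Reason|Say|Act)>(.*?)</\1>", re.DOTALL)

def parse_blocks(text):
    """Return a list of (tag, content) pairs in the order they appear."""
    return [(m.group(1), m.group(2).strip()) for m in BLOCK_RE.finditer(text)]

def extract_actions(text):
    """Split every <Act> block into command tokens (assumed ';'-separated)."""
    actions = []
    for tag, content in parse_blocks(text):
        if tag == "Act":
            actions.extend(tok.strip() for tok in content.split(";") if tok.strip())
    return actions

sample = (
    "<Reason>The door is to the left.</Reason>"
    "<Say>Heading to the door.</Say>"
    "<Act>KEY_W; MOUSE_DX_-3</Act>"
)
blocks = parse_blocks(sample)
actions = extract_actions(sample)
```

Keeping the parse step separate from the action extraction means the `<Reason>` and `<Say>` content can be logged or shown to the user without ever reaching the environment input.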

Modular Composition

Perception: Multimodal visual/textual encoder for comprehensive embedding of observations/prompts.
Reasoning/Planning: Transformer backbone for generating reasoning traces and subgoal hierarchies.
Action Decoder: Mapping of planned actions to environment-controllable events.
Dialogue Interface: Conditional generation of interactive conversational or clarificatory text.

1.3 Learning Objectives and Self-Improvement

SIMA 2’s formalization employs the following constructs:

State Space: $\mathcal{S}$ is approximated by the history of observed video tokens, textual context, and generated reasoning/dialogue blocks.
Action Space: $\mathcal{A}$ is the discrete set of keyboard/mouse commands (text-encoded).

Learning Objectives

Supervised Fine-Tuning (SFT):

$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(s,a^*)\in\mathcal{D}} \log \pi_\theta(a^* \mid s)$
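As a worked illustration of the SFT objective, the sketch below evaluates the negative log-likelihood on a toy tabular policy. The states, actions, and probabilities are invented for the example; the real policy is a transformer over tokens, not a lookup table.

```python
import math

def sft_loss(policy, dataset):
    """Negative log-likelihood of expert actions: -sum log pi(a* | s)."""
    return -sum(math.log(policy[s][a_star]) for s, a_star in dataset)

# Toy policy: state -> {action: probability}. All names are illustrative.
policy = {
    "facing_door": {"KEY_W": 0.5, "KEY_A": 0.25, "CLICK": 0.25},
    "facing_wall": {"KEY_W": 0.25, "KEY_A": 0.5, "CLICK": 0.25},
}
# Expert demonstrations as (state, expert action) pairs.
dataset = [("facing_door", "KEY_W"), ("facing_wall", "KEY_A")]
loss = sft_loss(policy, dataset)  # equals 2 * ln(2) for these probabilities
```

The loss falls to zero only when the policy places probability 1 on every demonstrated action.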

Reinforcement Learning (RL):

$J(\theta) = \mathbb{E}_\tau\left[\mathbb{E}_{\pi_\theta}\left[\sum_t R_\tau(s_t, a_t)\right]\right]$

$\mathcal{L}_{\mathrm{RL}}(\theta) = -J(\theta) + \lambda\, D_{\mathrm{KL}}\bigl(\pi_\theta \Vert \pi_{\mathrm{SFT}}\bigr)$
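The KL term keeps the RL policy close to the SFT policy. The sketch below evaluates the regularized loss for a single state with a discrete action distribution; this is a simplification assumed for illustration, since the objective in the text is defined over whole trajectories, and the numeric values are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same actions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0.0)

def rl_loss(expected_return, pi_theta, pi_sft, lam):
    """L_RL = -J + lambda * D_KL(pi_theta || pi_sft), evaluated at one state."""
    return -expected_return + lam * kl_divergence(pi_theta, pi_sft)

pi_sft = {"KEY_W": 0.5, "KEY_A": 0.5}      # reference (SFT) policy
pi_theta = {"KEY_W": 0.75, "KEY_A": 0.25}  # current RL policy
penalty = kl_divergence(pi_theta, pi_sft)
loss = rl_loss(expected_return=1.0, pi_theta=pi_theta, pi_sft=pi_sft, lam=0.1)
```

A larger $\lambda$ makes the same drift from the SFT policy more expensive, trading reward for stability.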

Self-Improvement Procedures

Task proposals and open-ended reward assignment are provided by Gemini Pro, enabling curriculum-free skill acquisition via:
- Autonomous task-setting
- Self-execution and trajectory scoring
- Replay-buffered fine-tuning with synthesized data
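The three steps above can be sketched as a loop. Everything below is a stand-in: `propose_task`, `run_episode`, and `score_trajectory` are hypothetical placeholders for the task-setter, the agent, and the reward model, and the score threshold used to filter trajectories is an assumption rather than something the source specifies.

```python
import random

def propose_task(rng):
    """Stand-in for the task-setter model (Gemini Pro in the source)."""
    return rng.choice(["collect wood", "open the chest", "climb the ladder"])

def run_episode(task, rng):
    """Stand-in for the agent acting in the environment; returns a fake trajectory."""
    return {"task": task, "steps": rng.randint(5, 50)}

def score_trajectory(trajectory, rng):
    """Stand-in for the reward model; returns a random score on a 0-100 scale."""
    return rng.uniform(0.0, 100.0)

def self_improvement_round(num_tasks, threshold, seed=0):
    """Propose tasks, execute them, score them, and keep high-scoring trajectories."""
    rng = random.Random(seed)
    replay_buffer = []
    for _ in range(num_tasks):
        task = propose_task(rng)
        trajectory = run_episode(task, rng)
        score = score_trajectory(trajectory, rng)
        if score >= threshold:  # filtering rule is an assumption, not from the source
            replay_buffer.append((trajectory, score))
    return replay_buffer  # would feed the next fine-tuning pass

buffer = self_improvement_round(num_tasks=20, threshold=50.0)
```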

1.4 Training, Data, and Implementation

Datasets:
- Human-annotated trajectories (one-/two-person setups)
- Bridge data ($\sim$10k Gemini Pro-annotated examples with intertwined Reason/Say tags)
- RL tasks with programmatic/ground truth verification
Compute: Hundreds of A100-80GB GPUs, TPU pods for bridge data/reward scoring
Training Regime: Typical batch size 1024, learning rate $10^{-5}$, $\sim$100k SFT and 50k RL steps

1.5 Evaluation and Results

Environments

Training: Construction Lab, Playhouse, WorldLab, Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Space Engineers, Valheim, Wobbly Life
Generalization: ASKA, MineDojo (50 tasks × 15 seeds), The Gunk, Genie 3

Metrics

Success Rate (% of tasks completed within timeouts)
Task completion time
Reward-model score (0–100 for self-improvement)

Key Findings

Aspect	SIMA 2 result relative to baseline
Success Rate (train)	Doubles SIMA 1, within 5–10 pp of humans
Skill categories	Advances in all eight, especially interaction and object management; lags in combat
Generalization	$>$10 pp gain over SIMA 1 on ASKA/MineDojo; multi-step tasks in The Gunk; zero-shot navigation in Genie 3
Ablation	Gemini backbone alone: 3.2–7.0% success (vs. SIMA 2’s post-fine-tuning score)
Catastrophic Forgetting	<20% relative drop in coding, math, STEM-QA

1.6 Case Studies and Limitations

Case Study Highlights:

Embodied dialogue and internal chain-of-thought provide human-like transparency and iterative plan updates.
Complex multi-modal instruction handling: correct sequencing and conditional action.
Abstract task parsing (e.g., “do the opposite”) enabled by Gemini Pro guidance.

Limitations:

Reduced efficacy on extremely long-horizon, stochastic, or high-precision tasks (e.g., split-second combat).
Memory bottleneck at $\sim$4k tokens (no explicit episodic memory).
Coarser ultra-low-level motor control compared to humans due to discretized action interface.

1.7 Conclusions and Outlook

First demonstration of open-ended, self-improving embodied agents in open virtual domains.
The multimodal transformer approach—melding perception, reasoning, planning, action, dialogue—enables generalist task coverage.
Future work aims for transfer to real robots (“Gemini Robotics 1.5”), memory augmentation, scale-up to larger Gemini models, and Darwin-complete perpetual curricula in richer simulation worlds (Team et al., 4 Dec 2025).

2. SIMA 2: Dual-Stacked Intelligent Metasurfaces for ISAC

SIMA 2 also denotes a dual-metasurface architecture in bistatic ISAC, where both transmitter (TX) and receiver (RX) employ stacked intelligent metasurfaces (SIMs) for joint radar sensing and wireless communication enhancement (Ranasinghe et al., 29 Apr 2025).

2.1 Architecture and Parametrization

TX-SIM: $Q$ layers, $M$ meta-atoms per layer, inter-element spacing $d$ with spatial correlation governed by $R_{\rm TX}$.
RX-SIM: $\tilde Q$ layers, $\tilde M$ meta-atoms per layer, same spacing, correlation $R_{\rm RX}$.

The layer-wise response combines scattering and phase-shift terms. For TX,

$\Psi_q = \operatorname{diag}(e^{j\zeta^q_1}, \dots, e^{j\zeta^q_M})$
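To make the phase-shift matrix concrete, the sketch below builds the diagonal of $\Psi_q$ for one layer and applies it to a vector. The phase values and the choice $M = 4$ are illustrative assumptions; the scattering term between layers is not modeled here.

```python
import cmath
import math

def phase_shift_diagonal(zetas):
    """Diagonal entries e^{j*zeta_m} of Psi_q for one metasurface layer."""
    return [cmath.exp(1j * z) for z in zetas]

def apply_layer(diag_entries, x):
    """Multiply Psi_q by a vector x (element-wise, because Psi_q is diagonal)."""
    return [d * xi for d, xi in zip(diag_entries, x)]

zetas = [0.0, math.pi / 2, math.pi, -math.pi / 2]  # illustrative phases, M = 4
psi = phase_shift_diagonal(zetas)
out = apply_layer(psi, [1.0, 1.0, 1.0, 1.0])
```

Every entry has unit modulus, so a layer only rotates the phase of each element and never changes its amplitude.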

Overall, the SIM-induced channel is:

$h(t,\tau) = \mathbf{u}\, R_{\rm RX}^{1/2}\, \widetilde{H}(t,\tau)\, R_{\rm TX}^{1/2}\, \mathbf{v}$

where $\widetilde{H}(t,\tau)$ is the underlying doubly-dispersive matrix.

2.2 Min–Max Phase-Shift Optimization

A robust phase-shift assignment maximizes the minimum path gain among the $P$ propagation paths, which makes the min–max design formally a max–min problem:

$\max_{\Theta} J(\Theta), \quad J(\Theta) \equiv \min_p \mathcal O_p (\cdot)$

subject to $|\zeta^q_m| \le \pi$ for all phases.

Gradient Update

Closed-form gradients and normalized steepest ascent drive optimization:

$\nabla_{\boldsymbol{\zeta}_q}J = 2 \Im\{\Psi_q F_{q,\hat p} \Upsilon_{q,\hat p} \mathbf{v}\}$

Pseudocode (Abbreviated)

for i in range(i_GD):
    identify weakest path index
    compute gradients
    compute normalization factors
    update phases with decayed step size
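The loop above can be made runnable on a toy problem. The sketch below is not the source's channel model: it assumes a single layer and a hypothetical per-path gain $|\sum_m a_{p,m} e^{j\zeta_m}|^2$ with random coefficients $a_{p,m}$, and it additionally keeps the best iterate seen so far, which the abbreviated pseudocode does not mention. The parameters `step0` and `decay` are illustrative.

```python
import cmath
import math
import random

def path_gains(coeffs, zeta):
    """Toy gain per path: |sum_m a[p][m] * exp(j*zeta[m])|^2."""
    return [abs(sum(a * cmath.exp(1j * z) for a, z in zip(row, zeta))) ** 2
            for row in coeffs]

def max_min_ascent(coeffs, zeta0, i_gd=200, step0=0.5, decay=0.98):
    """Normalized steepest ascent on the weakest path, keeping the best iterate."""
    zeta = list(zeta0)
    best_zeta, best_val = list(zeta), min(path_gains(coeffs, zeta))
    for i in range(i_gd):
        gains = path_gains(coeffs, zeta)
        p_hat = min(range(len(gains)), key=gains.__getitem__)  # weakest path index
        g = sum(a * cmath.exp(1j * z) for a, z in zip(coeffs[p_hat], zeta))
        # d|g|^2 / d zeta_m = 2 * Im{ g * conj(a_m * exp(j*zeta_m)) }
        grad = [2.0 * (g * (a * cmath.exp(1j * z)).conjugate()).imag
                for a, z in zip(coeffs[p_hat], zeta)]
        norm = math.sqrt(sum(x * x for x in grad)) or 1.0  # normalization factor
        step = step0 * decay ** i                           # decayed step size
        zeta = [z + step * x / norm for z, x in zip(zeta, grad)]
        # Wrap back into [-pi, pi] to respect the phase constraint.
        zeta = [math.atan2(math.sin(z), math.cos(z)) for z in zeta]
        val = min(path_gains(coeffs, zeta))
        if val > best_val:
            best_zeta, best_val = list(zeta), val
    return best_zeta, best_val

rng = random.Random(0)
P, M = 3, 8  # illustrative numbers of paths and meta-atoms
coeffs = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M)]
          for _ in range(P)]
zeta0 = [0.0] * M
initial = min(path_gains(coeffs, zeta0))
zeta_opt, optimized = max_min_ascent(coeffs, zeta0)
```

Because the minimum over paths is not smooth, an individual step can lower the objective when the weakest path changes; returning the best iterate guarantees the result is never worse than the starting point.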

2.3 Radar Parameter Estimation: Compressed Sensing & PDA

After demodulation (AFDM/OFDM/OTFS), the measurement model is cast as:

$\mathbf{y} = E\mathbf{h} + \mathbf{n}$

Sparse recovery exploits a compressed sensing-based probabilistic data association (PDA) algorithm, estimating reflection delays and Dopplers.
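The sparse structure of $\mathbf{y} = E\mathbf{h} + \mathbf{n}$ can be illustrated with a simpler recovery method. The sketch below swaps in plain matching pursuit for the PDA-based estimator, which is not reproduced here, and assumes a hypothetical delay-only dictionary of unit-norm complex exponentials, noise-free measurements, and paths that lie exactly on the grid.

```python
import cmath
import math

def delay_dictionary(n_meas, n_grid):
    """Columns are unit-norm complex exponentials, one per candidate delay bin."""
    scale = 1.0 / math.sqrt(n_meas)
    return [[scale * cmath.exp(-2j * math.pi * n * k / n_grid) for k in range(n_grid)]
            for n in range(n_meas)]

def matching_pursuit(E, y, n_paths):
    """Greedy sparse recovery of h in y = E h; returns {grid index: coefficient}."""
    n_meas, n_grid = len(E), len(E[0])
    residual = list(y)
    h = {}
    for _ in range(n_paths):
        # Correlate the residual with every column: <e_k, r>.
        corr = [sum(E[n][k].conjugate() * residual[n] for n in range(n_meas))
                for k in range(n_grid)]
        k_best = max(range(n_grid), key=lambda k: abs(corr[k]))
        h[k_best] = h.get(k_best, 0) + corr[k_best]
        # Remove the selected column's contribution from the residual.
        residual = [residual[n] - corr[k_best] * E[n][k_best] for n in range(n_meas)]
    return h

N = 16
E = delay_dictionary(N, N)
h_true = {3: 1.0 + 0.5j, 11: -0.7j}  # two hypothetical paths on the delay grid
y = [sum(c * E[n][k] for k, c in h_true.items()) for n in range(N)]
h_est = matching_pursuit(E, y, n_paths=2)
```

With a square grid the columns are orthonormal, so two greedy steps recover both paths exactly; off-grid delays, Doppler shifts, and noise are what motivate the more elaborate estimator described in the text.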

2.4 ISAC-Enabling Waveform Design

The delay–Doppler (DD) behaviour of a waveform $s(t)$ is captured by its ambiguity function:

$A_s(\tau, \nu) = \int s(t)s^*(t-\tau) e^{-j2\pi\nu t}dt$
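A discrete counterpart of this integral can be evaluated directly. The sketch below assumes a sampled signal of length $N$, integer delay and Doppler indices, and a cyclic shift in place of $s(t-\tau)$; the chirp test signal is an illustrative choice, not a waveform from the source.

```python
import cmath
import math

def discrete_ambiguity(s, tau, nu):
    """A_s[tau, nu] = sum_n s[n] * conj(s[n - tau]) * exp(-j*2*pi*nu*n/N), cyclic in n."""
    N = len(s)
    return sum(s[n] * s[(n - tau) % N].conjugate()
               * cmath.exp(-2j * math.pi * nu * n / N)
               for n in range(N))

# Illustrative test signal: a unit-modulus discrete linear chirp of length 16.
N = 16
s = [cmath.exp(1j * math.pi * n * n / N) for n in range(N)]
peak = discrete_ambiguity(s, 0, 0)      # equals the signal energy
off_peak = discrete_ambiguity(s, 0, 3)  # zero delay, non-zero Doppler index
```

At the origin the function returns the signal energy; how quickly it decays away from the origin indicates how well the waveform separates targets in delay and Doppler.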

The SIMA 2 modifications boost all path gains via SIM-induced harmonization, mitigating doubly-dispersive fading.

2.5 Numerical Performance

Sensing (RPE) MSE—Table 1

Waveform	No SIM (baseline)	SIMA 2-optimized	Resolution limit
OFDM	1.0 m$^2$ @ 0 dB	0.10 m$^2$ (+10 dB)	0.01 m$^2$
OTFS	0.5 m$^2$	0.05 m$^2$ (+10 dB)	0.005 m$^2$
AFDM	0.4 m$^2$	0.04 m$^2$ (+10 dB)	0.004 m$^2$
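The +10 dB figures in the sensing table follow from the ratio of baseline to optimized MSE. A short check, using only the values listed in the table:

```python
import math

def mse_gain_db(baseline_mse, optimized_mse):
    """Improvement in dB when the MSE drops from baseline to optimized."""
    return 10.0 * math.log10(baseline_mse / optimized_mse)

# (baseline, optimized) MSE pairs in m^2, copied from the sensing table.
table = {"OFDM": (1.0, 0.10), "OTFS": (0.5, 0.05), "AFDM": (0.4, 0.04)}
gains = {name: mse_gain_db(b, o) for name, (b, o) in table.items()}
```

Each waveform shows a tenfold MSE reduction, which is 10 dB on a power-ratio scale.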

Communication BER—Table 2

Waveform	No SIM	SIMA 2-optimized	Gain
OFDM	$1\times10^{-2}$	$3\times10^{-3}$ (+5 dB)	0.7 dB
OTFS	$5\times10^{-3}$	$8\times10^{-4}$ (+7 dB)	2.5 dB
AFDM	$4\times10^{-3}$	$5\times10^{-4}$ (+8 dB)	3 dB

SIMA 2 improvements persist even for waveforms optimized for sensing, and OTFS/AFDM see higher gains due to inherent DD-domain robustness.

2.6 Significance and Outlook

The dual-metasurface SIMA 2 setup enables up to 10 dB improvement in radar parameter estimation MSE and 1–3 dB better BER than no-SIM baselines, across a wide range of ISAC waveforms in highly doubly-dispersive bistatic links. This demonstrates the versatility of parametrized intelligent metasurfaces for joint radar and communications in next-generation wireless systems (Ranasinghe et al., 29 Apr 2025).

References:

"SIMA 2: A Generalist Embodied Agent for Virtual Worlds" (Team et al., 4 Dec 2025)
"Parametrized Stacked Intelligent Metasurfaces for Bistatic Integrated Sensing and Communications" (Ranasinghe et al., 29 Apr 2025)
