
SIMA 2: Embodied Agents & Metasurfaces

Updated 5 December 2025
  • SIMA 2 names two separate lines of work: an embodied multimodal agent for diverse 3D virtual world tasks, and a dual stacked-metasurface architecture for integrated sensing and communication (ISAC).
  • The embodied agent utilizes large-scale foundation models, multimodal perception, and self-improvement strategies to execute long-horizon, interactive tasks.
  • The metasurface architecture deploys min–max phase-shift optimization and compressed sensing to significantly boost radar estimation and communication performance.

The name SIMA 2 refers to two distinct research efforts: (1) the SIMA 2 agent, a generalist embodied multimodal agent for virtual environments; and (2) SIMA 2 as an architecture of dual stacked intelligent metasurfaces in bistatic integrated sensing and communication (ISAC) systems. Both usages are described below, following the primary literature (Team et al., 4 Dec 2025, Ranasinghe et al., 29 Apr 2025).


1. SIMA 2: Generalist Embodied Agent for Virtual Worlds

SIMA 2 is a goal-directed, interactive agent architecture for active perception, multimodal reasoning, and long-horizon task execution situated within highly diverse 3D virtual worlds. It integrates large-scale foundation models—specifically, a variant of Gemini—into an agentic loop that spans perception, internal reasoning, dialogue, action planning, and skill self-improvement (Team et al., 4 Dec 2025).

1.1 Problem Scope and Motivations

SIMA 2 is built to unify instruction following, dialogue-based collaboration, and fine-grained embodied control across games, simulators, and photorealistic engine-based worlds. Unlike conventional LLMs and vision-language models (VLMs), SIMA 2 must perform precise motor actions (keyboard, mouse) based solely on visual observations and mixed-modal user queries. The agent’s objectives encompass:

  • Goal-driven action in worlds with widely varying dynamics, visuals, and affordances.
  • Reasoned planning, both implicit (internal “thought”) and explicit (user dialogue).
  • Robust generalization and self-driven skill acquisition in previously unseen domains.

The transition from SIMA 1, which handled only short, fixed natural language tasks and output direct action sequences without dialogue, to SIMA 2 reflects a significant leap in agentic and reasoning fluency, powered by integration with Gemini foundation models.

1.2 Core Architecture and Modalities

The agent backbone employs a Gemini Flash-Lite checkpoint (tens of billions of parameters) for tractable inference, with the option of hierarchical “steering” via larger Gemini Pro models (>100B parameters).

Agent–Environment Interface

  • Observations: 720p RGB video frames (tokenized), augmented by current user instructions (text/sketch/image).
  • Action Space: 96 discrete keyboard keys, mouse-click tokens, discretized $\Delta x, \Delta y$ positional movements.
  • Actuation: Actions are output as structured text “chunks” parsed into low-level commands for environment input.

Internal Stream and Output

Within each output token window, SIMA 2 generates flexible blocks:

  • <Reason> ... </Reason>: Internal chain-of-thought for self-monitoring and planning.
  • <Say> ... </Say>: Dialogue replies, clarifications, or status updates.
  • <Act> ... </Act>: Embodied, parseable commands for execution.
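The block structure above lends itself to a simple routing step between the model output and the environment. The following is a minimal sketch of such a parser, assuming the blocks are serialized as plain paired tags; the exact serialization, the `parse_blocks` and `route_blocks` helpers, and the action tokens in the sample string are all hypothetical and not taken from the source.

```python
import re

# Assumed tag grammar: the source names <Reason>, <Say>, and <Act> blocks but
# does not specify the exact serialization, so this pattern is illustrative.
BLOCK_PATTERN = re.compile(r"<(Reason|Say|Act)>(.*?)</\1>", re.DOTALL)


def parse_blocks(output_text):
    """Return (tag, content) pairs in the order they appear in the output."""
    return [(tag, body.strip()) for tag, body in BLOCK_PATTERN.findall(output_text)]


def route_blocks(blocks):
    """Split parsed blocks into internal reasoning, dialogue, and actions."""
    routed = {"Reason": [], "Say": [], "Act": []}
    for tag, body in blocks:
        routed[tag].append(body)
    return routed


# Hypothetical output window; the action tokens are made up for illustration.
sample = (
    "<Reason>The door is locked; I need the key first.</Reason>"
    "<Say>I'll grab the key from the table.</Say>"
    "<Act>W W MOUSE_DX_8 MOUSE_DY_0 CLICK_LEFT</Act>"
)
routed = route_blocks(parse_blocks(sample))
print(routed["Act"])
```

Only the `Act` entries would be forwarded to the environment in this sketch; `Reason` stays internal and `Say` goes to the user.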

Modular Composition

  • Perception: Multimodal visual/textual encoder for comprehensive embedding of observations/prompts.
  • Reasoning/Planning: Transformer backbone for generating reasoning traces and subgoal hierarchies.
  • Action Decoder: Mapping of planned actions to environment-controllable events.
  • Dialogue Interface: Conditional generation of interactive conversational or clarificatory text.

1.3 Learning Objectives and Self-Improvement

SIMA 2’s formalization employs the following constructs:

  • State Space: $\mathcal{S}$ ≈ history of observed video tokens, textual context, and generated reasoning/dialogue blocks.
  • Action Space: $\mathcal{A}$ is the discrete set of keyboard/mouse commands (text-encoded).

Learning Objectives

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(s,a^*)\in\mathcal{D}} \log \pi_\theta(a^* \mid s)$$

$$J(\theta) = \mathbb{E}_\tau\left[\mathbb{E}_{\pi_\theta}\left[\sum_t R_\tau(s_t, a_t)\right]\right]$$

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -J(\theta) + \lambda\, D_{\mathrm{KL}}\bigl(\pi_\theta \Vert \pi_{\mathrm{SFT}}\bigr)$$
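To make the three objectives concrete, here is a small worked example on a toy tabular policy. It is a sketch under stated assumptions: the two-state, two-action policy, the state and action names, and the choice to average the KL term over states are all illustrative, since the source does not say how the divergence is aggregated.

```python
import math


def sft_loss(policy, dataset):
    """Negative log-likelihood of expert actions: -sum log pi(a* | s)."""
    return -sum(math.log(policy[s][a_star]) for s, a_star in dataset)


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same actions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0.0)


def rl_loss(expected_return, policy, sft_policy, lam):
    """L_RL = -J + lambda * KL, with KL averaged over states (an assumption)."""
    kl = sum(kl_divergence(policy[s], sft_policy[s]) for s in policy) / len(policy)
    return -expected_return + lam * kl


# Toy policies: hypothetical states "s0"/"s1" and actions "W"/"CLICK".
sft_policy = {"s0": {"W": 0.5, "CLICK": 0.5}, "s1": {"W": 0.25, "CLICK": 0.75}}
rl_policy = {"s0": {"W": 0.9, "CLICK": 0.1}, "s1": {"W": 0.25, "CLICK": 0.75}}
demos = [("s0", "W"), ("s1", "CLICK")]

print(sft_loss(sft_policy, demos))
print(rl_loss(expected_return=2.0, policy=rl_policy, sft_policy=sft_policy, lam=0.1))
```

When the RL policy equals the SFT policy the KL term vanishes and the loss reduces to $-J(\theta)$; the penalty grows as the policy drifts away from the supervised starting point.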

Self-Improvement Procedures

  • Task proposals and open-ended reward assignment are provided by Gemini Pro, enabling curriculum-free skill acquisition via:
    • Autonomous task-setting
    • Self-execution and trajectory scoring
    • Replay-buffered fine-tuning with synthesized data
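The three steps above can be sketched as a loop. This is a toy illustration only: `propose_task`, `run_episode`, and `score_trajectory` are hypothetical stand-ins for the task proposer, the agent rollout, and the reward model, and the scalar `skill` replaces actual fine-tuning of model parameters. None of these names or numeric choices come from the source.

```python
import random


def propose_task(rng):
    """Stand-in for the task proposer (Gemini Pro in the source)."""
    return rng.choice(["open the chest", "climb the ladder", "plant a seed"])


def run_episode(task, skill, rng):
    """Stand-in for self-execution; returns a fake trajectory record."""
    return {"task": task, "steps": rng.randint(5, 50), "skill": skill}


def score_trajectory(trajectory, rng):
    """Stand-in for the reward model: a 0-100 score that rises with skill."""
    noisy = 100.0 * trajectory["skill"] + rng.uniform(-20.0, 20.0)
    return min(100.0, max(0.0, noisy))


def self_improve(rounds, episodes_per_round, keep_threshold=40.0, seed=0):
    rng = random.Random(seed)
    skill = 0.3  # toy scalar standing in for the policy parameters
    replay_buffer = []
    for _ in range(rounds):
        kept = 0
        for _ in range(episodes_per_round):
            trajectory = run_episode(propose_task(rng), skill, rng)
            score = score_trajectory(trajectory, rng)
            if score >= keep_threshold:  # keep only well-scored rollouts
                replay_buffer.append((trajectory, score))
                kept += 1
        # Placeholder for fine-tuning on the buffer: nudge the toy skill value.
        skill = min(1.0, skill + 0.02 * kept)
    return skill, replay_buffer


final_skill, buffer = self_improve(rounds=5, episodes_per_round=20)
print(round(final_skill, 2), len(buffer))
```

The point of the sketch is the data flow: tasks and scores are generated without human labels, and only filtered trajectories feed the next round of training.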

1.4 Training, Data, and Implementation

  • Datasets:
    • Human-annotated trajectories (one-/two-person setups)
    • Bridge data (~10k Gemini Pro-annotated examples with intertwined Reason/Say tags)
    • RL tasks with programmatic/ground truth verification
  • Compute: Hundreds of A100-80GB GPUs, TPU pods for bridge data/reward scoring
  • Training Regime: Typical batch size 1024, learning rate 1e−5, ~100k SFT and 50k RL steps

1.5 Evaluation and Results

Environments

  • Training: Construction Lab, Playhouse, WorldLab, Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Space Engineers, Valheim, Wobbly Life
  • Generalization: ASKA, MineDojo (50 tasks × 15 seeds), The Gunk, Genie 3

Metrics

  • Success Rate (% of tasks completed within timeouts)
  • Task completion time
  • Reward-model score (0–100 for self-improvement)

Key Findings

| Aspect | SIMA 2 outperformance over baseline |
| --- | --- |
| Success rate (train) | Doubles SIMA 1; within 5–10 pp of humans |
| Skill categories | Advances in all eight, especially interaction and object management; lags in combat |
| Generalization | >10 pp gain over SIMA 1 on ASKA/MineDojo; multi-step tasks in The Gunk; zero-shot navigation in Genie 3 |
| Ablation | Gemini backbone alone: 3.2–7.0% success (vs. SIMA 2’s post-fine-tuning score) |
| Catastrophic forgetting | <20% relative drop in coding, math, STEM-QA |

1.6 Case Studies and Limitations

Case Study Highlights:

  • Embodied dialogue and internal chain-of-thought provide human-like transparency and iterative plan updates.
  • Complex multi-modal instruction handling: correct sequencing and conditional action.
  • Abstract task parsing (e.g., “do the opposite”) enabled by Gemini Pro guidance.

Limitations:

  • Reduced efficacy on extremely long-horizon, stochastic, or high-precision tasks (e.g., split-second combat).
  • Memory bottleneck at ~4k tokens (no explicit episodic memory).
  • Coarser ultra-low-level motor control compared to humans due to discretized action interface.

1.7 Conclusions and Outlook

  • First demonstration of open-ended, self-improving embodied agents in open virtual domains.
  • The multimodal transformer approach—melding perception, reasoning, planning, action, dialogue—enables generalist task coverage.
  • Future work aims for transfer to real robots (“Gemini Robotics 1.5”), memory augmentation, scale-up to larger Gemini models, and Darwin-complete perpetual curricula in richer simulation worlds (Team et al., 4 Dec 2025).

2. SIMA 2: Dual-Stacked Intelligent Metasurfaces for ISAC

SIMA 2 also denotes a dual-metasurface architecture in bistatic ISAC, where both transmitter (TX) and receiver (RX) employ stacked intelligent metasurfaces (SIMs) for joint radar sensing and wireless communication enhancement (Ranasinghe et al., 29 Apr 2025).

2.1 Architecture and Parametrization

  • TX-SIM: $Q$ layers, $M$ meta-atoms per layer, inter-element spacing $d$ with spatial correlation governed by $R_{\rm TX}$.
  • RX-SIM: $\tilde Q$ layers, $\tilde M$ meta-atoms per layer, same spacing, correlation $R_{\rm RX}$.

The layer-wise response combines scattering and phase-shift terms. For TX,

$$\Psi_q = \operatorname{diag}\bigl(e^{j\zeta^q_1}, \dots, e^{j\zeta^q_M}\bigr)$$

Overall, the SIM-induced channel is:

$$h(t,\tau) = \mathbf{u}\, R_{\rm RX}^{1/2}\, \widetilde{H}(t,\tau)\, R_{\rm TX}^{1/2}\, \mathbf{v}$$

where $\widetilde{H}(t,\tau)$ is the underlying doubly-dispersive matrix.
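As a numerical illustration of the two expressions above, the sketch below builds the diagonal of a phase-shift matrix and evaluates the scalar channel for one fixed $(t,\tau)$. It assumes diagonal correlation matrices so that the matrix square root is elementwise, and it treats $\mathbf{u}$ and $\mathbf{v}$ as given vectors; both simplifications and all numeric values are illustrative and not taken from the source.

```python
import cmath
import math


def phase_shift_diagonal(phases):
    """Diagonal entries of Psi_q = diag(e^{j zeta_1}, ..., e^{j zeta_M})."""
    return [cmath.exp(1j * zeta) for zeta in phases]


def mat_vec(matrix, vector):
    """Plain matrix-vector product for nested lists of complex numbers."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def effective_channel(u, r_rx_diag, h_tilde, r_tx_diag, v):
    """h = u R_RX^{1/2} H~ R_TX^{1/2} v, assuming diagonal R_RX and R_TX."""
    tx_side = [math.sqrt(r) * x for r, x in zip(r_tx_diag, v)]
    through = mat_vec(h_tilde, tx_side)
    rx_side = [math.sqrt(r) * y for r, y in zip(r_rx_diag, through)]
    return sum(a * b for a, b in zip(u, rx_side))


# Toy 2x2 example with an identity channel snapshot (values are made up).
h_tilde = [[1.0, 0.0], [0.0, 1.0]]
u = [1.0, 1.0]
v = phase_shift_diagonal([0.0, math.pi / 2])  # unit-modulus entries 1 and j
print(effective_channel(u, [1.0, 1.0], h_tilde, [1.0, 1.0], v))
print(effective_channel(u, [1.0, 1.0], h_tilde, [4.0, 1.0], v))
```

With a full (non-diagonal) correlation matrix, the elementwise square root would have to be replaced by a proper matrix square root.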

2.2 Min–Max Phase-Shift Optimization

A robust phase-shift assignment maximizes the minimum path gain among $P$ propagation paths (min–max problem):

$$\max_{\Theta} J(\Theta), \quad J(\Theta) \equiv \min_p \mathcal{O}_p(\cdot)$$

subject to $|\zeta^q_m| \le \pi$ for all phases.

Gradient Update

Closed-form gradients and normalized steepest ascent drive optimization:

$$\nabla_{\bm\zeta_q} J = 2\, \Im\{\Psi_q F_{q,\hat p} \Upsilon_{q,\hat p} \mathbf{v}\}$$

Pseudocode (Abbreviated)

for i in range(i_GD):
    identify weakest path index
    compute gradients
    compute normalization factors
    update phases with decayed step size
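A runnable toy version of this loop is sketched below. It is not the source's algorithm: the path-gain model $\mathcal{O}_p(\bm\zeta) = |\sum_m a_{p,m} e^{j\zeta_m}|^2$ with a single phase layer, the coefficients in `paths`, and the step-size settings are all assumptions chosen so the example stays self-contained. What it does retain is the structure of the pseudocode: pick the weakest path, take its gradient, normalize, and step with a decaying step size while keeping $|\zeta_m| \le \pi$.

```python
import cmath
import math


def path_gains(phases, paths):
    """Toy gain per path: |sum_m a_{p,m} e^{j zeta_m}|^2 (assumed model)."""
    return [
        abs(sum(a * cmath.exp(1j * z) for a, z in zip(path, phases))) ** 2
        for path in paths
    ]


def gain_gradient(phases, path):
    """Closed-form gradient of the toy gain with respect to each phase."""
    total = sum(a * cmath.exp(1j * z) for a, z in zip(path, phases))
    return [
        -2.0 * (total.conjugate() * a * cmath.exp(1j * z)).imag
        for a, z in zip(path, phases)
    ]


def wrap(angle):
    """Wrap an angle into [-pi, pi] to respect the phase constraint."""
    return math.atan2(math.sin(angle), math.cos(angle))


def optimize_phases(paths, phases, iterations=200, step=0.5, decay=0.99):
    best_phases, best_value = list(phases), min(path_gains(phases, paths))
    for _ in range(iterations):
        gains = path_gains(phases, paths)
        weakest = gains.index(min(gains))          # identify weakest path index
        grad = gain_gradient(phases, paths[weakest])
        norm = math.sqrt(sum(g * g for g in grad))  # normalization factor
        if norm == 0.0:
            break
        phases = [wrap(z + step * g / norm) for z, g in zip(phases, grad)]
        step *= decay                               # decayed step size
        value = min(path_gains(phases, paths))
        if value > best_value:                      # remember the best iterate
            best_phases, best_value = list(phases), value
    return best_phases, best_value


paths = [[1, 1j, -1], [1, -1, 1j]]  # made-up path coefficients
start = [0.0, 0.0, 0.0]
phases, worst_gain = optimize_phases(paths, start)
print(min(path_gains(start, paths)), worst_gain)
```

Because the objective is a minimum over paths, it is not smooth where two paths tie; the sketch therefore tracks the best iterate seen instead of assuming monotone progress.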

2.3 Radar Parameter Estimation: Compressed Sensing & PDA

After demodulation (AFDM/OFDM/OTFS), the measurement model is cast as:

$$\mathbf{y} = E\mathbf{h} + \mathbf{n}$$

Sparse recovery exploits a compressed sensing-based probabilistic data association (PDA) algorithm, estimating reflection delays and Dopplers.
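To illustrate what sparse recovery from $\mathbf{y} = E\mathbf{h} + \mathbf{n}$ looks like in practice, the sketch below uses plain matching pursuit. This is a deliberately simpler, swapped-in technique and not the PDA-based estimator of the source. The unit-norm DFT columns standing in for $E$, the noiseless measurement, and the helper names are assumptions; in an actual radar setting each column index would map to a delay–Doppler grid point.

```python
import cmath


def dft_dictionary(size):
    """Unit-norm DFT columns, used here as a stand-in for the matrix E."""
    scale = 1.0 / size ** 0.5
    return [
        [scale * cmath.exp(2j * cmath.pi * k * n / size) for n in range(size)]
        for k in range(size)
    ]


def synthesize(columns, coefficients):
    """Compute y = E h from the columns of E and the coefficient vector h."""
    length = len(columns[0])
    return [
        sum(h * col[n] for h, col in zip(coefficients, columns))
        for n in range(length)
    ]


def matching_pursuit(columns, y, sparsity):
    """Greedy sparse estimate of h; assumes every column has unit norm."""
    residual = list(y)
    estimate = [0j] * len(columns)
    for _ in range(sparsity):
        # Correlate the residual with each column: <e_k, r>.
        correlations = [
            sum(e.conjugate() * r for e, r in zip(col, residual))
            for col in columns
        ]
        k = max(range(len(columns)), key=lambda i: abs(correlations[i]))
        estimate[k] += correlations[k]
        # Remove the selected column's contribution from the residual.
        residual = [r - correlations[k] * e for r, e in zip(residual, columns[k])]
    return estimate


columns = dft_dictionary(4)
true_h = [0, 2, 0, -1j]  # two active "paths" out of four grid points
y = synthesize(columns, true_h)
print(matching_pursuit(columns, y, sparsity=2))
```

With an orthonormal dictionary and no noise, the two nonzero coefficients are recovered exactly; overcomplete dictionaries and noisy measurements are where more careful estimators become necessary.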

2.4 ISAC-Enabling Waveform Design

In the delay–Doppler (DD) domain, a waveform $s(t)$ is characterized by its ambiguity function:

$$A_s(\tau, \nu) = \int s(t)\, s^*(t-\tau)\, e^{-j2\pi\nu t}\, dt$$
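A discrete counterpart of this integral can be evaluated directly for a sampled waveform. The sketch below assumes a cyclic (periodic) delay and integer Doppler bins on an $N$-point grid, which is one common discretization and not necessarily the one used in the source; the test signal is a made-up complex exponential.

```python
import cmath


def ambiguity(signal, delay, doppler):
    """Discrete cyclic ambiguity value A[delay, doppler] of a sampled signal."""
    n_samples = len(signal)
    return sum(
        signal[n]
        * signal[(n - delay) % n_samples].conjugate()
        * cmath.exp(-2j * cmath.pi * doppler * n / n_samples)
        for n in range(n_samples)
    )


# One period of a complex exponential: s[n] = j^n.
signal = [1, 1j, -1, -1j]
print(ambiguity(signal, 0, 0))  # signal energy at zero delay and Doppler
print(ambiguity(signal, 1, 0))
print(abs(ambiguity(signal, 0, 1)))
```

The value at zero delay and zero Doppler equals the signal energy, and it upper-bounds the magnitude everywhere else on the grid.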

The SIMA 2 modifications boost all path gains via SIM-induced harmonization, mitigating doubly-dispersive fading.

2.5 Numerical Performance

Sensing (RPE) MSE—Table 1

| Waveform | No SIM (baseline) | SIMA 2-Optimized | Res. limit |
| --- | --- | --- | --- |
| OFDM | 1.0 m² @ 0 dB | 0.10 m² (+10 dB) | 0.01 m² |
| OTFS | 0.5 m² | 0.05 m² (+10 dB) | 0.005 m² |
| AFDM | 0.4 m² | 0.04 m² (+10 dB) | 0.004 m² |
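The "+10 dB" annotations in Table 1 follow from the ratio of baseline to optimized MSE. A short check, using only the values listed in the table (the helper name `improvement_db` is illustrative):

```python
import math


def improvement_db(baseline_mse, optimized_mse):
    """MSE improvement in decibels: 10 * log10(baseline / optimized)."""
    return 10.0 * math.log10(baseline_mse / optimized_mse)


# (baseline MSE, optimized MSE) in m^2, copied from Table 1.
table_1 = {"OFDM": (1.0, 0.10), "OTFS": (0.5, 0.05), "AFDM": (0.4, 0.04)}
for waveform, (baseline, optimized) in table_1.items():
    print(waveform, round(improvement_db(baseline, optimized), 2))
```

Each waveform shows a tenfold MSE reduction, which is 10 dB on this scale.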

Communication BER—Table 2

| Waveform | No SIM | SIMA 2-Optimized | Gain |
| --- | --- | --- | --- |
| OFDM | $1\times10^{-2}$ | $3\times10^{-3}$ (+5 dB) | 0.7 dB |
| OTFS | $5\times10^{-3}$ | $8\times10^{-4}$ (+7 dB) | 2.5 dB |
| AFDM | $4\times10^{-3}$ | $5\times10^{-4}$ (+8 dB) | 3 dB |

SIMA 2 improvements persist even for waveforms optimized for sensing, and OTFS/AFDM see higher gains due to inherent DD-domain robustness.

2.6 Significance and Outlook

The dual-metasurface SIMA 2 setup enables up to 10 dB improvement in radar parameter estimation MSE and 1–3 dB better BER than no-SIM baselines, across a wide range of ISAC waveforms in highly doubly-dispersive bistatic links. This demonstrates the versatility of parametrized intelligent metasurfaces for joint radar and communications in next-generation wireless systems (Ranasinghe et al., 29 Apr 2025).

