MiMo-Embodied: Unified Multimodal AI

Updated 23 November 2025
  • MiMo-Embodied is a class of foundation models integrating multimodal sensors with internal state representations to model both physical and cognitive processes.
  • It employs unified architectures that combine vision-language models and recurrent internal modules to achieve superior performance in robotics, autonomous driving, and developmental simulations.
  • MiMo-Embodied systems utilize dual embodiment by merging external affordance sensing with internal homeostatic regulation for robust and adaptable behavior.

MiMo-Embodied refers to a class of foundation models and simulation platforms designed to unify multimodal perception, language, and physical embodiment for both AI research and computational modeling of development. These systems integrate visual, linguistic, proprioceptive, tactile, and interoceptive inputs to enable agents to engage with the physical and social world, perform complex tasks, and support both engineering and scientific inquiry into embodied cognition. MiMo-Embodied frameworks span practical robotics, autonomous vehicles, cognitive development modeling, and computational neuroscience, offering architectures and benchmarks that prioritize not only external world interaction but also persistent self-modeling and homeostasis.

1. Core Principles: Dual Embodiment and Sensor Integration

MiMo-Embodied systems are distinguished by formalizing "dual embodiment": they jointly model both external and internal embodiment (Kadambi et al., 11 Oct 2025). External embodiment encompasses sensorimotor loops involving exteroceptive channels (vision, audio, touch, proprioception) and actuators (joints, effectors). Internal embodiment requires persistent modeling of internal states—such as energy, temperature, or drive—including homeostatic regulators and predictive forward models to maintain viability and anticipate sensory consequences.
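
A minimal sketch of this dual loop, assuming small learned modules for the homeostatic regulator and forward model (the module choices, dimensions, and setpoint below are illustrative assumptions, not the published implementation):

```python
import torch
import torch.nn as nn

class InternalStateModule(nn.Module):
    """Toy dual-embodiment core: f_HOM updates a persistent internal state
    from interoceptive feedback, f_FWD predicts the next observation."""

    def __init__(self, obs_dim=64, act_dim=8, intero_dim=4, state_dim=32):
        super().__init__()
        self.f_hom = nn.GRUCell(intero_dim, state_dim)   # homeostatic regulator
        self.f_fwd = nn.Sequential(                      # predictive forward model
            nn.Linear(state_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
        )
        # Hypothetical viability setpoint (e.g., target energy / temperature)
        self.register_buffer("setpoint", torch.zeros(state_dim))

    def forward(self, s_t, intero_t, action_t):
        s_next = self.f_hom(intero_t, s_t)                        # internal state update
        obs_pred = self.f_fwd(torch.cat([s_next, action_t], -1))  # anticipated sensory consequence
        drive = (s_next - self.setpoint).pow(2).mean(-1)          # deviation from setpoint
        return s_next, obs_pred, drive
```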

Architecturally, MiMo-Embodied systems incorporate specialized encoders for each sensory stream: exteroceptive encoders for vision, audio, touch, and proprioception, together with an interoceptive encoder for internal signals such as energy and temperature.

The integration of these sensor streams enables representation learning that is sensitive to both affordances in the environment and the agent's own ongoing physiological and motivational state.

2. Model Architectures and Training Pipelines

MiMo-Embodied models adopt unified vision-language architectures extended for embodiment:

  • Unified Vision-Language Model (VLM): A backbone ViT (e.g., MiMo-VL) feeding into a shallow MLP projector, with outputs concatenated with word embeddings and processed by a decoder-only LLM with cross-attention (Hao et al., 20 Nov 2025); a schematic sketch follows this list.
  • Internal State Module: A recurrent internal state vector $s_t \in \mathbb{R}^d$ maintained by a homeostatic regulator $f_{\text{HOM}}$ and a predictive forward model $f_{\text{FWD}}$, both trained to minimize deviations from real interoceptive feedback and to predict the agent's future internal state (Kadambi et al., 11 Oct 2025).
  • Cross-Domain Architecture: There is no explicit architectural branching for separate domains; the same parameter set is co-trained on diverse embodied AI and autonomous driving data, exploiting inductive transfer (Hao et al., 20 Nov 2025).
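
A schematic of the unified VLM path from the first bullet, written against generic ViT and decoder-only LM interfaces (the projector shape, argument names, and `inputs_embeds` convention are assumptions for illustration, not the MiMo-VL code):

```python
import torch
import torch.nn as nn

class VisionLanguageBackbone(nn.Module):
    """Sketch: ViT patch features -> shallow MLP projector -> concatenated
    with word embeddings -> decoder-only language model."""

    def __init__(self, vit, llm, word_embed, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vit = vit                    # any ViT returning (B, N_patches, vit_dim)
        self.llm = llm                    # any decoder-only LM accepting inputs_embeds
        self.word_embed = word_embed      # the LM's token embedding table
        self.projector = nn.Sequential(   # shallow MLP projector
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, images, input_ids):
        vis_tokens = self.projector(self.vit(images))       # (B, N, llm_dim)
        txt_tokens = self.word_embed(input_ids)              # (B, T, llm_dim)
        fused = torch.cat([vis_tokens, txt_tokens], dim=1)   # visual tokens prepended
        return self.llm(inputs_embeds=fused)
```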

A characteristic MiMo-Embodied training pipeline employs staged, multi-domain, and multi-task learning:

  1. Supervised Fine-Tuning on Embodied AI: Optimization of cross-entropy over diverse spatial, affordance, and planning targets.
  2. Supervised Fine-Tuning on Autonomous Driving: Extension to trajectory, perception, and status-prediction tasks.
  3. Chain-of-Thought (CoT) Supervised Learning: Explicit reasoning trace supervision to decompose tasks into multi-step, interpretable logic. Example target: "Step 1: observe…; Step 2: infer…; Final Answer…".
  4. Reinforcement Learning: Application of Group Relative Policy Optimization (GRPO), optimizing expected returns for task and grounding objectives, using group-normalized advantage.
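
The group-normalized advantage used in GRPO can be sketched as follows (group size, reward scale, and the surrogate form are illustrative; this is not the paper's training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for a group of
    responses sampled per prompt; returns group-normalized advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_surrogate(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Policy-gradient surrogate: maximize E[log pi(a|s) * A] per group."""
    advantages = grpo_advantages(rewards).detach()
    return -(logprobs * advantages).mean()
```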

Objective functions are combined at each stage:

$$L = \alpha_s L_1 + \beta_s L_2 + \gamma_s L_3 - \delta_s\, \mathbb{E}[\log \pi \cdot A]$$

where each $L_i$ represents a stage-specific loss (Hao et al., 20 Nov 2025). Homeostatic and predictive losses are explicitly included as $L_{\text{int}} = \| \hat{s}_{t+1} - i_{t+1} \|^2$ and $L_{\text{fwd}} = \| \hat{o}_{t+1} - o_{t+1} \|^2$ (Kadambi et al., 11 Oct 2025).
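
Putting the stage losses together, a minimal sketch of the combined objective with the interoceptive and forward-model terms added as in the formulas above (the weights and the exact way these terms are mixed are assumptions for illustration):

```python
import torch

def combined_loss(l1, l2, l3, logprobs, advantages, s_pred, i_next, o_pred, o_next,
                  alpha=1.0, beta=1.0, gamma=0.5, delta=0.1):
    """Stage-weighted supervised/CoT/RL terms plus internal-state and
    forward-model prediction losses (illustrative weighting)."""
    l_rl = -(logprobs * advantages).mean()    # -E[log pi * A]
    l_int = (s_pred - i_next).pow(2).mean()   # || s_hat_{t+1} - i_{t+1} ||^2
    l_fwd = (o_pred - o_next).pow(2).mean()   # || o_hat_{t+1} - o_{t+1} ||^2
    return alpha * l1 + beta * l2 + gamma * l3 + delta * l_rl + l_int + l_fwd
```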

Region-centric video architectures (e.g., RynnEC) offer a complementary paradigm, with flexible mask decoders on top of frame-level ViT backbones, explicit spatio-temporal tracking losses, and a curriculum that transitions from mask alignment through object understanding to spatial QA and referring segmentation (Dang et al., 19 Aug 2025).

3. Benchmarks, Empirical Evaluation, and Positive Transfer

MiMo-Embodied systems are empirically evaluated across a broad sweep of embodied AI and real-world interaction tasks:

  • Embodied AI Benchmarks: Task planning (e.g., EgoPlan2, RoboVQA), affordance prediction (e.g., RoboAfford, VABench-Point), and spatial reasoning (e.g., CV-Bench, MetaVQA-VQA) (Hao et al., 20 Nov 2025).
  • Autonomous Driving Benchmarks: Environmental perception (DriveLM, nuScenes-QA), status prediction (DriveLM interaction), and planning (DriveAction, NuInstruct) (Hao et al., 20 Nov 2025).
  • Region-Level Cognition Benchmarks: RynnEC-Bench measures object property understanding, segmentation, and spatial cognition with region-focused metrics (IoU, MRA, RoA, text QA scores) (Dang et al., 19 Aug 2025).

Empirical results show state-of-the-art (SOTA) performance across embodied AI (e.g., RoboRefIt: 82.3%) and autonomous driving (e.g., nuScenes-QA: 56.71%) (Hao et al., 20 Nov 2025). Ablation studies demonstrate strong positive transfer: embodied fine-tuning enhances spatial-temporal reasoning for driving; driving tasks sharpen risk assessment for robotics. Chain-of-Thought and RL yield additive improvements (CoT: +3–4pts; RL: +2–3pts) in long-horizon planning and spatial accuracy.

A summary of benchmark performance for MiMo-Embodied alongside the second-best reported result:

Benchmark             MiMo-Embodied   2nd-Best
RoboRefIt             82.30           80.42
DriveLM (prospect)    57.85           58.10
nuScenes-QA           56.71           53.40
VABench-Point         46.93           41.50

On region-centric video benchmarks, RynnEC-7B achieves 56.2 overall on RynnEC-Bench, outperforming Gemini-2.5 Pro by a substantial margin in spatial and object properties subdomains (Dang et al., 19 Aug 2025).

4. Foundational Simulation Platforms: Infant Models and Agent Realism

The MIMo (Multi-Modal Infant Model) simulation platform embodies the MiMo-Embodied philosophy for computational neuroscience and developmental robotics (López et al., 11 Sep 2025, Mattern et al., 2023). MIMo and its successor, MIMo-v2, combine

  • Realistic body growth using anthropometric datasets and sigmoidal/logarithmic growth curves
  • Muscle actuation (Hill-type or MuJoCo spring-damper), with strength dynamically scaling by segment volume
  • Foveated vision with age-dependent visual acuity and log-polar cortical magnification implemented via CSF-based frequency filtering
  • Sensorimotor delays modeled as conduction time buffers with empirically grounded velocity maturation

Inverse kinematics supports operational-space tasks with up to 88 degrees of freedom, and scene randomization (room geometry, toys, textures) promotes policy robustness (López et al., 11 Sep 2025). The simulation is implemented in MuJoCo and compatible with standard Gymnasium APIs and third-party engines, supporting high-fidelity developmental experiments such as reaching, grasping, posture control, and self-body localization.
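
A minimal Gymnasium-style interaction loop with a conduction-delay buffer on observations, in the spirit of the sensorimotor delays above (the environment id, delay length, and random policy are placeholders, not the actual MIMo registration or training code):

```python
from collections import deque
import gymnasium as gym

DELAY_STEPS = 3                                 # illustrative conduction delay
env = gym.make("MIMoReach-v0")                  # hypothetical env id for illustration
buffer = deque(maxlen=DELAY_STEPS + 1)

obs, info = env.reset(seed=0)
for t in range(1000):
    buffer.append(obs)
    delayed_obs = buffer[0]                     # once full: observation DELAY_STEPS old
    action = env.action_space.sample()          # a learned policy would act on delayed_obs
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
        buffer.clear()
```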

5. Application Domains: Robotics, Driving, Cognition, and 3D Reconstruction

MiMo-Embodied models are deployed in both robotic and cognitive contexts:

  • Robotics & Manipulation: MiMo-Embodied models control physical or simulated robots (e.g., ALOHA-2, Apollo humanoid, UR5e) using end-to-end policy learning with multimodal sensory feedback (Kadambi et al., 11 Oct 2025, Hao et al., 20 Nov 2025, Qi et al., 24 Sep 2024). They perform tasks such as object manipulation, cup drinking, and maintaining homeostasis (e.g., battery >50%).
  • Autonomous Driving: The same model architectures achieve strong performance on driving perception, decision, and planning tasks, exploiting shared spatial reasoning and real-world grounding (Hao et al., 20 Nov 2025).
  • Cognitive Development Modeling: MIMo-v2 enables in silico studies of infant motor and sensory learning, capturing body growth, acuity development, and sensorimotor delays, providing a research substrate for developmental neuroscience (López et al., 11 Sep 2025, Mattern et al., 2023).
  • Active 3D Reconstruction: AIR-Embodied integrates multimodal prompting and closed-loop reasoning with MLLMs to maximize scene reconstruction quality, jointly optimizing NBV (next best view), manipulation, and perception via task- and geometry-informed losses (Qi et al., 24 Sep 2024).

A distinguishing feature is that MiMo-Embodied agents utilize both internal homeostatic drives and exteroceptive affordances to plan, act, and adapt in complex, time-varying environments.
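
As a concrete illustration of drive regulation (using the battery example from above; the setpoint and weighting are assumptions, not published values), a reward can blend task progress with a homeostatic penalty:

```python
def drive_regulated_reward(task_reward: float, battery_level: float,
                           setpoint: float = 0.5, weight: float = 1.0) -> float:
    """Penalize letting the battery fall below its viability setpoint."""
    homeostatic_penalty = max(0.0, setpoint - battery_level) ** 2
    return task_reward - weight * homeostatic_penalty

# Example: a step worth 1.0 with the battery at 30% is discounted, nudging
# the agent toward recharging behavior.
print(drive_regulated_reward(1.0, 0.3))   # 0.96
```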

6. Computational Mechanisms: Attention, Memory, and Loss Functions

MiMo-Embodied frameworks extend transformer architectures with cross-modal attention and explicit memory structures:

  • Cross-modal Attention: Transformers ingest concatenated embeddings from exteroceptive and interoceptive encoders. Each layer attends over both external sensory tokens and the internal state embedding, enabling the agent to align goals, world context, and internal needs (Kadambi et al., 11 Oct 2025).
  • Long-term Episodic Memory: Stateful memory slots allow for the retention and usage of episodic experience across long time horizons.
  • Loss Design: Training objectives combine task-specific RL loss, internal state prediction loss, and forward-model prediction of future sensory outcomes. For region-centric approaches, mask decoder objectives combine cross-entropy and IoU-based losses; spatio-temporal consistency is encouraged by region embedding tracking losses.
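
For the region-centric mask objectives, a hedged sketch combining per-pixel cross-entropy with a soft-IoU term (the exact weighting and loss variants used in RynnEC are not reproduced here):

```python
import torch
import torch.nn.functional as F

def mask_loss(logits: torch.Tensor, target: torch.Tensor,
              iou_weight: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """logits, target: (B, H, W); target is a binary ground-truth mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) - inter
    soft_iou = (inter + eps) / (union + eps)
    return bce + iou_weight * (1.0 - soft_iou).mean()
```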

Pseudocode for an embodied agent interleaves encoding, internal state update, transformer forward pass, action generation, environmental feedback, state and observation prediction, and loss computation with joint gradient updates (Kadambi et al., 11 Oct 2025).
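
A hedged Python rendering of that interleaved loop, written against generic encoder, policy, internal-state, and environment interfaces (all names and signatures here are placeholders, not the published pseudocode):

```python
import torch

def train_step(env, encoder, policy, internal, optimizer, s_t, obs, intero):
    """One step: encode, update internal state, act, observe, predict,
    then take a joint gradient step over all losses (illustrative sketch)."""
    tokens = encoder(obs, intero)                            # multimodal encoding
    action, logprob = policy(tokens, s_t)                    # transformer forward + action
    next_obs, reward, next_intero = env.step(action)         # environmental feedback
    s_next, intero_pred, obs_pred = internal(s_t, intero, action)  # f_HOM / f_FWD

    l_int = (intero_pred - torch.as_tensor(next_intero, dtype=torch.float32)).pow(2).mean()
    l_fwd = (obs_pred - torch.as_tensor(next_obs, dtype=torch.float32)).pow(2).mean()
    l_task = -logprob * reward                               # simple policy-gradient term
    loss = l_task + l_int + l_fwd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return s_next.detach(), next_obs, next_intero
```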

7. Outlook and Significance

MiMo-Embodied advances the field of foundation models by integrating multimodal sensory streams, reasoning, memory, and homeostasis in a unified, scalable architecture. Empirical evidence across domains suggests strong positive transfer, SOTA performance, and robustness to real-world complexities. Rigorous simulation platforms such as MIMo-v2 establish groundwork for computationally tractable, physically plausible developmental models. A noteworthy implication is that by embedding both internal drives and world affordances, MiMo-Embodied systems bridge statistical representation learning with physically grounded, drive-regulated cognition—key for the next generation of both AI agents and computational neuroscience frameworks (Kadambi et al., 11 Oct 2025, Hao et al., 20 Nov 2025, López et al., 11 Sep 2025).
