Robot Foundation Models Overview
- Robot Foundation Models (RFMs) are large transformer-based neural architectures that integrate perception, language, and action for versatile robotic control.
- RFMs enable zero- or few-shot generalization by collapsing traditional robotics pipelines into scalable, data-driven frameworks for tasks like manipulation and navigation.
- Architectural variants such as monolithic, modular, and multi-agent designs provide trade-offs in integration, safety, and performance for real-world robotic applications.
Robot Foundation Models (RFMs) are large-scale, multimodal neural architectures, typically transformer-based, that unify perception, language understanding, and action planning for embodied robots. Unlike classical robot software, where perception, planning, and control are realized as separate, manually engineered modules, RFMs seek to provide a generalist backbone: they ingest raw sensor data (RGB, depth, point clouds, tactile) and high-level instructions (natural language, speech), and emit sequences of control actions or calls to reusable skill libraries. The prevailing motivation is that a single pre-trained, large-capacity model can accelerate adaptation, enable zero- or few-shot generalization, and collapse the traditional robotic pipeline into a scalable, data-driven paradigm. RFMs have rapidly become central to the pursuit of general-purpose, language-driven robotic autonomy, with demonstrated progress in manipulation, navigation, high-level reasoning, and human–robot collaboration (Sun et al., 30 Nov 2025, Sartor et al., 2024, Khan et al., 14 Jul 2025, Xiao et al., 2023).
1. Definition, Role, and RFM Assumptions
RFMs are defined as large models pre-trained on diverse robot-centric data—trajectories, demonstrations, images, language, and occasionally touch, force, or audio streams. Their purpose is to serve as a generalist substrate that can be adapted or prompted for new robots, environments, and tasks, outperforming the rigid, task-specific training pipelines of traditional robotics (Hu et al., 2023, Sartor et al., 2024).
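To make the generalist interface concrete, the following minimal sketch fixes the kind of observation-to-action contract an RFM exposes. The class and field names are illustrative assumptions, not any specific model's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray              # (H, W, 3) camera image
    depth: np.ndarray            # (H, W) depth map
    proprio: np.ndarray          # joint positions / velocities
    instruction: str             # natural-language goal, e.g. "put the cup in the sink"

@dataclass
class Action:
    joint_targets: Optional[np.ndarray] = None   # low-level command, or ...
    skill_call: Optional[str] = None             # ... a call into a reusable skill library

class RobotFoundationModel:
    """Generalist policy: multimodal observation in, action (or skill call) out."""

    def act(self, obs: Observation, history: Sequence[Observation] = ()) -> Action:
        # A real RFM tokenizes obs + instruction, runs a large transformer,
        # and decodes an action chunk; only the interface is fixed here.
        raise NotImplementedError
```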
A key paradigm in RFM design has historically been the "monolithic assumption": the belief that a single, sufficiently large and diverse neural model can subsume all cognitive functions in robotics, including world-state estimation, long-horizon planning, motion policy generation, uncertainty handling, and collaborative interaction with humans, in a single end-to-end pipeline (Sun et al., 30 Nov 2025). This assumption underpins the integrated vision–language–action (VLA) policies and embodied vision–language models (VLMs) that dominate recent literature.
However, real-world deployment has revealed fundamental limitations with the monolithic approach:
- Lack of embodiment-aware action grounding: RFMs typically output symbolic or language-like actions (e.g., “pick up the cup”), deferring actual execution to hand-coded controllers; they lack direct integration of kinematic, dynamic, and geometric collision constraints.
- Weak geometric grounding: While semantic recognition is strong, open-set 3D pose estimation in cluttered or dynamic scenes remains unreliable, leading to failures in precise manipulation tasks.
- Reactive brittleness: Autoregressive VLAs have limited memory and verification, often hallucinating actions or failing to replan when the environment or human collaborators intervene unexpectedly (Sun et al., 30 Nov 2025).
2. RFM Architectures: Monolithic, Modular, and Multi-Agent
RFM architectures can be classified along multiple axes:
| Type | Description | Example Models |
|---|---|---|
| Monolithic | Unified multimodal transformer for all functions | RT-2, Gato, RoboCat |
| Modular | Specialized sub-models for perception/planning/control | SayCan, Code as Policies |
| Multi-Agent | Loosely coupled LLM-driven agents for each cognitive role | InteractGen |
The monolithic approach treats the foundation model as a single end-to-end pipeline: VLA policies such as RT-2 and Gato are trained over massive demonstration corpora and map observations and instructions directly to actions.
The modular approach decomposes the stack, for instance, with LLM planners, VLM perceptual front-ends, and explicit skill libraries. A canonical example is SayCan, which fuses LLM-generated high-level plans with learned or engineered affordance models (Sun et al., 30 Nov 2025, Khan et al., 14 Jul 2025, Hu et al., 2023).
Multi-agent orchestration is exemplified by InteractGen, which decomposes robot cognition into specialized LLM agents: Perceiver (continuous sensing), Planner (dependency-aware Thought-of-Action planning), Assigner (robot–human allocation), Validator (pre-execution verification), and Manager (reflection, error handling, human clarification). This closed-loop collective orchestrates perception, planning, execution, and failure recovery with explicit verification and robust delegation between robots and humans (Sun et al., 30 Nov 2025).
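The closed-loop division of labor described above can be summarized as a short orchestration loop. This is schematic only: the agent objects, method names, and shared-memory layout are hypothetical placeholders, not the InteractGen implementation.

```python
# Schematic closed-loop orchestration of specialized agents; all interfaces are
# hypothetical placeholders, not the InteractGen API.

def run_episode(goal, agents, memory, max_steps=50):
    """agents: dict with 'perceiver', 'planner', 'assigner', 'validator', 'manager'.
    memory: shared dict initialized with empty 'failures' and 'log' lists."""
    for _ in range(max_steps):
        memory["scene"] = agents["perceiver"].observe()              # continuous sensing
        plan = agents["planner"].plan(goal, memory)                  # dependency-aware subtasks
        for subtask in plan.ready_subtasks():
            executor = agents["assigner"].assign(subtask, memory)    # robot or human
            if not agents["validator"].check(subtask, memory):       # pre-execution verification
                memory["failures"].append(subtask)
                continue
            result = executor.execute(subtask)
            memory["log"].append((subtask, result))
            if not result.success:
                agents["manager"].reflect(subtask, result, memory)   # replan or ask a human
        if plan.complete(memory):
            return True
    return False
```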
3. Scaling Laws, Emergent Capabilities, and Generalization
Meta-analyses have revealed that RFMs exhibit scaling laws analogous to those in NLP and CV, but with significantly steeper exponents: the failure rate $L$ (the complement of the success rate) follows a power law in compute $C$, data volume $D$, and model size $N$:

$L \propto C^{-\alpha_C}, \quad L \propto D^{-\alpha_D}, \quad L \propto N^{-\alpha_N}$
Typical exponents for RFMs are, in absolute value, roughly three to four times larger than the corresponding LLM exponents of $0.05$, $0.076$, and $0.095$ (Sartor et al., 2024). This implies RFMs accrue performance improvements more rapidly with increased data or parameter count, but also face rapidly diminishing returns.
- Data scaling: Doubling demonstration data reduces failure rate by ~19%.
- Model scaling: Doubling model size reduces failure rate by ~24%.
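The quoted per-doubling reductions can be converted into effective power-law exponents under the assumption that the failure rate follows a pure power law in the scaled resource. This is a back-of-the-envelope check, not figures taken from the cited paper:

```python
import math

# Under a pure power law, failure_rate ∝ resource^(-alpha), so doubling the
# resource multiplies the failure rate by 2^(-alpha).
def implied_exponent(reduction_per_doubling):
    return -math.log2(1.0 - reduction_per_doubling)

alpha_data  = implied_exponent(0.19)   # ≈ 0.30 (demonstration data)
alpha_model = implied_exponent(0.24)   # ≈ 0.40 (model size)
print(f"alpha_data ≈ {alpha_data:.2f}, alpha_model ≈ {alpha_model:.2f}")
# Both values are roughly three to four times the LLM exponents quoted above.
```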
Emergent capabilities, such as multi-step planning or zero-shot adaptation, appear only when specific resources are scaled past critical thresholds in the number of demonstration episodes or model parameters (Sartor et al., 2024). This emergence is sharper than in language-only domains.
However, zero-shot generalization to entirely new robot embodiments (morphology, camera configuration) remains limited. Studies such as (Salzer et al., 2024) show that pre-trained models like RT-1-X require embodiment-specific fine-tuning for new robots, with skill transfer outpacing object-identity transfer—a critical bottleneck for universal generalization.
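Embodiment-specific adaptation of this kind is increasingly done with parameter-efficient fine-tuning rather than full retraining. The snippet below is a minimal, self-contained LoRA-style adapter in PyTorch, given purely to illustrate the idea; it is not the RT-1-X fine-tuning recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Illustrative of LoRA-style adaptation."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt one projection of a pretrained policy to a new embodiment.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
```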
4. Integration in Embodied Systems: Perception, Planning, Action
RFMs have been configured to serve as the central backbone for perception, planning, and control across manipulation, navigation, and high-level reasoning.
- Perception: VLMs (e.g., CLIP, BLIP, GroundingDINO) are used for open-vocabulary detection and semantic labeling; models such as PaLM-E and Gato fuse continuous sensor traces into language-conditioned representations (Kawaharazuka et al., 2024, Khan et al., 14 Jul 2025).
- Planning: LLMs are used for high-level task decomposition via chain-of-thought prompting and symbolic code-generation pipelines. Structured approaches, such as dependency-aware planners, exploit explicit intermediate task representations (Sun et al., 30 Nov 2025, Bai et al., 28 Dec 2025, Li et al., 2024); a schematic perception–planning–execution pipeline is sketched after this list.
- Action: Low-level policies range from direct behavioral cloning of trajectory data to diffusion-based policy generators and code synthesizers. Integration with symbolic or predefined skill primitives remains common for guaranteeing kinematic and safety properties (Khan et al., 14 Jul 2025, Tölle et al., 15 May 2025).
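The modular split above (cf. the modular architectures in Section 2) can be made concrete with a toy pipeline. Everything here is stubbed and hypothetical: the detector, the planner output, and the two-skill library stand in for a VLM, an LLM, and a real controller.

```python
# Schematic modular pipeline (perception -> planning -> skill execution).
# All names are placeholders; the skill library and planner are stubbed.

SKILLS = {
    "pick(obj)":       lambda obj: print(f"picking {obj}"),
    "place(obj, loc)": lambda obj, loc: print(f"placing {obj} at {loc}"),
}

def perceive(image):
    """Open-vocabulary detection (stand-in for a VLM); returns labeled objects."""
    return ["cup", "sink"]                       # stubbed detection result

def plan(instruction, objects):
    """LLM-style decomposition into skill-library calls (stubbed here)."""
    return [("pick(obj)", ("cup",)), ("place(obj, loc)", ("cup", "sink"))]

def execute(instruction, image):
    objects = perceive(image)
    for skill_name, args in plan(instruction, objects):
        SKILLS[skill_name](*args)                # low-level controller handles kinematics/safety

execute("put the cup in the sink", image=None)
```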
A notable system, InteractGen (Sun et al., 30 Nov 2025), operationalizes a closed-loop multi-agent framework, orchestrating perception, hierarchical planning, human–robot allocation, action verification, and reflective fault handling. Each specialized agent operates over shared memory, ensuring outcomes are verified and failures trigger replanning—a critical advance over brittle monolithic policies.
5. Safety, Regulation, and Physical Risk Mitigation
RFMs, when deployed in the real world, expose significant safety and reliability challenges. End-to-end behavior cloning policies have no formal guarantees; purely data-driven approaches may require impractically large datasets to sufficiently cover unsafe edge cases (Tölle et al., 15 May 2025, 2503.07404, Kojima et al., 19 May 2025).
The ATACOM safety layer formally augments pre-trained RFMs via a modular, post-hoc action projection technique:
- The robot’s control output is projected onto the tangent space of the constraint manifold defined by hand-specified or a priori known geometric constraints (e.g., collision, joint limits, workspace).
- The corrected safe action takes the schematic form $u_{\text{safe}} = N_c(q)\,u - J_c^{\dagger}(q)\,K\,c(q)$, where $u$ is the nominal policy output, $N_c(q)$ spans the tangent (null) space of the constraint Jacobian $J_c(q)$, and the second term drives constraint violations back toward the manifold, guaranteeing forward invariance under the constraint (Tölle et al., 15 May 2025, 2503.07404).
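A minimal numerical sketch of this projection idea is shown below for a single equality constraint. It is schematic only; the full ATACOM layer additionally handles inequality constraints (e.g., via slack variables) and runs inside the control loop.

```python
import numpy as np

def safe_action(u_nominal, c_value, J_c, K=5.0):
    """Project a nominal velocity command onto the tangent space of the
    constraint manifold and add a term that drives the constraint value
    back toward zero (schematic, single equality constraint c(q) = 0)."""
    J_c = np.atleast_2d(J_c)                       # (m, n) constraint Jacobian
    J_pinv = np.linalg.pinv(J_c)                   # (n, m) pseudo-inverse
    N = np.eye(J_c.shape[1]) - J_pinv @ J_c        # null-space (tangent) projector
    return N @ u_nominal - J_pinv @ (K * np.atleast_1d(c_value))

# Example: restrict a 2-DoF point to the plane q[0] = 0.5, i.e. c(q) = q[0] - 0.5 = 0.
q = np.array([0.52, 0.10])                         # slightly off the manifold
u = np.array([0.30, 0.20])                         # nominal command has a normal component
print(safe_action(u, c_value=q[0] - 0.5, J_c=np.array([1.0, 0.0])))
```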
Empirically, applying ATACOM preserves task success while attaining 100% safe executions in both quasi-static and dynamic settings, with a negligible increase in execution time. Limitations persist, including the need for analytic constraint knowledge and reliance on depth or multi-view perception for workspace modeling.
Broader risk-mitigation frameworks for RFMs now encompass:
- Pre-deployment: formal safety verification, stress-testing under simulated and adversarial perturbations.
- Run-time: LLM/VLM-generated monitors ("code-as-monitor"), OOD detection, and control-barrier–filtered actions.
- Post-incident: LLM-initiated recovery, teleoperator fallback, and weighted imitation learning with negative weighting of intervention segments (Kojima et al., 19 May 2025).
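The run-time layer can be as simple as a wrapper that scores each observation for distribution shift and either passes the policy action through a safety filter or halts for human takeover. The sketch below is illustrative; the Mahalanobis OOD score, threshold, and fallback behavior are assumptions, not prescriptions from the cited surveys.

```python
# Schematic run-time risk-mitigation wrapper combining monitoring, filtering,
# and a fallback to human takeover. Thresholds and scores are illustrative.

import numpy as np

def ood_score(obs_embedding, train_mean, train_cov_inv):
    """Mahalanobis distance of the current observation embedding to the
    training distribution; a simple out-of-distribution proxy."""
    d = obs_embedding - train_mean
    return float(np.sqrt(d @ train_cov_inv @ d))

def monitored_step(policy_action, obs_embedding, train_mean, train_cov_inv,
                   safety_filter, ood_threshold=3.0):
    if ood_score(obs_embedding, train_mean, train_cov_inv) > ood_threshold:
        return np.zeros_like(policy_action), "halt: OOD, request teleoperator"
    return safety_filter(policy_action), "ok"
```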
6. Evaluations: Benchmarks, Metrics, and Human-in-the-Loop Studies
Evaluation protocols for RFMs emphasize simulation studies, open-ended real-world deployments, and human-in-the-loop feedback.
- InteractGen (Sun et al., 30 Nov 2025):
  - Simulation (TextWorld-derived office): 77% task success, 3% redundancy.
  - Real-world (400 m² office): 70% success, 68% robot subtask rate, with humans performing 31% of subtasks (mostly high-dexterity).
  - Substantial improvements (e.g., +15% success over monolithic baselines, –25% token usage per task).
  - User survey: 73% satisfaction, 81% perceived productivity gain.
- RT-1-X (Salzer et al., 2024): zero-shot transfer to unseen robots fails; fine-tuning recovers ~23% task success.
- ATACOM (Tölle et al., 15 May 2025, 2503.07404): post-processed RFM policies achieve 100% "safe success rate" without loss in overall success rate; real-world deployments show the safety layer eliminating dangerous behaviors that would trigger hardware shutoff.
Key evaluation metrics include: Success Rate (SR), Completion Rate (CR), Redundancy Rate (RR), Robot Subtask Rate (RSR), wall-clock time per task, and token usage. Additionally, safety incidents, human intervention rates, and satisfaction scores provide critical feedback for human–robot collaboration performance.
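Given per-episode logs, these metrics reduce to simple ratios. The helper below illustrates one possible computation; the log schema (the dictionary keys) is an assumption for the example, not a standard format.

```python
# Illustrative metric computation from per-episode logs (schema is assumed).

def summarize(episodes):
    n = len(episodes)
    subtasks = [s for ep in episodes for s in ep["subtasks"]]
    return {
        "SR":  sum(ep["success"] for ep in episodes) / n,                           # Success Rate
        "CR":  sum(ep["completed_subtasks"] / ep["total_subtasks"]
                   for ep in episodes) / n,                                         # Completion Rate
        "RR":  sum(s["redundant"] for s in subtasks) / len(subtasks),               # Redundancy Rate
        "RSR": sum(s["executed_by"] == "robot" for s in subtasks) / len(subtasks),  # Robot Subtask Rate
        "avg_time_s": sum(ep["wall_clock_s"] for ep in episodes) / n,
        "avg_tokens": sum(ep["tokens"] for ep in episodes) / n,
    }
```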
7. Future Directions and Limitations
Major research thrusts for RFMs include:
- Generalization and Adaptation: Increasing diversity in pretraining data (multi-embodiment, multi-view, mixed-object), explicit morphology encoding, and data-efficient adaptation methods (e.g., LoRA, prefix-tuning) for new robots (Salzer et al., 2024).
- System-level Robustness: Standardized embodied AI benchmarks, open-ended real-world deployments, more realistic simulation-to-real transfers (Sartor et al., 2024, Bai et al., 28 Dec 2025).
- Safety and Ethics: Automated discovery of safety constraints via vision–LLMs, integration of formal verification pipelines, and development of ethical reward shaping and runtime alignment checks (Tölle et al., 15 May 2025, Kojima et al., 19 May 2025, Khan et al., 14 Jul 2025).
- Multi-Agent and Modular Architectures: Closed-loop orchestration of specialized agents for scalable, interpretable, and robust robot intelligence—foundation models as regulated components rather than monolithic controllers (Sun et al., 30 Nov 2025).
- Holistic Multimodality: Grounded integration of audio, haptics, and proprioception for robust interaction; real-time edge deployment challenges remain (Khan et al., 14 Jul 2025).
Current RFMs remain constrained by data scarcity, high computational demands, incomplete coverage of safety constraints, and limited ability to generalize zero-shot across robot embodiments and out-of-distribution environments. Safety guarantees presently require explicit constraint modeling and do not automatically extend to open-world interaction.
References
- (Sun et al., 30 Nov 2025) Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration
- (Sartor et al., 2024) Neural Scaling Laws in Robotics
- (Salzer et al., 2024) Bringing the RT-1-X Foundation Model to a SCARA robot
- (Tölle et al., 15 May 2025) Towards Safe Robot Foundation Models Using Inductive Biases
- (2503.07404) Towards Safe Robot Foundation Models
- (Kojima et al., 19 May 2025) A Comprehensive Survey on Physical Risk Control in the Era of Foundation Model-enabled Robotics
- (Bai et al., 28 Dec 2025) Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives
- (Khan et al., 14 Jul 2025) Foundation Model Driven Robotics: A Comprehensive Review
- (Xiao et al., 2023) Robot Learning in the Era of Foundation Models: A Survey
- (Hu et al., 2023) Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis