
Robot Foundation Models Overview

Updated 3 February 2026
  • Robot Foundation Models (RFMs) are large transformer-based neural architectures that integrate perception, language, and action for versatile robotic control.
  • RFMs enable zero- or few-shot generalization by collapsing traditional robotics pipelines into scalable, data-driven frameworks for tasks like manipulation and navigation.
  • Architectural variants such as monolithic, modular, and multi-agent designs provide trade-offs in integration, safety, and performance for real-world robotic applications.

Robot Foundation Models (RFMs) are large-scale, multimodal neural architectures, typically transformer-based, that unify perception, language understanding, and action planning for embodied robots. Unlike classical robot software, where perception, planning, and control are realized as separate, manually engineered modules, RFMs seek to provide a generalist backbone: they ingest raw sensor data (RGB, depth, point clouds, tactile) and high-level instructions (natural language, speech), and emit sequences of control actions or calls to reusable skill libraries. The prevailing motivation is that a single pre-trained, large-capacity model can accelerate adaptation, enable zero- or few-shot generalization, and collapse the traditional robotic pipeline into a scalable, data-driven paradigm. RFMs have rapidly become central to the pursuit of general-purpose, language-driven robotic autonomy, with demonstrated progress in manipulation, navigation, high-level reasoning, and human–robot collaboration (Sun et al., 30 Nov 2025, Sartor et al., 2024, Khan et al., 14 Jul 2025, Xiao et al., 2023).

1. Definition, Role, and RFM Assumptions

RFMs are defined as large models pre-trained on diverse robot-centric data—trajectories, demonstrations, images, language, and occasionally touch, force, or audio streams. Their purpose is to serve as a generalist substrate that can be adapted or prompted for new robots, environments, and tasks, outperforming the rigid, task-specific training pipelines of traditional robotics (Hu et al., 2023, Sartor et al., 2024).

A key paradigm in RFM design has historically been the "monolithic assumption": the belief that a single, sufficiently large and diverse neural model can subsume all cognitive functions in robotics, including world-state estimation, long-horizon planning, motion policy generation, uncertainty handling, and collaborative interaction with humans, in a single end-to-end pipeline (Sun et al., 30 Nov 2025). This assumption underpins the integrated vision–language–action (VLA) policies and embodied vision–language models (VLMs) that dominate the recent literature.

However, real-world deployment has revealed fundamental limitations with the monolithic approach:

  • Lack of embodiment-aware action grounding: RFMs typically output symbolic or language-like actions (e.g., “pick up the cup”), deferring actual execution to hand-coded controllers; they lack direct integration of kinematic, dynamic, and geometric collision constraints.
  • Weak geometric grounding: While semantic recognition is strong, open-set 3D pose estimation in cluttered or dynamic scenes remains unreliable, leading to failures in precise manipulation tasks.
  • Reactive brittleness: Autoregressive VLAs have limited memory and verification, often hallucinating actions or failing to replan when the environment or human collaborators intervene unexpectedly (Sun et al., 30 Nov 2025).

2. RFM Architectures: Monolithic, Modular, and Multi-Agent

RFM architectures can be classified along several axes; the most prominent is the degree of integration across cognitive functions:

Type | Description | Example Models
Monolithic | Unified multimodal transformer for all functions | RT-2, Gato, RoboCat
Modular | Specialized sub-models for perception/planning/control | SayCan, Code-as-Policies
Multi-Agent | Loosely coupled LLM-driven agents for each cognitive role | InteractGen

The monolithic approach treats the foundation model as a single pipeline; end-to-end VLA policies are trained over massive demonstration corpora (e.g., RT-2, Gato).

The modular approach decomposes the stack, for instance, with LLM planners, VLM perceptual front-ends, and explicit skill libraries. A canonical example is SayCan, which fuses LLM-generated high-level plans with learned or engineered affordance models (Sun et al., 30 Nov 2025, Khan et al., 14 Jul 2025, Hu et al., 2023).
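
To make the modular decomposition concrete, the following minimal Python sketch shows SayCan-style fusion of LLM plan scores with affordance estimates. The scoring functions `llm_logprob` and `affordance` are hypothetical stand-ins for the LLM scorer and the learned value function, not the actual SayCan API.

```python
import numpy as np

# Minimal sketch of SayCan-style skill selection: an LLM scores each skill's
# relevance to the instruction, an affordance model scores its feasibility in
# the current state, and the product ranks candidates. Both scoring functions
# are hypothetical stubs, not the actual SayCan implementation.
def select_skill(instruction, state, skills, llm_logprob, affordance):
    scores = [np.exp(llm_logprob(instruction, s)) * affordance(state, s)
              for s in skills]
    return skills[int(np.argmax(scores))]
```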

Multi-agent orchestration is exemplified by InteractGen, which decomposes robot cognition into specialized LLM agents: Perceiver (continuous sensing), Planner (dependency-aware Thought-of-Action planning), Assigner (robot–human allocation), Validator (pre-execution verification), and Manager (reflection, error handling, human clarification). This closed-loop collective orchestrates perception, planning, execution, and failure recovery with explicit verification and robust delegation between robots and humans (Sun et al., 30 Nov 2025).
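
A minimal orchestration skeleton, assuming stub agent objects with the methods shown (the real InteractGen agents are LLM-driven; the names and signatures here are illustrative only), conveys the closed loop of sensing, planning, verification, execution, and reflection:

```python
from dataclasses import dataclass, field

# Sketch of an InteractGen-style closed loop (Sun et al., 30 Nov 2025).
# Each role (Perceiver, Planner, Assigner, Validator, Manager) is stubbed;
# in the real system these are LLM-driven agents over shared memory.

@dataclass
class SharedMemory:
    world_state: dict = field(default_factory=dict)
    plan: list = field(default_factory=list)
    history: list = field(default_factory=list)

def run_task(instruction, agents, memory, max_rounds=10):
    for _ in range(max_rounds):
        memory.world_state = agents["perceiver"].sense()           # continuous sensing
        memory.plan = agents["planner"].plan(instruction, memory)  # dependency-aware planning
        for step in memory.plan:
            executor = agents["assigner"].assign(step, memory)     # robot or human
            if not agents["validator"].check(step, memory):        # pre-execution verification
                agents["manager"].reflect(step, memory)            # reflection / clarification
                break                                              # replan from updated memory
            executor.execute(step)
            memory.history.append(step)
        else:
            return True   # every step verified and executed
    return False          # round budget exhausted; escalate to a human
```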

3. Scaling Laws, Emergent Capabilities, and Generalization

Meta-analyses have revealed that RFMs exhibit scaling laws analogous to those in NLP and CV, but with significantly steeper exponents: performance (measured as failure rate, FR) follows a power law in compute ($C$), data volume ($D$), and model size ($M$):

$\mathrm{FR}(x) = \alpha \cdot x^{\beta} + \gamma, \quad x \in \{C, D, M\}$

Typical exponents for RFMs are three to four times larger in absolute value than those of pure LLMs ($|\beta_C| \approx 0.15$, $|\beta_D| \approx 0.30$, $|\beta_M| \approx 0.39$, versus $0.05$, $0.076$, $0.095$ for LLMs) (Sartor et al., 2024). This implies that RFMs accrue performance improvements more rapidly with increased data or parameter count, but also face diminishing returns as each resource grows.

  • Data scaling: Doubling demonstration data reduces failure rate by ~19%.
  • Model scaling: Doubling model size reduces failure rate by ~24% (both doubling figures follow from the exponents, as the sketch below illustrates).
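
These doubling figures follow directly from the fitted exponents: with $\gamma \approx 0$, doubling resource $x$ multiplies FR by $2^{\beta}$. A short sketch of the arithmetic, with illustrative $\alpha$ and $\gamma$:

```python
# Reproduce the doubling arithmetic implied by the exponents reported in
# Sartor et al. (2024). alpha and gamma are illustrative placeholders; only
# beta matters for the relative reduction when gamma is near zero.
def failure_rate(x, alpha=1.0, beta=-0.30, gamma=0.0):
    """Power-law failure rate FR(x) = alpha * x**beta + gamma."""
    return alpha * x**beta + gamma

for resource, beta in [("compute", -0.15), ("data", -0.30), ("model size", -0.39)]:
    # Relative FR drop when x doubles: 1 - FR(2x)/FR(x) = 1 - 2**beta
    reduction = 1.0 - failure_rate(2.0, beta=beta) / failure_rate(1.0, beta=beta)
    print(f"Doubling {resource}: ~{reduction:.0%} lower failure rate")
# -> ~10% (compute), ~19% (data), ~24% (model size)
```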

Emergent capabilities, such as multi-step planning or zero-shot adaptation, appear only when scaling specific resources past critical thresholds (e.g., $>10^5$ episodes or $>10^8$ parameters) (Sartor et al., 2024). This emergence is sharper than in language-only domains.

However, zero-shot generalization to entirely new robot embodiments (morphology, camera configuration) remains limited. Salzer et al. (2024) show that pre-trained models such as RT-1-X require embodiment-specific fine-tuning for new robots, with skill transfer outpacing object-identity transfer, a critical bottleneck for universal generalization.
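
A recipe consistent with these findings is to freeze the pre-trained backbone and fine-tune only the action head on demonstrations collected on the new robot. The sketch below uses generic PyTorch with hypothetical attribute names (`backbone`, `action_head`); it is not the actual RT-1-X interface.

```python
import torch
import torch.nn.functional as F

# Hypothetical embodiment-specific fine-tuning: freeze the shared backbone,
# adapt only the action head on demos from the new robot. Assumes a policy
# module exposing .backbone and .action_head, and continuous actions (MSE
# behavior cloning); a discrete action head would use cross-entropy instead.
def finetune_for_embodiment(policy, demo_loader, lr=1e-4, epochs=10):
    for p in policy.backbone.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(policy.action_head.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, expert_action in demo_loader:
            loss = F.mse_loss(policy(obs), expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```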

4. Integration in Embodied Systems: Perception, Planning, Action

RFMs have been configured to serve as the central backbone for perception, planning, and control across manipulation, navigation, and high-level reasoning.

A notable system, InteractGen (Sun et al., 30 Nov 2025), operationalizes a closed-loop multi-agent framework, orchestrating perception, hierarchical planning, human–robot allocation, action verification, and reflective fault handling. Each specialized agent operates over shared memory, ensuring outcomes are verified and failures trigger replanning—a critical advance over brittle monolithic policies.

5. Safety, Regulation, and Physical Risk Mitigation

RFMs, when deployed in the real world, expose significant safety and reliability challenges. End-to-end behavior cloning policies have no formal guarantees; purely data-driven approaches may require impractically large datasets to sufficiently cover unsafe edge cases (Tölle et al., 15 May 2025, 2503.07404, Kojima et al., 19 May 2025).

The ATACOM safety layer formally augments pre-trained RFMs via a modular, post-hoc action projection technique:

  • The robot’s control output is projected onto the tangent space of the constraint manifold defined by hand-specified or a priori known geometric constraints (e.g., collision, joint limits, workspace).
  • The corrected safe action is:

$u_{\mathrm{safe}}(x, u_{\mathrm{RFM}}) = u_{\mathrm{drift}}(x) + u_{\mathrm{err}}(x) + N(x)\, u_{\mathrm{RFM}}$

guaranteeing forward invariance under the constraint $g(x) \leq 0$ (Tölle et al., 15 May 2025, 2503.07404).

Empirically, applying ATACOM preserves task success while attaining 100% safe executions in both quasi-static and dynamic settings, with a negligible increase in execution time. Limitations persist, including the need for analytic constraint knowledge and reliance on depth or multi-view perception for workspace modeling.
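
The projection structure can be sketched as follows. This minimal version builds the tangent-space basis $N(x)$ from the constraint Jacobian's nullspace and omits ATACOM's slack-variable handling of inequality constraints as well as the drift term; all names and the gain `k_err` are illustrative, not the library's API.

```python
import numpy as np

# Simplified ATACOM-style safety projection (after Tölle et al., 2025).
# g(x): constraint values (safe when <= 0); J(x): constraint Jacobian.
# u_rfm is expressed in tangent-space coordinates of the constraint
# manifold; slack variables and drift compensation are omitted here.
def safe_action(x, u_rfm, g, J, k_err=5.0):
    Jx = J(x)
    # N(x): nullspace basis of the Jacobian, spanning the tangent space of
    # the constraint manifold; motions N @ u leave g unchanged to first order.
    _, s, Vt = np.linalg.svd(Jx)
    rank = int(np.sum(s > 1e-8))
    N = Vt[rank:].T
    # u_err drives violated constraints (g > 0) back toward the manifold.
    u_err = -k_err * np.linalg.pinv(Jx) @ np.maximum(g(x), 0.0)
    return u_err + N @ u_rfm
```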

Broader risk-mitigation frameworks for RFMs now encompass:

  • Pre-deployment: formal safety verification, stress-testing under simulated and adversarial perturbations.
  • Run-time: LLM/VLM-generated monitors ("code-as-monitor"), OOD detection, and control-barrier–filtered actions (a minimal sketch follows this list).
  • Post-incident: LLM-initiated recovery, teleoperator fallback, and intervention-weighted imitation learning that down-weights behaviors corrected by humans (Kojima et al., 19 May 2025).
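
As a minimal illustration of the run-time layer above, the following sketch gates a policy action behind an OOD check and a barrier-style filter; `ood_score`, `threshold`, `barrier_filter`, and `fallback` are all hypothetical components:

```python
# Hypothetical run-time safety stack: an OOD gate decides whether to trust
# the policy; a barrier-style filter then constrains the proposed action.
def guarded_step(obs, policy, ood_score, threshold, barrier_filter, fallback):
    if ood_score(obs) > threshold:       # observation is out-of-distribution
        return fallback(obs)             # e.g., safe-stop or teleoperator handover
    action = policy(obs)                 # trust the RFM policy in-distribution
    return barrier_filter(obs, action)   # e.g., ATACOM-style safe projection
```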

6. Evaluations: Benchmarks, Metrics, and Human-in-the-Loop Studies

Evaluation protocols for RFMs span simulation studies, real-world open use, and human-in-the-loop feedback.

  • InteractGen (Sun et al., 30 Nov 2025):
    • Simulation (TextWorld-derived office): 77% task success, 3% redundancy.
    • Real-world (400 m² office): 70% success, 68% robot subtask rate, with humans performing 31% of subtasks (mostly high-dexterity).
    • Substantial improvements (e.g., +15% success over monolithic baselines, –25% token usage per task).
    • User survey: 73% satisfaction, 81% perceived productivity gain.
  • RT-1-X (Salzer et al., 2024): zero-shot transfer to unseen robots fails; fine-tuning recovers ~23% task success.
  • ATACOM (Tölle et al., 15 May 2025, 2503.07404): post-processed RFM policies achieve 100% "safe success rate" without loss in overall success rate; real-world deployments show the safety layer eliminating dangerous behaviors that would trigger hardware shutoff.

Key evaluation metrics include Success Rate (SR), Completion Rate (CR), Redundancy Rate (RR), Robot Subtask Rate (RSR), wall-clock time per task, and token usage. Additionally, safety incidents, human intervention rates, and satisfaction scores provide critical feedback on human–robot collaboration performance.
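
A small aggregation sketch makes the ratio metrics concrete; the per-episode record fields are hypothetical:

```python
# Aggregate the evaluation metrics named above from per-episode records.
# Field names ("success", "subtasks_total", ...) are hypothetical.
def aggregate_metrics(episodes):
    n = len(episodes)
    return {
        "SR": sum(e["success"] for e in episodes) / n,
        "CR": sum(e["subtasks_done"] / e["subtasks_total"] for e in episodes) / n,
        "RR": sum(e["redundant_steps"] / e["total_steps"] for e in episodes) / n,
        "RSR": sum(e["robot_subtasks"] / e["subtasks_total"] for e in episodes) / n,
    }
```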

7. Future Directions and Limitations

Major research thrusts for RFMs target their current constraints: data scarcity, high computational demands, incomplete coverage of safety constraints, and limited zero-shot generalization across robot embodiments and out-of-distribution environments. Safety guarantees presently require explicit constraint modeling and do not automatically extend to open-world interaction.

