Robots Need More than VLA and World Models

Published 4 Jun 2026 in cs.RO | (2606.06556v1)

Abstract: Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.


Authors (9)

Summary

The paper's main contribution is the argument that merely scaling VLA and world models is insufficient without robust grounding mechanisms.
It argues that integrating heterogeneous physical experience with physics-grounded simulation can bridge the gap between unstructured data and robot supervision.
The paper proposes an architecture built around physical data engines, cross-embodiment retargeting, physics-grounded world models, and self-improving deployment loops.

Rethinking the Path to Generalist Robot Intelligence: Beyond VLA Models and World Models

Introduction: The Core Bottleneck in Robotic Generalization

"Robots Need More Than VLA and World Models" (2606.06556) provides a comprehensive analysis of the future of generalist robot intelligence. The central thesis is that viewing generalization in robotics purely as a large-scale policy learning or Vision-Language-Action (VLA) model scaling problem is inadequate. The authors contend that a more foundational roadblock persists: the lack of effective mechanisms to translate unstructured and heterogeneous physical experience—including human motion, internet-scale video, simulation rollouts, and interactive demonstrations—into robot-grounded supervision. This calls for an architectural rethinking beyond current trends, specifically through the introduction of data, embodiment, world-model, and reward interfaces capable of bridging the gap between the physical world's behavior and robot-usable signals.

Scaling Limits of Robot-Native Supervision

Most state-of-the-art robotics learning pipelines depend on robot-native datasets: demonstrations paired with time-synchronized high-fidelity action, state, and reward labels in the coordinate system of a specific robot. This paradigm is exemplified by the expansion in both scale and diversity of datasets such as RoboNet (Dasari et al., 2019), BridgeData V2 [walke2023bridgedata], DROID [khazatsky2024droid], RH20T [fang2023rh20t], and large-scale policy efforts like BC-Z [BCZ], RT-1 [brohan2022rt], RT-2 [zitkovich2023rt], Open X-Embodiment [o2024open], and Octo [team2024octo].

While these pipelines demonstrate that learning performance increases with dataset scale and diversity, they reinforce a critical limitation: the supervision pipeline is bottlenecked by prior grounding of action, embodiment, and rewards. The expense and difficulty arise from the need to generate physical experience that is both safe and meaningful for robot learning: each data point must be physically executed, and failures carry operational costs.

Figure 1: Next generation robotics will come from advances that go well beyond scaling vision language action (VLA) models.

Expanding the Supervision Set: Weakly Grounded Experience

The paper identifies a broad research thrust that aims to leverage weakly grounded observations, such as human or internet-scale videos, for robot learning. Instead of robot-action pairs, these approaches rely on observation-only sequences, sometimes augmented with language or implicit task metadata. The central technical issue is the absence of embodiment-specific actions and reward signals, which makes it necessary to discover intermediate latent variables (representations, progress signals, latent actions, or behavioral priors) that can later be mapped to a given robot's action space.

The literature on this front includes:

Representation learning via contrastive or self-supervised objectives (e.g., R3M [nair2022r3m], VIP [ma2022vip], MVP [radosavovic2023real], VC-1 [majumdar2023we]).
Latent action modeling (e.g., LAPA [ye2024latent], UniVLA [bu2025univla]), where unsupervised clustering or VQ-VAE-style objectives are used to compress observed transitions into discrete, transferable codes.
Task progress/reward inference and grounding from video using models like PROGRESSOR [ayalew2025progressor], Adapt2Reward [yang2024adapt2reward], ReWiND [zhang2025rewind], TimeRewarder [liu2025timerewarder], and SARM [chen2025sarm].

Nevertheless, these approaches do not obviate the need for grounding. The transformation from weak physical signals to usable robot reward and action still necessitates robust mechanisms for correspondence and semantic alignment, particularly when bridging diverse embodiments.
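To make the idea of a discrete latent action concrete, the following minimal sketch assigns each observed transition to its nearest entry in a codebook. It illustrates only the quantization step, under strong simplifying assumptions: the 2-D frame features and the hand-written codebook are hypothetical, whereas methods such as LAPA or UniVLA learn both the features and the codebook from video. It is not the implementation of any cited method.

```python
import math


def nearest_code(delta, codebook):
    """Return the index of the codebook entry closest to `delta` (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(delta, codebook[k]))


def latent_actions(features, codebook):
    """Label each transition f_t -> f_{t+1} with a discrete latent-action code."""
    codes = []
    for prev, nxt in zip(features, features[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]  # change between frames
        codes.append(nearest_code(delta, codebook))
    return codes


# Hypothetical per-frame features, e.g. a tracked object position.
frames = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (2.0, 1.1), (2.1, 2.0)]
# Hypothetical codebook: 0 = "stay", 1 = "move right", 2 = "move up".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

codes = latent_actions(frames, codebook)
print(codes)  # [1, 1, 2, 2]
```

The resulting codes carry no robot commands. Mapping code 1 ("move right") to joint or end-effector actions for a particular robot is exactly the grounding step that remains open.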

Experience Generation: Simulation and World Modeling

As an orthogonal scaling axis, the expansion of physical experience via simulation, synthetic trajectory generation, or learned world models circumvents some constraints of hardware-collected data. The utility of simulated environments (RLBench [james2019rlbenchrobotlearningbenchmark], Meta-World [yu2020meta], ManiSkill [mu2021maniskillgeneralizablemanipulationskill], CALVIN [mees2022calvinbenchmarklanguageconditionedpolicy], LIBERO [liu2023liberobenchmarkingknowledgetransfer]) lies in their ability to mass-produce goal-oriented demonstrations, counterfactuals, and failure cases. Automated generation frameworks such as MimicGen [mandlekar2023mimicgendatagenerationscalable], RoboCasa [nasiriany2024robocasa], RoboGen [wang2024robogenunleashinginfinitedata], and RoboGSim [li2025robogsimreal2sim2realroboticgaussian] dramatically increase scalability.
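One way such frameworks multiply a few seed demonstrations is to express a demonstrated trajectory relative to the manipulated object and replay it for a new object pose. The sketch below shows that idea in 2-D. It is loosely inspired by object-centric demonstration adaptation and is not the actual procedure of MimicGen or any other cited system; all function names and poses are illustrative assumptions.

```python
import math


def to_object_frame(point, obj_pose):
    """Express a world-frame 2-D point in the frame of an object at (x, y, theta)."""
    ox, oy, th = obj_pose
    dx, dy = point[0] - ox, point[1] - oy
    c, s = math.cos(th), math.sin(th)
    return (c * dx + s * dy, -s * dx + c * dy)


def to_world_frame(point, obj_pose):
    """Express an object-frame 2-D point in the world frame."""
    ox, oy, th = obj_pose
    c, s = math.cos(th), math.sin(th)
    return (ox + c * point[0] - s * point[1], oy + s * point[0] + c * point[1])


def adapt_demo(waypoints, seed_obj_pose, new_obj_pose):
    """Replay seed waypoints so they keep the same pose relative to the object."""
    return [
        to_world_frame(to_object_frame(p, seed_obj_pose), new_obj_pose)
        for p in waypoints
    ]


# Hypothetical seed demo: the gripper approaches an object at (1, 0) along +x.
seed_pose = (1.0, 0.0, 0.0)
seed_demo = [(0.5, 0.0), (0.9, 0.0), (1.0, 0.0)]

# The same object, moved to (0, 2) and rotated by 90 degrees.
new_pose = (0.0, 2.0, math.pi / 2)
new_demo = adapt_demo(seed_demo, seed_pose, new_pose)
for x, y in new_demo:
    print(round(x, 3), round(y, 3))
```

The adapted trajectory now approaches the object along +y and still ends at the object, which is the property that matters for the task. A real pipeline would additionally check reachability and collisions, and would validate each generated demonstration in simulation.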

Recent advances in real-to-sim-to-real transfer, which leverage digital twin construction or 3D Gaussian Splatting (e.g., RL-GSBridge [wu2025rlgsbridge3dgaussiansplatting], Real-is-Sim [abouchakra2025realissimbridgingsimtorealgap]), improve sim-to-real correspondence and support robust policy transfer in both manipulation and mobile robotics (e.g., SOUS VIDE [low2025sous], SINGER [adang2025singer], GRaD-Nav/GRaD-Nav++ [chen2025grad, chen2025grad++]).

In parallel, world models that parameterize action-conditioned environment transitions—ranging from latent-space predictors (PlaNet [hafner2019planet], Dreamer [Dreamerv1, DreamerV2], DayDreamer [wu2022daydreamerworldmodelsphysical]), to generative object-centric or physics-grounded models (e.g., FOCUS [ferraro2023focusobjectcentricworldmodels], GWM [lu2025gwm], ContactGaussian-WM [wang2026contactgaussianwmlearningphysicsgroundedworld], ParticleFormer [huang2025particleformer3dpointcloud])—aim to enable robot imagination and data-efficient continual policy improvement. Uncertainty calibration for world models is emerging as a critical requirement, with works such as [mei2025world, li2025uncertainty, ward2026foundational] highlighting the importance of statistical confidence and error detection during planning-by-imagination.
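A common proxy for a world model's confidence is disagreement within an ensemble of dynamics models: if the members predict different outcomes for the same action, the prediction should not be trusted for planning. The sketch below illustrates this under loud assumptions. The three 1-D "models" are hand-written stand-ins for learned networks, the threshold is arbitrary, and ensemble disagreement is one generic technique rather than the method of the works cited above.

```python
from statistics import mean, pstdev


def ensemble_predict(models, state, action):
    """Return the mean prediction and the spread across ensemble members."""
    preds = [m(state, action) for m in models]
    return mean(preds), pstdev(preds)


def gated_step(models, state, action, max_std):
    """Predict the next state and flag whether the ensemble agrees enough to trust it."""
    mu, sigma = ensemble_predict(models, state, action)
    return {"next_state": mu, "std": sigma, "trusted": sigma <= max_std}


# Hypothetical stand-ins for learned dynamics models: they agree for small
# actions (well covered by training data) and diverge for large ones.
models = [
    lambda s, a: s + a,
    lambda s, a: s + a + 0.05 * a ** 2,
    lambda s, a: s + a - 0.05 * a ** 2,
]

small = gated_step(models, state=0.0, action=0.5, max_std=0.05)
large = gated_step(models, state=0.0, action=3.0, max_std=0.05)
print(small["trusted"], large["trusted"])  # True False
```

A planner could use the `trusted` flag to reject imagined rollouts, fall back to a conservative controller, or request new real-world data in the region where the models disagree.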

Analytically, the authors stress that the efficacy of simulated or generated experience hinges not on visual realism but on preservation of physical variables critical for downstream control: 3D geometry, contact, force, stability, and constraints.

Four Missing Pillars for Physical Intelligence

The authors articulate a comprehensive architecture for next-generation physical intelligence, operationalized as four interfaces that unlock world-scale supervision for robotics:

Physical Data Engines for Embodied Autolabelling: Systematic ingestion and automatic structuring of heterogeneous, asynchronous physical data to recover temporally aligned event sequences, object/actor states, task phases, contacts, latent actions, and episodic outcomes. Such engines unify perception, segmentation, physical reasoning, and language grounding, enabling continual autolabeling from in-the-wild demonstrations, failures, wearables, or deployment streams.
Task-Preserving Retargeting across Embodiments: Mechanisms to map the discovered latent actions or physical event representations into the action spaces of diverse robot morphologies while preserving task-relevant causal outcomes; the goal is not mere kinematic pose-matching but the transfer of meaningful object state transitions.
Physics-Grounded World Models: Action-conditioned models that predict the physical consequences of candidate actions with appropriate uncertainty quantification, operating in structured physical state spaces (object-centric, 3D geometric, mechanics-aware) and supporting policy imagination, planning, and closed-loop error attribution.
Self-Improving Deployment Loops via Task-Conditioned Reward Grounding: Real-time closed-loop grounding of deployment outcomes; each trajectory or failure is converted into robot-usable supervision by inferring progress, rewards, failures, and task phases, with credit assignment for policy, world model, retargeting, or data engine updates.
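The way the four interfaces could compose is easier to see in a toy pipeline. The sketch below is purely illustrative: the paper describes the interfaces conceptually, and every name, signature, and the 1-D "world" here are assumptions made for the example. Each function stands in for a component that would be a learned model in practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Object-level event recovered by a (hypothetical) data engine."""
    obj: str
    start: float  # object position before the event (1-D for brevity)
    goal: float   # object position after the event


def autolabel(track: List[float], obj: str, min_move: float = 0.1) -> List[Event]:
    """Data interface: turn a raw object track into displacement events."""
    return [
        Event(obj, a, b)
        for a, b in zip(track, track[1:])
        if abs(b - a) >= min_move
    ]


def retarget(event: Event, max_step: float) -> List[float]:
    """Embodiment interface: split the displacement into steps this robot can take.

    `max_step` must be positive; it models a per-action motion limit.
    """
    remaining = event.goal - event.start
    steps = []
    while abs(remaining) > 1e-9:
        step = max(-max_step, min(max_step, remaining))
        steps.append(step)
        remaining -= step
    return steps


def world_model(position: float, step: float) -> float:
    """World-model interface: predict the object position after one action."""
    return position + step


def reward(position: float, event: Event) -> float:
    """Reward interface: fraction of the required displacement achieved, in [0, 1]."""
    total = event.goal - event.start
    return max(0.0, min(1.0, (position - event.start) / total))


# A human video (here just an object track) shows a cup moving from 0.0 to 0.5.
event = autolabel([0.0, 0.0, 0.5, 0.5], "cup")[0]

results = {}
for name, max_step in {"small_arm": 0.2, "humanoid": 0.5}.items():
    actions = retarget(event, max_step)
    pos = event.start
    for a in actions:
        pos = world_model(pos, a)  # imagine the outcome before executing
    results[name] = (len(actions), reward(pos, event))

print(results)
```

Both embodiments reach the same object-level outcome with different action sequences, which is the sense in which retargeting is task-preserving rather than pose-matching.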

Broader Implications, Contrasts, and Future Directions

A central claim of the paper is that VLAs and world models, as currently instantiated, are merely proximal layers in a physical intelligence stack. The focus must move from mere scaling of policies to the advancement of grounding architectures—those that integrate data, embodiment, physical, and reward grounding with continual cycles of supervised and self-supervised improvement.

The paper's boldest conclusions are:

The ability to utilize the world's physical experience at web scale is not constrained by model size but by the presence or absence of effective grounding mechanisms.
VLAs and large world models alone, when trained on robot-native data, will plateau unless the broader grounding interface is addressed.
The next "foundation model" for robotics is likely to be a closed-loop pipeline, not a monolithic neural network.

The implications are far-reaching:

Theorists are prompted to formalize new abstractions for temporal event alignment, latent action ontologies, and physics-based causal modeling in order to strengthen cross-embodiment generalization.
Practitioners are encouraged to invest in instrumentation and wearables for autolabelling, development of multi-modal fusion in data engines, and rigorous evaluation beyond task-completion rates.
There is an explicit challenge for future work: the creation of robust, uncertainty-aware, physics-grounded simulation and world models that scale to both diverse objects and embodiments, and reliably generate counterfactuals for longitudinal improvement.

Conclusion

The paper "Robots Need More Than VLA and World Models" (2606.06556) advances the discourse on next-generation robotics by systematically reframing the limitations of current approaches. It locates the transformative potential not in further scaling of model architectures, but in developing architectural primitives and feedback mechanisms that enable robots to ingest, interpret, and learn from the world's messy, unstructured behavioral data.

The analysis provided is highly relevant for the design of any future physical-intelligence system: a shift from robot-centric, highly curated pipelines to a grounding-centric architecture wherein all data streams—human, robot, video, simulation, language, and tactile—are transformed into actionable supervision. Such a system will realize generalist robot intelligence not by overfitting to increasingly large robot-native datasets, but by fundamentally expanding what can count as supervision for policy, model, and reward learning in the physical world.


Explain it Like I'm 14

Plain-English Summary

This paper argues that making truly smart, general-purpose robots isn’t just about training bigger “vision-language-action” (VLA) models on more robot demos. The real blockage is that the world is full of useful videos and movements (like people doing chores, factory work, or YouTube tutorials), but robots can’t directly learn from most of it. Why? Because those videos don’t say which buttons a robot should press, how hard to push, or when a task is “good enough.” The authors explain what’s missing and outline four key pieces that would let robots turn messy real-world experience into clear instructions they can learn from.

What Questions Are They Asking?

Why doesn’t simply collecting more robot demonstrations and scaling up VLA models give us truly general robots?
How can robots learn from the huge amount of human and internet video that doesn’t come with robot-specific actions or rewards?
What new tools and “interfaces” do we need so robots can understand tasks, physics, and success like humans do?

How Did They Approach It?

This is a “position and survey” paper. That means the authors:

Review lots of recent robot-learning work (big robot datasets, generalist policies, learning from human video, simulation, and learned world models).
Point out a common pain point: most data becomes useful only after it’s “grounded” (translated into a robot’s body, sensors, actions, and task goals).
Propose four missing pieces—like adapters—that convert raw real-world behavior into robot-ready learning signals.

To make the ideas easier:

Vision-Language-Action (VLA) model: like a robot brain that sees images, reads instructions, and outputs motor commands.
Grounding: turning what you see or read into concrete steps a specific robot can execute (like converting “open the door” into the right arm motions and forces).
Embodiment: each robot body is different (number of joints, grippers, size), so the same task needs different exact moves.
World model: a robot’s “mental simulator” that predicts what will happen if it takes an action (like imagining a move before doing it).
Reward: a score telling the robot if it’s getting closer to finishing the task (a progress bar for success).

What Did They Find and Why It Matters?

The authors say the bottleneck isn’t just “learn a better policy.” It’s that we’re missing mechanisms to turn the world’s messy, unlabeled behavior into the clear, robot-specific supervision that policies need. They highlight four missing components:

Data interfaces for auto-labeling unstructured behavior:
- Problem: Internet videos show how objects move and which steps matter, but they don’t include robot action commands.
- Idea: Automatically extract useful signals from video, like “this frame is halfway through the task,” or “a grasp happens here,” or “the object moved from A to B.”
Embodiment interfaces for retargeting human motion to robot actions:
- Problem: A human hand and a robot gripper are very different.
- Idea: Build translators that map “what changed in the world” (like “cup moved to shelf”) into the specific motor commands for any robot body.
World-model interfaces for physics-grounded 3D reasoning:
- Problem: Pretty video predictions aren’t enough; robots need to respect 3D geometry, contact, friction, and forces.
- Idea: Give robots reliable internal simulators that predict real physical consequences—and know when they’re uncertain—so they can plan safely.
Reward interfaces for inferring task progress and success from video and language:
- Problem: Robots often need a “score” to learn, but we rarely have hand-made rewards for every new task.
- Idea: Use video and language to estimate progress (like a task progress bar) and to decide when the task is done correctly.
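The "task progress bar" idea can be turned into a learning signal quite directly. In the minimal sketch below, the per-frame progress values are hypothetical outputs of a video-and-language progress model; the reward at each step is simply the change in estimated progress, plus a bonus the first time progress crosses a success threshold. The threshold and bonus values are arbitrary assumptions.

```python
def progress_to_rewards(progress, success_threshold=0.95, success_bonus=1.0):
    """Convert per-frame progress estimates in [0, 1] into per-step rewards."""
    rewards = []
    done = False
    for prev, cur in zip(progress, progress[1:]):
        r = cur - prev  # positive when the task advances, negative on setbacks
        if not done and cur >= success_threshold:
            r += success_bonus  # one-time bonus for completing the task
            done = True
        rewards.append(r)
    return rewards, done


# Hypothetical progress estimates for six frames; frame 3 shows a setback
# (for example, the object slipped).
progress = [0.0, 0.2, 0.5, 0.4, 0.8, 1.0]
rewards, success = progress_to_rewards(progress)
print([round(r, 2) for r in rewards], success)
```

Because the step rewards telescope, their sum equals the final progress minus the initial progress (plus the bonus), so the robot is not paid twice for redoing a step it had already completed.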

Along the way, the paper surveys three big directions and their limits:

Robot-native supervision (classic demos with action labels):
- Works well and has powered much of today’s progress, but is expensive, risky for hardware, and doesn’t scale like text/images on the internet.
Learning from weakly grounded videos (no robot actions in the data):
- Videos can teach good visual features, reveal “latent actions” (like “grasp” or “place” without exact motor commands), and give progress signals.
- But these signals still need to be grounded into a specific robot’s body to become usable actions and rewards.
Generating physical experience (simulation and learned world models):
- Simulators and synthetic demos can create lots of safe practice, and “world models” let robots imagine the future.
- The catch: generated experience must keep the important physical details (3D geometry, contact, materials) and handle uncertainty. Otherwise, plans learned in fake worlds won’t work on real robots.

The key takeaway: the next leap won’t come from just bigger VLAs. It will come from a pipeline that starts with broad real-world experience (human videos, simulations, language) and systematically grounds it into robot-ready actions, states, and rewards.

What Does This Mean for the Future?

If we build these four interfaces, robots could learn from the wider world, not just carefully collected robot demos. That would mean:

Faster learning from everyday videos and human activity.
Easier transfer of skills across different robot bodies (from small arms to humanoids).
Safer, smarter planning using physics-aware world models that know when they might be wrong.
More robust training at scale, because rewards and progress can be inferred instead of hand-coded.

In simple terms: to get truly helpful, general-purpose robots, we must teach them not only how to act, but also how to understand what they see, translate it into their own bodies, imagine consequences like a physicist, and judge progress like a coach. With these pieces in place, robots can learn from the world as it is—messy, unlabeled, and full of useful clues—rather than waiting for perfect robot-only datasets.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps that the paper identifies or implies but does not resolve, phrased to guide actionable research.

Unified autolabeling from unstructured behavior: No standardized pipelines or benchmarks exist for automatically extracting robot-usable supervision (e.g., contact events, object states, task phases, success/failure) from raw, egocentric or third-person videos and other modalities with calibrated confidence scores and error bars.
Which variables to extract from video: It remains unclear which latent variables (e.g., subgoals, contact intents, forces, object states, constraints) are most useful to extract from human/internet videos for downstream control, and how to prioritize and evaluate them.
Grounding latent actions to executable control: Methods are missing for reliably mapping learned latent “action” codes from video into embodiment-conditioned, safety-constrained robot commands across diverse kinematics/dynamics while guaranteeing real-time execution and stability.
Human-to-robot retargeting with contact and force fidelity: Robust retargeting across embodiments is unsolved when human demonstrations lack force/torque data; datasets and methods are needed to infer or inject missing contact forces, stiffness, and friction to preserve task semantics.
Cross-embodiment alignment spaces: There is no standard way to define, learn, and evaluate morphology-agnostic latent spaces that align heterogeneous robot action/state spaces (different DOF, actuation limits, sensing suites) for both manipulation and loco-manipulation.
Task-progress and reward inference that generalizes: Video- and language-derived progress/reward models lack calibration and robustness, often overfit to spurious correlations; methods are needed for uncertainty-aware, embodiment-agnostic rewards that transfer to novel tasks and viewpoints.
Evaluating the quality of inferred supervision: Benchmarks and metrics are missing for (i) autolabel accuracy (e.g., contact timing/pose error), (ii) reward usefulness (policy improvement, preference alignment), and (iii) retargeting fidelity (task success under embodiment changes).
Causal grounding vs. correlation: How to discover causal task variables and counterfactuals from video so that learned rewards/labels don’t exploit dataset artifacts remains open; practical causal evaluation protocols are lacking.
Active data engines and deployment loops: Concrete strategies are needed for uncertainty- and value-aware data acquisition that decide when to request teleoperation, run additional robot trials, or mine new videos to correct grounding errors during deployment.
World models with physically meaningful predictions: Learned models rarely preserve contact-rich dynamics, friction, deformables, and object permanence over long horizons; scalable 3D, object-centric, and material-aware models with verified physical fidelity are needed.
Uncertainty- and risk-aware world models: There is no standard for calibrating, validating, and propagating epistemic/aleatoric uncertainty in learned world models into planning/MPC, including triggers for deferring to sensing or humans under high uncertainty.
Avoiding hallucination loops in model-based control: Mechanisms to detect and prevent feedback loops where model errors cause compounding control mistakes are underdeveloped; recovery strategies and model trust metrics are required.
Hybrid neural–physics simulators at real-time rates: Practical methods to fuse neural scene representations (e.g., 3D Gaussian splats) with differentiable contact/rigid/deformable physics that run in real time on robot hardware are not established.
Sim-to-real validation of synthetic demonstrations: The field lacks ablations showing which synthetic data generation choices (trajectory perturbations, contact/dynamics fidelity, domain randomization) most influence real-world transfer, especially for contact-rich tasks.
Scalable, auto-updating digital twins: Automated pipelines to build and maintain dynamic digital twins (geometry, articulation, materials) from onboard robot sensing, with minimal manual intervention, remain immature.
Action tokenization for high-frequency control: There is no principled, general tokenization/compression scheme for high-rate continuous control that balances modelability with closed-loop stability and latency constraints, nor standardized metrics to compare tokenizers.
Multimodal grounding beyond vision-language: Integrating tactile, force-torque, and audio into autolabelling, reward inference, and world models remains largely unexplored; datasets with synchronized multimodal streams and fusion methods are needed.
Long-horizon temporal structure and options: Discovering reusable subgoals/options and stage boundaries from video, and composing them reliably in VLAs or hierarchical controllers for multi-stage tasks, lacks robust algorithms and evaluations.
Closed-loop vs. open-loop supervision: Current video-derived labels often ignore feedback; methods are needed to produce supervision that anticipates closed-loop corrections (e.g., slip recovery, compliance adjustments) usable by feedback controllers.
Continual learning under non-stationarity: How to update rewards/world models/retargeting maps online as environments and embodiments change, while preventing catastrophic forgetting and maintaining safety, is open.
Diagnosing grounding failures: Tooling to localize failure sources (perception vs. autolabels vs. embodiment mapping vs. reward vs. world model) and provide actionable debugging signals to engineers is missing.
Interfaces and APIs for the “grounding stack”: Well-defined, interoperable interfaces between data autolabelling, embodiment mapping, world models, reward models, and VLA policies—with standardized I/O formats, uncertainty representations, and latency budgets—are not yet available.
Scaling laws and data-mixture design: The field lacks quantitative scaling studies that disentangle the contributions of robot-native, human video, synthetic, and imagined experience, and prescribe optimal mixtures under resource constraints.
Robustness to occlusion and partial observability: Methods to produce reliable supervision (labels, rewards) when task-critical contacts are occluded or sensors fail are underdeveloped; principled use of priors and multimodal cues is needed.
Safety and verification before execution: Formal methods to verify inferred rewards and retargeted actions for safety constraints (joint limits, contact forces, human proximity) prior to execution are missing from the proposed pipeline.
Bias, privacy, and licensing in internet-scale videos: Governance frameworks to prevent unsafe or unethical behavior transfer, ensure consent/attribution, and audit biases in autolabels/rewards derived from public videos are not addressed.
Compute and real-time constraints: The paper does not resolve how to run the full grounding stack (autolabelling, world-model rollouts, reward inference, VLA) under tight onboard compute and latency budgets for mobile and humanoid platforms.
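Several of the gaps above call for calibrated confidence in autolabels, rewards, and world-model predictions. One standard way to quantify calibration is the expected calibration error (ECE), which compares predicted confidence with observed frequency inside confidence bins. The sketch below applies it to a hypothetical success predictor; the data are invented for illustration, and ECE is offered as one familiar metric, not one the paper prescribes.

```python
def expected_calibration_error(confidences, outcomes, n_bins=5):
    """ECE for binary outcomes: weighted mean gap between confidence and frequency."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        frequency = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - frequency)
    return ece


# Hypothetical predicted success probabilities and actual outcomes (1 = success).
confidences = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
outcomes = [1, 1, 1, 0, 0, 0]

ece = expected_calibration_error(confidences, outcomes)
print(round(ece, 4))
```

Here the predictor claims 90% confidence on episodes that succeed only 75% of the time, so the metric is non-zero. A benchmark for inferred supervision could report such a score alongside label accuracy.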


Practical Applications

Summary

This position paper argues that scaling Vision-Language-Action (VLA) policies alone will not yield generalist robot intelligence. The core bottleneck is converting abundant, unstructured physical behavior (human videos, internet video, simulation, tactile streams) into grounded robot supervision. The authors outline four missing interfaces that unlock this conversion: (1) data interfaces for auto-labeling unstructured behavior, (2) embodiment interfaces for retargeting human motion to robot actions, (3) world-model interfaces for physics-grounded 3D reasoning, and (4) reward interfaces for inferring task progress and success from video/language. Surveyed advances (robot-native datasets, learning from action-free video, simulation/data generation, learned world models, uncertainty quantification) suggest concrete applications across sectors.

Below are actionable applications, grouped by deployment horizon. Each item lists sector(s), likely tools/products/workflows, and key assumptions/dependencies affecting feasibility.

Immediate Applications

These can be piloted or deployed now using existing methods, datasets, and tooling referenced in the paper.

Application: Auto-labeling pipelines that turn human/robot videos into robot-usable supervision (progress, success, task phases)
- Sectors: manufacturing, logistics, household robotics, education; academia
- Tools/Workflows: video-language reward models and progress estimators (PROGRESSOR, Adapt2Reward, ReWiND, TimeRewarder, SARM-like stage models); phase segmentation from existing CCTV/egocentric streams; integrate with robot training loops
- Assumptions/Dependencies: sufficient video quality/coverage; minimal seed labels for calibration; robustness to embodiment/viewpoint shifts; privacy/compliance for video data
Application: Representation pretraining from human/internet video to reduce robot data collection
- Sectors: robotics startups, labs; education
- Tools/Workflows: frozen visual backbones (R3M, VIP, MVP, VC-1) plugged into imitation/RL for new tasks; quick adaptation on small real-robot datasets (e.g., BridgeData V2, DROID, Open X-Embodiment)
- Assumptions/Dependencies: domain shift handling; coverage of target objects/tasks in pretraining data; compute for pretraining/fine-tuning
Application: Latent-action tokenization as an intermediate supervision signal from video
- Sectors: software tooling for robot learning; academia
- Tools/Workflows: learn discrete latent action codes from video (LAPA, UniVLA-style) and map to robot-specific controllers with limited labeled data; use as auxiliary targets in VLA pretraining
- Assumptions/Dependencies: effective embodiment decoders; moderate amount of paired robot data per embodiment for grounding; consistency across viewpoints
Application: Synthetic demonstration generation to multiply scarce seeds
- Sectors: manufacturing, service robotics; academia
- Tools/Workflows: demonstration synthesis from few seeds (MimicGen), large-scale simulated environments (RoboCasa, RoboCasa365, ManiSkill, RLBench) to train imitation/VLA policies; task variation at scale
- Assumptions/Dependencies: sim physics adequate for contact/friction; distribution randomization; sim-to-real validation; curated object assets
Application: Real-to-sim-to-real digital twins for safe policy evaluation and iteration
- Sectors: warehousing, healthcare facilities, retail, labs; policy/safety
- Tools/Workflows: scene reconstruction with 3D Gaussian Splatting; platforms like RL-GSBridge, Real-is-Sim, RoboGSim to evaluate and stress-test policies before deployment; domain randomization and online adaptation
- Assumptions/Dependencies: accurate reconstruction of geometry/appearance; update cadence for dynamic scenes; bridging domain gaps; safety monitors on real robots
Application: Uncertainty-aware world models to gate actions and detect policy failure
- Sectors: safety-critical robotics (healthcare, autonomous mobile robots), compliance; academia
- Tools/Workflows: learned world models with calibrated uncertainty for planning/monitoring; latent uncertainty quantification (as in works cited by the paper) to veto high-risk actions; runtime anomaly detection for VLA manipulation
- Assumptions/Dependencies: calibration of uncertainty; coverage of operational distribution; conservative fallbacks and human-in-the-loop
Application: Cross-embodiment data pooling and fine-tuning for new robots
- Sectors: integrators, platform vendors; education
- Tools/Workflows: pretrain VLA policies on pooled datasets (Open X-Embodiment, RT-X, Octo) and adapt to new sensor/action spaces; action tokenization/FAST-like compression for efficient training
- Assumptions/Dependencies: consistent schemas across datasets; action/observation remapping; minimal robot-specific calibration data
Application: Teleoperation and portable demonstrations to harvest high-quality, contact-rich skills
- Sectors: contract manufacturing, field service; academia
- Tools/Workflows: low-cost teleop rigs (ALOHA) plus sequence-level IL (e.g., diffusion policies); portable demos (HuMI-style) for whole-body skills where teleop is difficult; multimodal datasets (RH20T) for force/tactile grounding
- Assumptions/Dependencies: capable teleop hardware; synchronization of multi-sensor streams; task diversity; annotating success/phase where needed
Application: Process analytics from passive videos (phase mining, success/failure, dwell times)
- Sectors: operations/industrial engineering; policy/compliance
- Tools/Workflows: task-phase segmentation and progress estimation from weakly labeled video; dashboards for continuous improvement and training needs
- Assumptions/Dependencies: definable task taxonomies; acceptance of video analytics by workforce; privacy and governance
Application: Curriculum and benchmarking in simulation for reproducible research and skill pretraining
- Sectors: academia, education; platform vendors
- Tools/Workflows: adopt Meta-World, CALVIN, LIBERO, RLBench, ManiSkill for multi-task and language-conditioned training; standardized evaluation protocols; bridges to real-robot validation
- Assumptions/Dependencies: benchmarks reflect real constraints; sim-to-real pathways; community consensus on metrics

Long-Term Applications

These require further research, scaling, integration, or standardization across the proposed four interfaces.

Application: End-to-end grounding-centric robot stack (data, embodiment, world model, reward) for generalist robots
- Sectors: household robotics, logistics, retail, hospitality, labs
- Tools/Workflows: unified “physical data engine” for auto-labeling, embodiment retargeting modules, 3D physics-grounded world models, and deployment reward loops feeding a VLA policy layer
- Assumptions/Dependencies: standardized interfaces and ontologies; reliable closed-loop learning under safety constraints; cross-vendor interoperability
Application: Internet-scale video-to-robot skill transfer
- Sectors: consumer/service robots; software
- Tools/Workflows: large latent-action vocabularies learned from in-the-wild video; embodiment decoders for many robot morphologies; continual self-supervised grounding with sparse real labels
- Assumptions/Dependencies: scalable retargeting across embodiments; dataset licensing/privacy; robust handling of occlusions/contacts missing in videos
Application: Universal interactive learned simulators for planning, training, and evaluation
- Sectors: platform vendors, autonomy stacks; academia
- Tools/Workflows: UniSim/Genie-like learned simulators combining video prediction with action-conditioned dynamics; plug-in physics priors; policy training entirely in learned environments before real deployment
- Assumptions/Dependencies: faithful 3D geometry and contact dynamics; uncertainty-calibrated predictions; safety wrappers for sim-to-real gaps
Application: World-model co-processors embedded in robot OS
- Sectors: embedded systems, robotics platforms
- Tools/Workflows: on-device 3D object-centric/point-cloud world models (ParticleFormer, PointWorld, object-centric models) for MPC, counterfactual reasoning, and failure prediction
- Assumptions/Dependencies: real-time performance on edge hardware; tight sensor fusion (RGB-D, force, tactile); online adaptation
Application: Skill marketplaces using standardized latent actions and reward models
- Sectors: software platforms, integrators, OEMs
- Tools/Workflows: publish/subscribe of skills as embodiment-agnostic latent programs with accompanying reward/progress functions; auto-compilation to specific robots
- Assumptions/Dependencies: community standards for latent action spaces; verification/certification pipelines; IP/licensing models
Application: Self-improving deployment loops (autonomous task discovery, self-labeling, and safe self-training)
- Sectors: all deployed robotics (factories, warehouses, homes)
- Tools/Workflows: robots log interaction, infer phases/rewards from onboard VLMs, generate synthetic counterfactuals with world models, and fine-tune policies under safety monitors
- Assumptions/Dependencies: reliable failure detection; uncertainty-aware planning; governance and audit trails
Application: Whole-body humanoid generalists with 3D world-model guidance
- Sectors: manufacturing, logistics, services
- Tools/Workflows: dual-system VLA action heads plus physics-grounded world models (as hinted by GR00T N1, Gemini Robotics, Helix); humanoid-aligned state encodings and cross-embodiment controllers (e.g., LeVERB, WholeBodyVLA, HEX)
- Assumptions/Dependencies: safe whole-body control; contact-rich data; robust locomotion-manipulation coordination; real-time inference
Application: Healthcare and surgical robotics learning from expert videos
- Sectors: healthcare
- Tools/Workflows: latent-action and reward inference from surgical video; sim/digital twin validation; world-model uncertainty gating; gradual autonomy with human supervision
- Assumptions/Dependencies: stringent regulation; privacy and consent; extremely high reliability and interpretability; haptics/force sensing
Application: Construction, energy, and agriculture robots trained from expert egocentric videos
- Sectors: construction, energy (inspection/maintenance), agriculture
- Tools/Workflows: body-worn camera capture for task semantics; embodiment retargeting to manipulators/mobile platforms; site-scale 3D world models for planning (Gaussian splats + physics)
- Assumptions/Dependencies: ruggedization; outdoor perception robustness; variable materials and deformables; safety and union/workforce acceptance
Application: Policy and standards for data, safety, and privacy in grounding-centric robotics
- Sectors: policy/regulation, industry consortia
- Tools/Workflows: standards for cross-embodiment datasets (Open X-Embodiment-style schemas), evaluation protocols for sim-to-real validity, requirements for uncertainty calibration and safety monitors, privacy guidelines for video-based learning
- Assumptions/Dependencies: multi-stakeholder alignment; testbeds and public benchmarks; certification bodies
Application: Consumer “show-and-tell” teaching tools
- Sectors: daily life, consumer robotics, education
- Tools/Workflows: smartphone capture of a user performing a task; auto-inferred phases/rewards; latent action program compiled to the home robot; world-model previews for user approval
- Assumptions/Dependencies: easy capture and calibration; robust embodiment retargeting; intuitive UI and safeguards; on-device or private-cloud processing

Cross-cutting Dependencies to Monitor

High-fidelity sensing (RGB-D, force/torque, tactile, audio) to bridge visual gaps in contact-rich tasks.
Efficient action tokenization/compression for high-rate control within VLA architectures.
Domain randomization and online adaptation for sim-to-real robustness; continuous digital-twin updates.
Data rights, privacy, and licensing for human/internet video; provenance tracking in training pipelines.
Compute and energy budgets for training/inference; edge acceleration for world models and VLA control.
Safety layers: calibrated uncertainty, runtime monitors, fail-safes, and human-in-the-loop oversight.
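To make the action tokenization dependency concrete, here is a minimal sketch of the simplest scheme, uniform binning of each action dimension. It is not the FAST approach mentioned earlier, which is a compression-based tokenizer; the function names, the [-1, 1] range, and the 256-bin vocabulary are assumptions chosen for illustration.

```python
from typing import List, Sequence


def tokenize(values: Sequence[float], low: float = -1.0, high: float = 1.0,
             bins: int = 256) -> List[int]:
    """Map each continuous value to an integer token in [0, bins - 1]."""
    tokens: List[int] = []
    for value in values:
        clipped = min(max(value, low), high)
        index = int((clipped - low) / (high - low) * bins)
        tokens.append(min(index, bins - 1))  # value == high lands in the top bin
    return tokens


def detokenize(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
               bins: int = 256) -> List[float]:
    """Map each token back to the centre of its bin."""
    width = (high - low) / bins
    return [low + (token + 0.5) * width for token in tokens]


if __name__ == "__main__":
    action = [-0.73, 0.001, 0.5]
    tokens = tokenize(action)
    print(tokens)
    print(detokenize(tokens))
```

The reconstruction error per dimension is at most half a bin width, here 1/256. One token per dimension per timestep becomes expensive at high control rates, which is the bottleneck that compression-based tokenizers aim to reduce.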

Glossary

3D Gaussian Splatting: A scene representation technique that models scenes as collections of optimised 3D Gaussians for photorealistic, efficient rendering and reconstruction. Example: "use 3D Gaussian Splatting and related reconstruction methods"
Action-conditioned: Describes models or predictions that take explicit action inputs to forecast future states or observations. Example: "action-conditioned world models for imagined robot experience"
Action tokenisation: Compressing continuous, high-frequency control signals into discrete or compact tokens for efficient modeling by sequence models. Example: "action-tokenisation methods address a complementary bottleneck"
Affordance-aware: Accounting for what actions are possible with objects in a scene when planning or selecting skills. Example: "affordance-aware skill selection"
Affordances: The actionable possibilities offered by objects or environments to an agent, given its capabilities. Example: "robot affordances"
Autolabelling: Automatically generating labels (e.g., actions, events, rewards) from raw, unstructured data such as videos. Example: "autolabelling unstructured behaviour"
Autoregressive dynamics model: A model that predicts the next state or observation based on past states/observations in a sequential, step-by-step fashion. Example: "an autoregressive dynamics model"
Bimanual manipulation: Coordinated control and use of two robot arms/hands to perform tasks. Example: "a diffusion-transformer architecture for bimanual manipulation"
Chain-of-thought reasoning: Structured, step-by-step intermediate reasoning used by a model to improve decision quality or disambiguate tasks. Example: "structured chain-of-thought reasoning"
Counterfactual: Hypothetical alternative outcomes or trajectories used for reasoning or data augmentation without physically executing them. Example: "counterfactual interaction data"
Cross-embodiment: Methods or datasets that span multiple robot morphologies or bodies, enabling transfer across different embodiments. Example: "cross-embodiment datasets"
Digital twin: A high-fidelity virtual replica of a real-world system used for simulation, planning, or evaluation. Example: "digital-twin simulation environments"
Diffusion models: Generative models that learn to reverse a gradual noising process and produce samples by iterative denoising, enabling synthesis of complex distributions, including action sequences. Example: "showing that diffusion models can represent multimodal action distributions"
Domain randomisation: Training over randomised simulator parameters (visual/dynamics) so policies generalise to real-world variability. Example: "Domain randomisation is one of the dominant strategies"
Egocentric: First-person, agent-perspective sensory data, often from head- or body-mounted cameras. Example: "egocentric human videos"
End effector: The terminal part of a robot arm (e.g., gripper, hand) that interacts with the environment. Example: "end-effector poses"
Flow matching: A training paradigm for generative models that learns continuous-time flows mapping simple to complex distributions. Example: "using a flow-matching architecture"
Force-torque measurement: Sensing that records forces and torques at robot joints or end effectors for contact-rich tasks. Example: "force-torque measurements"
Gaussian process: A nonparametric Bayesian model used for probabilistic regression and dynamics modeling with uncertainty estimates. Example: "Gaussian-process dynamics models"
Graph networks: Neural architectures operating on graph-structured data to model interactions among entities (e.g., objects). Example: "graph networks can simulate complex physical systems"
Hamiltonian Neural Networks: Models that learn a system’s Hamiltonian and use Hamilton’s equations to enforce energy-consistent dynamics. Example: "Hamiltonian Neural Networks learn a Hamiltonian function"
Imitation learning: Learning policies by mimicking expert demonstrations rather than relying solely on trial-and-error. Example: "imitation learning"
Inverse reinforcement learning: Inferring reward functions from observed behavior, often across different embodiments. Example: "cross-embodiment inverse reinforcement learning"
Lagrangian Neural Networks: Models that parameterise a system’s Lagrangian and derive equations of motion via the Euler–Lagrange equations. Example: "Lagrangian Neural Networks parameterise a Lagrangian"
Latent actions: Action-like abstract codes inferred from observations (e.g., video) that can later be mapped to executable robot commands. Example: "learning unified latent actions"
Latent dynamics: Compact hidden-state models that predict how the environment evolves, often learned from pixels. Example: "learning compact latent dynamics from pixels"
Loco-manipulation: Integrated control of locomotion and manipulation, typically for whole-body robots. Example: "loco-manipulation"
Model-based control: Control strategies that use explicit models of system dynamics for planning and action selection. Example: "for model-based control"
Model-based reinforcement learning: RL methods that leverage learned or known dynamics models for planning or data-efficient policy improvement. Example: "model-based reinforcement learning"
Model predictive control: An optimization-based control method that plans over a receding horizon using a predictive model. Example: "model-predictive control"
Neural scene representation: Learned 3D scene models (e.g., radiance fields, Gaussians) used for rendering, simulation, or planning. Example: "a combination of a neural scene representation with physical simulation"
Object-centric world models: World models that explicitly represent and predict dynamics of individual objects and their interactions. Example: "Object-centric world models"
Proprioceptive: Internal sensing of a robot’s own state (e.g., joint angles, velocities) used for control. Example: "proprioceptive states"
Real-to-sim-to-real: Pipelines that reconstruct real scenes into simulation, train or refine in sim, then deploy back to the real world. Example: "a 3D-Gaussian-Splatting-based real-to-sim-to-real reinforcement-learning framework"
Reinforcement learning: Learning control policies via trial-and-error interaction guided by rewards. Example: "reinforcement learning"
Retargeting: Mapping human motions or demonstrations to robot-specific action spaces while preserving task semantics. Example: "retargeting human motion to robot actions"
State-space models: Sequence models that carry a recurrent hidden state and often scale linearly with sequence length, designed for efficient long-context reasoning and control. Example: "state-space-model architectures"
Symplectic structure: A geometric property of Hamiltonian systems preserved by specific integrators/models for stable long-horizon physics. Example: "enforce Hamiltonian or symplectic structure"
Teleoperation: Human-operated remote control of robots, often used to collect demonstrations. Example: "low-cost teleoperation systems"
Time-contrastive learning: A self-supervised approach that learns representations by contrasting temporally nearby vs. distant frames. Example: "using time-contrastive learning"
Uncertainty quantification: Estimating the confidence or calibration of model predictions to support safe planning and control. Example: "uncertainty quantification for a world model"
Video-LLMs: Multimodal models that jointly process video and text for understanding tasks, progress, or rewards. Example: "video-LLMs"
Vision-Language-Action (VLA): Models that map visual inputs and language instructions to robot actions. Example: "vision-language-action models"
VQ-VAE: Vector-quantised variational autoencoder that discretises latent spaces for learning compact codebooks. Example: "VQ-VAE-style objective"
World models: Learned predictive models of environment dynamics used for imagination, planning, or policy learning. Example: "world models"
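Several of the terms above (world models, action-conditioned prediction, model predictive control) fit together in a single planning loop, sketched below. The example is deliberately tiny and rests on stated assumptions: the `step` function is a hand-written one-dimensional stand-in for a learned world model, the planner is random shooting (one simple MPC variant among many), and the horizon, sample count, and cost are arbitrary.

```python
import random
from typing import List


def step(x: float, u: float, dt: float = 0.1) -> float:
    """Toy action-conditioned dynamics: position x moves by velocity command u."""
    return x + u * dt


def rollout_cost(x: float, actions: List[float], goal: float) -> float:
    """Imagine a trajectory with the model and sum squared distance to goal."""
    cost = 0.0
    for u in actions:
        x = step(x, u)
        cost += (x - goal) ** 2
    return cost


def mpc_action(x: float, goal: float, rng: random.Random,
               horizon: int = 5, n_samples: int = 64) -> float:
    """Random-shooting MPC: sample action sequences, keep the cheapest,
    and return only its first action (receding horizon)."""
    best_cost = float("inf")
    best_first = 0.0
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(x, actions, goal)
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first


def run(x0: float = 0.0, goal: float = 1.0, steps: int = 40,
        seed: int = 0) -> float:
    """Apply the planner in closed loop and return the final position."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x = step(x, mpc_action(x, goal, rng))
    return x


if __name__ == "__main__":
    print("final position:", run())
```

Replanning at every step is what lets the controller correct for model error, which is why uncertainty quantification for the world model matters when the toy dynamics are replaced by a learned one.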
