World Action Models: The Next Frontier in Embodied AI
Abstract: Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
Explain it Like I'm 14
Overview
This paper is about teaching robots to not only react to what they see right now, but also to “imagine” what will happen next and choose actions accordingly. The authors call this new kind of robot brain a World Action Model (WAM). It combines two abilities:
- predicting how the world will change in the next moments, and
- deciding what the robot should do.
The paper is a survey: it doesn’t propose a single new model. Instead, it maps and explains the fast‑growing research area around WAMs, compares different designs, and points out challenges and opportunities.
Key goals and questions
The authors set out to do four main things:
- Define what a World Action Model is, and how it’s different from older ideas like VLA policies (which map “what I see + instruction” straight to actions) and world models (which only predict future images or states).
- Organize the many recent methods into a clear “family tree” so people can see how they’re similar or different.
- Review the data that trains these models (like robot demonstrations, simulation, and everyday human videos).
- Summarize how researchers test these models and what’s still missing in today’s evaluations.
In simple terms: What exactly are WAMs? How are people building them? What data do they learn from? How do we know if they’re any good?
How the paper approaches the topic
Because this is a survey, the “method” is to read, sort, define, and compare many papers and systems. The authors:
- Give a clear definition: A WAM should predict future world states (for example, the next camera frames or a compact “future state” representation) and produce actions that match those predicted futures.
- Build a taxonomy (a structured classification):
  - Cascaded WAMs: first “imagine” the future (like predicting a short video of what will happen) and then choose actions based on that imagined future. Think of this as a two-step pipeline: imagine → act.
  - Joint WAMs: predict futures and actions together inside one unified model. Think of this as a single engine that both plans and decides at the same time.
- Explain modeling styles with everyday analogies:
  - Autoregressive generation: like writing a story one word at a time, predicting the next bit based on the past.
  - Diffusion-based generation: like starting with a noisy, blurry picture and gradually sharpening it until a clear future scene appears.
- Review training data sources:
  - Robot teleoperation (an expert controlling a robot while recording what to do),
  - “Portable” human demonstrations (people wearing sensors or cameras to capture how they perform tasks, then transferring that knowledge to robots),
  - Simulation (virtual worlds where robots can practice safely and cheaply),
  - Internet-scale egocentric video (huge amounts of first-person videos that show how the real world behaves).
- Group evaluation methods by what they measure:
  - Visual fidelity: are predicted futures visually realistic?
  - Physical commonsense: do predictions obey everyday physics (e.g., cups don’t pass through tables, objects fall when unsupported)?
  - Action plausibility: do the chosen actions actually look like they would solve the task?
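The Cascaded versus Joint distinction can be made concrete with a toy discrete example. The sketch below is an illustration, not code from the survey: it fixes one observation and instruction, writes down a made-up joint table over two future states and two actions, and checks that the cascaded view (future first, then action given that future) rebuilds the same joint by the chain rule. The state names, action names, and probabilities are all invented.

```python
# Toy joint distribution p(o', a | o, l) for one fixed (o, l).
# All names and numbers are hypothetical, chosen only for illustration.
joint = {
    ("cup_lifted", "grasp"): 0.56,
    ("cup_lifted", "push"): 0.04,
    ("cup_moved", "grasp"): 0.08,
    ("cup_moved", "push"): 0.32,
}


def future_marginal(table):
    """p(o' | o, l): sum the joint over actions."""
    marginal = {}
    for (future, _action), prob in table.items():
        marginal[future] = marginal.get(future, 0.0) + prob
    return marginal


def action_given_future(table, marginal):
    """p(a | o', o, l): divide the joint by the future marginal."""
    return {(f, a): prob / marginal[f] for (f, a), prob in table.items()}


def cascaded_joint(marginal, conditional):
    """Cascaded view: imagine the future, then pick an action for it."""
    return {(f, a): marginal[f] * prob for (f, a), prob in conditional.items()}


p_future = future_marginal(joint)
p_action = action_given_future(joint, p_future)
rebuilt = cascaded_joint(p_future, p_action)

for key in joint:
    print(key, round(joint[key], 4), round(rebuilt[key], 4))
```

On paper the two views describe the same distribution. The practical differences the survey discusses arise because, in a cascaded system, the two factors are learned as separate modules and can drift apart.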
Main findings and why they matter
Here are the big takeaways, explained plainly:
- What’s new about WAMs: Older VLA models are great at understanding instructions and matching them to actions, but they’re mostly reactive—they don’t explicitly think ahead. WAMs add the missing “what happens next?” layer. That forward thinking helps robots handle new situations better, because they can foresee consequences rather than just copy past actions.
- Two major design paths:
  - Cascaded WAMs are modular and easier to build piece-by-piece (first predict future, then act), but the two parts can drift apart (good predicted video doesn’t always mean good actions).
  - Joint WAMs tightly couple future prediction and action choice, which can lead to better consistency, but they’re often more complex and compute-hungry.
- Generation choices have trade-offs:
  - Autoregressive models are straightforward and flexible but can accumulate small mistakes over time (like a story that goes off track).
  - Diffusion models capture multiple possible futures and keep long sequences more consistent, but they’re slower and heavier to run.
- Data is the fuel, and there is more of it now:
  - WAMs can learn from videos that don’t include explicit robot actions, because predicting “what happens next” doesn’t always need action labels. That means we can use huge public video datasets to teach robots about the world’s physics and everyday patterns.
- Current tests are incomplete:
  - Many benchmarks check if the predicted video looks sharp, but fewer directly test physics understanding or whether the chosen robot motions would actually work on real tasks. The community needs better, more holistic evaluation.
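The point about autoregressive models accumulating small mistakes can be illustrated with a toy rollout. In this minimal sketch, the "true" world and the "learned" model are both made-up one-dimensional linear rules that differ only slightly; nothing here comes from the survey. When the model is fed its own predictions step after step, the gap to the truth keeps growing, even though each single-step prediction from a true state is only slightly off.

```python
def true_step(x):
    """Hypothetical ground-truth dynamics."""
    return 0.9 * x + 1.0


def model_step(x):
    """Hypothetical learned model with a slightly wrong coefficient."""
    return 0.92 * x + 1.0


def rollout(step, x0, horizon):
    """Apply `step` repeatedly, feeding each output back in as the next input."""
    states = [x0]
    for _ in range(horizon):
        states.append(step(states[-1]))
    return states


def open_loop_errors(x0, horizon):
    """Error when the model rolls forward on its own predictions."""
    truth = rollout(true_step, x0, horizon)
    imagined = rollout(model_step, x0, horizon)
    return [abs(t - m) for t, m in zip(truth, imagined)]


def one_step_errors(x0, horizon):
    """Error when the model always starts from the true current state."""
    truth = rollout(true_step, x0, horizon)
    return [abs(true_step(x) - model_step(x)) for x in truth[:-1]]


compounding = open_loop_errors(x0=5.0, horizon=10)
single_step = one_step_errors(x0=5.0, horizon=10)
print("open-loop error per step:", [round(e, 3) for e in compounding])
print("one-step error per step: ", [round(e, 3) for e in single_step])
```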
Why this matters: If robots can imagine the near future and choose actions that fit that future, they’ll be more reliable in messy, real environments—like picking up a slippery glass, folding laundry, or navigating a crowded kitchen—without needing to see the exact same situation during training.
Implications and potential impact
- Smarter, safer robots: WAMs can reduce trial-and-error in the real world by “mentally rehearsing” outcomes first, making robots safer and more efficient.
- Better generalization: With a sense of physics and future prediction, robots can adapt to new objects and tasks they haven’t seen before.
- Larger, cheaper training data: By learning from everyday videos (not just expensive robot demos), robots can gain broader world knowledge much faster.
- New research directions: The paper highlights open challenges—like making models faster, aligning imagined futures with precise control, using touch and 3D sensors, and creating better tests—that will guide future breakthroughs.
In short, this survey argues that the next step for embodied AI is to give robots “foresight.” By uniting imagination (predicting the world) with decision-making (choosing actions), World Action Models aim to make robots more capable, reliable, and ready for the real world.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of unresolved issues the paper highlights or implies across architecture, data, evaluation, and deployment for World Action Models (WAMs).
Architectural design and learning dynamics
- Lack of principled criteria for choosing between Cascaded vs. Joint WAMs: When does explicit factorization $p(o', a \mid o, l) = p(o' \mid o, l)\, p(a \mid o', o, l)$ outperform direct joint modeling, and how does coupling strength affect sample efficiency, robustness, and control success?
- Insufficient ablations quantifying “predictive commitment”: How much explicit future-state prediction (pixels, flow, latent JEPA-like targets) is necessary to improve control versus merely using video-model backbones without world-model supervision?
- Unclear trade-offs among generation modalities: When do autoregressive, diffusion-based, and JEPA-style predictive objectives yield better long-horizon control, stability, and planning performance under identical data/compute budgets?
- Action decoding ambiguity: Comparative evidence is missing on discrete tokenization vs. diffusion-based continuous action heads vs. inverse-dynamics decoders from predicted futures, especially under real robot latency constraints and contact-rich dynamics.
- Causality vs. correlation in learned dynamics: How to ensure WAMs learn interventionally correct models (counterfactual consistency under different actions) rather than correlational predictors from passively observed videos?
- Uncertainty modeling and risk-aware control: How should WAMs quantify and propagate epistemic/aleatoric uncertainty from future-state predictions into action selection and safety constraints?
- Memory and long-horizon compositionality: What architectures best handle skill sequencing, temporal credit assignment, and hierarchical planning within WAMs (e.g., options, subgoal prediction, or multi-timescale latents)?
- Bridging language-conditioned and action-conditioned futures: How to integrate high-level linguistic goals with low-level action-conditioned state transitions in a single model without mode collapse or misalignment?
- Physics priors and structured inductive biases: Open design space for integrating differentiable physics, object-centric representations, or 3D geometry to improve contact modeling, deformables, and liquid dynamics in WAMs.
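One way to picture the uncertainty question above is to let an ensemble of predictors vote and treat their disagreement as a rough stand-in for epistemic uncertainty. The sketch below is a toy under stated assumptions: the five-member ensemble, its gains, the one-dimensional state, and the penalty weight are all hypothetical, and ensemble disagreement is a swapped-in technique rather than something the survey prescribes.

```python
from statistics import mean, pvariance

# Hypothetical ensemble: each member predicts the next position from
# (position, action). Members disagree more for larger actions, mimicking
# a model that has seen little data for aggressive motions.
ENSEMBLE_GAINS = [1.00, 0.95, 1.05, 0.90, 1.10]


def predict_all(position, action):
    """One next-position prediction per ensemble member."""
    return [position + gain * action for gain in ENSEMBLE_GAINS]


def risk_aware_score(position, action, goal, penalty):
    """Expected distance to goal plus a penalty on ensemble disagreement."""
    predictions = predict_all(position, action)
    expected_cost = abs(mean(predictions) - goal)
    disagreement = pvariance(predictions)  # proxy for epistemic uncertainty
    return expected_cost + penalty * disagreement


def choose_action(position, goal, candidates, penalty):
    """Pick the candidate action with the lowest risk-aware score."""
    return min(candidates, key=lambda a: risk_aware_score(position, a, goal, penalty))


candidates = [2.0, 4.0, 6.0, 8.0, 10.0]
greedy = choose_action(0.0, 10.0, candidates, penalty=0.0)
cautious = choose_action(0.0, 10.0, candidates, penalty=30.0)
print("ignoring uncertainty:", greedy)
print("penalizing uncertainty:", cautious)
```

With no penalty the selector jumps straight to the goal; with a large penalty it prefers a smaller step where the ensemble members agree more closely.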
Training data and supervision
- Action supervision scarcity in internet-scale video: Robust, scalable methods for inferring latent actions and calibrating them to robot actuation spaces remain under-validated, particularly for precision manipulation.
- Morphology and embodiment gaps: Systematic approaches are needed to transfer from human egocentric videos to heterogeneous robots (kinematics, actuation limits, compliance) and quantify residual gaps post-transfer.
- Multimodal sensing underrepresented: Tactile, force/torque, audio, and proprioception are sparsely integrated; standardized datasets and alignments for multimodal futures and actions are lacking.
- 3D/4D state representation coverage: Few datasets provide consistent multi-view, depth, or 3D point/mesh annotations paired with actions; benchmarks for 4D-consistent prediction and control are immature.
- Data governance and reproducibility: Heavy reliance on closed-source video models and proprietary datasets hinders apples-to-apples comparisons and reproducibility; licensing, privacy, and safety curation protocols are not standardized.
- Mobile manipulation and long-horizon tasks: Training corpora with realistic navigation-plus-manipulation, scene rearrangement, and multi-room tasks with paired action signals remain limited.
Evaluation and benchmarking
- Misalignment between video metrics and control success: Current visual fidelity metrics (PSNR/SSIM/LPIPS/FVD) only partially reflect physical correctness or action utility; standardized control-centric metrics for future-prediction usefulness are missing.
- Inadequate physical commonsense tests for intervention: Existing physics evaluations rarely probe counterfactual action-conditioned predictions (same scene, different actions); need causal intervention benchmarks with ground-truth outcomes.
- Plausibility-to-executability gap: Limited protocols to test whether predicted futures are not only plausible but also reachable by the robot under dynamics and constraints; need executability and feasibility checks tied to control performance.
- Cross-paradigm head-to-head comparisons: Few studies control for data, compute, and tasks when comparing Cascaded vs. Joint, AR vs. diffusion, and explicit vs. implicit representations on the same manipulation suites.
- Real-world, on-robot evaluation standardization: Benchmarks that assess latency, success rates, safety events, and recovery under disturbances in the real world are sparse and inconsistent across labs.
- OOD and robustness testing: Systematic stress tests for distribution shift (novel objects, lighting, surfaces), adversarial prompts, and sensor failures for WAMs are not yet established.
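The gap between video metrics and physical correctness can be seen in a contrived example using PSNR, one of the visual fidelity metrics listed above. The sketch below computes the standard PSNR formula on a tiny one-dimensional "frame" invented for illustration: a prediction that drops the object entirely scores higher than one that keeps the object in the right place but is uniformly too bright. The frame contents and pixel values are assumptions.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical ground-truth frame: dark background, one dim object pixel.
truth = [0.0] * 16
truth[10] = 60.0

# Prediction A: object in the right place, whole frame 20 levels too bright.
too_bright = [value + 20.0 for value in truth]

# Prediction B: perfect background, but the object has vanished.
object_missing = [0.0] * 16

print("object kept, too bright:", round(psnr(too_bright, truth), 2), "dB")
print("object missing:         ", round(psnr(object_missing, truth), 2), "dB")
```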
Integration with reinforcement learning and planning
- Stable learning with learned dynamics: How to mitigate compounding model errors, representation drift, and off-policy bias when using WAMs as surrogate environments for RL at scale?
- Reward modeling from generative predictors: Converting likelihoods or diffusion scores into calibrated, non-exploitable reward signals remains under-theorized and empirically fragile (reward hacking, miscalibration).
- Planning with predicted futures: Best practices for MPC/trajectory optimization using video/latent futures (cost functions, horizon, uncertainty penalties) and how to couple with policy distillation are not standardized.
- Data-efficient on-robot improvement: Sample-efficient methods to refine WAMs online with limited real interaction while maintaining safety and preventing catastrophic forgetting are underexplored.
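Planning with predicted futures, as raised above, is often introduced through random-shooting model predictive control: sample candidate action sequences, roll each through the model, score the imagined outcomes, execute only the first action of the best sequence, and replan. The sketch below is a minimal toy, assuming a hypothetical one-dimensional `learned_model` that stands in for a real world model and a squared-distance cost; the horizon and sample count are arbitrary.

```python
import random


def learned_model(state, action):
    """Stand-in for a learned world model: 1-D position, step clipped to [-1, 1]."""
    return state + max(-1.0, min(1.0, action))


def rollout_cost(state, actions, goal):
    """Sum of squared distances to the goal along the imagined rollout."""
    cost = 0.0
    for action in actions:
        state = learned_model(state, action)
        cost += (state - goal) ** 2
    return cost


def plan_first_action(state, goal, rng, horizon=5, num_samples=200):
    """Random shooting: return the first action of the cheapest sampled sequence."""
    best_actions, best_cost = None, float("inf")
    for _ in range(num_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, actions, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions[0]


def run_mpc(start, goal, steps=20, seed=0):
    """Replan at every step; here the 'real' world is assumed equal to the model."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = plan_first_action(state, goal, rng)
        state = learned_model(state, action)
    return state


final_state = run_mpc(start=0.0, goal=3.0)
print("final state:", round(final_state, 3))
```

In a real system the executed step would come from the robot rather than the model, which is where the compounding-error and uncertainty-penalty questions in this section come in.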
Systems, performance, and deployment
- Real-time constraints: Diffusion and large transformer backbones incur high latency and energy costs; concrete methods for acceleration (distillation, caching, sparse sampling, anytime inference) and their control trade-offs need validation.
- Safety and verification: Lack of formal safety guarantees or runtime monitors that can veto actions when predicted futures violate constraints; frameworks for conformance testing of WAMs before deployment are immature.
- Edge deployment and hardware co-design: Guidelines for mapping WAM inference to onboard accelerators and cameras (multi-view synchronization, bandwidth limits) are scarce; no standardized reference stacks.
- Failure diagnosis and interpretability: Tooling to localize whether failures stem from state prediction, action decoding, or perception bottlenecks is limited; need interpretable intermediates and debugging protocols.
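The runtime monitor idea above, vetoing an action when its predicted future violates a constraint, can be sketched in a few lines. This is a hypothetical illustration: the one-dimensional state, the workspace bounds, the per-step motion limit, and the fallback action are all assumptions, and a real monitor would operate on much richer predictions.

```python
def violates_constraints(predicted_states, workspace=(0.0, 1.0), max_step=0.2):
    """Return a reason string if the imagined rollout breaks a constraint, else None."""
    low, high = workspace
    for index, state in enumerate(predicted_states):
        if not (low <= state <= high):
            return f"state {index} leaves workspace"
        if index > 0 and abs(state - predicted_states[index - 1]) > max_step:
            return f"step {index} too large"
    return None


def gate_action(action, predicted_states, fallback=0.0):
    """Pass the action through if its predicted future is safe, else substitute a fallback."""
    reason = violates_constraints(predicted_states)
    if reason is None:
        return action, "approved"
    return fallback, f"vetoed: {reason}"


print(gate_action(0.1, [0.5, 0.6, 0.7]))
print(gate_action(0.3, [0.5, 0.9]))
print(gate_action(0.2, [0.9, 1.05]))
```

Returning the reason alongside the decision also helps with the failure-diagnosis gap listed above, since a veto can be traced to a specific predicted state.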
Theory and foundations
- Formal definition completeness: The proposed $p(o', a \mid o, l)$ framing leaves open how to characterize optimality, identifiability, and minimal sufficient predictive targets for control performance guarantees.
- Generalization theory: Little theoretical grounding connects world-prediction accuracy (in pixels or latents) to bounds on control regret/success under distribution shift; need task-relevant generalization metrics.
- Causal validation: Methods to empirically verify that learned transition models support correct counterfactual reasoning (do-calculus style tests) remain to be developed for embodied settings.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed with today’s components (as surveyed), using off‑the‑shelf VLA policies, pretrained video/world models, and existing datasets/benchmarks.
- Visual foresight for robot execution and supervision
- Sectors: robotics (manufacturing, logistics, home), healthcare (non-critical assistive tasks), education
- Tools/Products/Workflows: Cascaded WAMs that render short video rollouts of anticipated futures before executing actions; human-in-the-loop “preview and approve” panels on robot HMIs; plug-ins that wrap video diffusion backbones to visualize plan candidates from existing VLAs
- Assumptions/Dependencies: Access to calibrated cameras and scene geometry; inference GPUs for video prediction; matching training data domain; latency tolerances for short-horizon preview
- Synthetic demonstrations to boost imitation learning
- Sectors: robotics (factory assembly, pick-and-place, lab automation), education/research
- Tools/Products/Workflows: World-model-based data synthesizers (e.g., Ctrl-World/RoboScape-style pipelines) to generate or augment task demonstrations; filtering heuristics for action plausibility and physics consistency; fine-tuning existing VLAs on filtered synthetic demos
- Assumptions/Dependencies: Quality of world model’s physical commonsense; domain similarity to target deployment; mechanisms to down‑weight unrealistic generations
- Reward modeling without hand-engineered rewards
- Sectors: robotics RL (process optimization, mobile manipulation)
- Tools/Products/Workflows: Reward plug-ins that derive scalar signals from generative consistency (e.g., diffusion likelihoods/entropy or goal-video alignment) to drive policy optimization in simulation or world-model rollouts
- Assumptions/Dependencies: Strong pretrained generative world models covering task distribution; careful calibration to avoid reward hacking; compute for iterative denoising or likelihood estimation
- In-silico policy evaluation using data-driven simulators
- Sectors: industry QA/safety, academia (benchmarking, ablations), policy (internal compliance checks)
- Tools/Products/Workflows: World-model-based “virtual test tracks” (e.g., WorldEval/WorldGym-style setups) for reproducible, low-risk testing of policies and instructions; batch evaluation of edge cases before on-robot trials
- Assumptions/Dependencies: Scenario coverage and fidelity of the learned world model; process alignment with real-world acceptance tests; mechanisms to quantify sim2real gap
- Portable human demonstrations (UMI-style) for training
- Sectors: consumer robotics, assistive/home robots, education
- Tools/Products/Workflows: Low-cost teleop/demo capture rigs to collect human manipulation data at scale; pipelines to convert human videos into latent actions and train WAMs/VLAs
- Assumptions/Dependencies: Privacy and consent for data collection; embodiment mapping (human-to-robot morphology); robust pose/hand-keypoint extraction
- Cross-embodiment skill transfer from internet/egocentric video
- Sectors: robotics (household tasks, hospitality, retail)
- Tools/Products/Workflows: Latent-action inference from unlabeled videos to pretrain dynamics; fine-tune on a small amount of robot data for transfer
- Assumptions/Dependencies: Coverage of target tasks in internet-scale video; reliable latent action discovery; safety gating during transfer
- Multimodal sensing fusion for contact-sensitive tasks
- Sectors: precision manufacturing/assembly, lab automation, prosthetics research
- Tools/Products/Workflows: WAMs that integrate depth and tactile tokens (visuo‑tactile world models) for better force-aware planning; upgrading data pipelines to include tactile streams
- Assumptions/Dependencies: Availability of tactile/force sensors and synchronized capture; datasets with multimodal alignment; added inference bandwidth
- Predictive teleoperation assistance
- Sectors: hazardous environment handling (nuclear, offshore), space robotics, remote inspection
- Tools/Products/Workflows: Operator UIs that show forecasted next frames and action consequences; suggestion overlays from WAMs to reduce operator workload
- Assumptions/Dependencies: Robust network latency management; trust calibration with operators; conservative safety constraints for suggestions
- Research and curriculum deployment
- Sectors: academia, workforce upskilling (robotics programs)
- Tools/Products/Workflows: Adoption of the taxonomy (Cascaded vs Joint, explicit vs implicit state representations) to structure labs; use of curated datasets and benchmarks from the survey (e.g., LIBERO, ManiSkill, RoboCasa)
- Assumptions/Dependencies: Access to open-source checkpoints, datasets, and the Awesome-WAM repository; course infrastructure for compute
- Software products and platforms
- Sectors: software/tools
- Tools/Products/Workflows: “WAM-as-a-service” APIs for imagination/previews; reward-modeling SDKs; evaluation harnesses that wrap learned world models for CI/CD of robotics stacks
- Assumptions/Dependencies: Clear SLAs around fidelity/latency; licensing for underlying models and datasets; observability for failure cases
- Internal policy and safety guidelines for predictive evaluation
- Sectors: policy/compliance within organizations deploying robots
- Tools/Products/Workflows: Checklists that require world-model-based virtual tests and physical commonsense checks before live trials; incident review workflows that replay what-if counterfactuals inside the world model
- Assumptions/Dependencies: Organizational buy-in; alignment with regional safety standards; documented traceability of virtual-to-real test results
Long-Term Applications
These use cases aim at broader deployment and scale and will require advances in model fidelity, long-horizon reasoning, real-time performance, data coverage, and governance.
- Generalist household and service robots with predictive foresight
- Sectors: consumer robotics, hospitality, eldercare
- Tools/Products/Workflows: Joint WAMs (diffusion- or autoregressive-based) that co-generate future states and actions for long-horizon tasks; on-device uncertainty estimation and replanning
- Assumptions/Dependencies: Reliable long-horizon prediction under distribution shifts; strong safety/fail-safe mechanisms; efficient on-robot inference or hardware acceleration
- Mobile manipulation in dynamic environments
- Sectors: warehouses, hospitals, retail restocking
- Tools/Products/Workflows: Unified-stream joint WAM backbones combining navigation and manipulation; closed-loop planners that simulate multi-step futures before committing
- Assumptions/Dependencies: Robust multi-view and 3D consistency; online adaptation to changing layouts; real-time constraints
- Cross-embodiment, cross-domain learning from internet-scale videos
- Sectors: robotics at scale, education
- Tools/Products/Workflows: Training WAMs on vast egocentric corpora for physics priors; large-scale latent action discovery; systematic pipelines for human-to-robot mapping
- Assumptions/Dependencies: Data licensing/privacy; morphology-aware alignment methods; bias and safety auditing at internet scale
- Visuo-tactile world models for fine manipulation and prosthetics
- Sectors: healthcare (prosthetics, surgical assistance), advanced manufacturing
- Tools/Products/Workflows: WAMs that unify visual, tactile, and proprioceptive futures for sub-millimeter control; simulators that include contact/deformation dynamics
- Assumptions/Dependencies: High-quality tactile datasets; sensors integrated into end-effectors; precise synchronization and calibration
- Digital twins with physics-aware WAM agents
- Sectors: energy (plant operations), construction, utilities, semiconductor fabs
- Tools/Products/Workflows: WAM-driven agents embedded in facility digital twins to test procedures, schedule maintenance, and simulate interventions before field execution
- Assumptions/Dependencies: Accurate CAD/BIM alignment to sensor views; scalable multi-view 4D generation; integration with enterprise asset systems
- Reward modeling as an alignment layer for autonomous systems
- Sectors: policy/governance, industry autonomy
- Tools/Products/Workflows: Generative world models as “judges” to assess trajectory alignment with goals and constraints; learned rewards audited for safety and fairness
- Assumptions/Dependencies: Techniques to prevent specious likelihoods/reward hacking; transparent interpretability; standardized audits
- Regulatory certification using world-model-based tests
- Sectors: policy/regulation
- Tools/Products/Workflows: Standardized benchmarks for physical commonsense and action plausibility as part of certification; test suites that stress long-horizon safety-critical scenarios in silico
- Assumptions/Dependencies: Cross-industry agreement on metrics; validated links between virtual and real-world risk; governance for dataset biases
- Human–robot collaboration with predictive intent visualization
- Sectors: manufacturing, logistics, healthcare support
- Tools/Products/Workflows: Interfaces where robots display predicted futures (trajectories/video) to communicate intent and improve trust and coordination
- Assumptions/Dependencies: Human factors validation; occlusion- and privacy-aware rendering; fast on-the-fly predictions
- Edge-capable WAMs via efficiency advances
- Sectors: embedded/edge robotics, drones
- Tools/Products/Workflows: Distilled or flow-matching-based video models, sparse/low-rank transformers, accelerated sampling to meet edge latency/power budgets
- Assumptions/Dependencies: Hardware support (NPUs/GPUs); accuracy–efficiency trade-off acceptable for safety; robust fallback behaviors
- Bimanual and humanoid manipulation with predictive control
- Sectors: general-purpose humanoids, complex assembly
- Tools/Products/Workflows: Multi-stream joint WAMs capturing coordination; training on large bimanual/humanoid datasets; predictive contact planning
- Assumptions/Dependencies: Stable whole-body control; rich training data of coordinated behaviors; sophisticated safety monitors
- Multi-agent and social dynamics modeling
- Sectors: public-facing robots (retail, transport hubs), smart buildings
- Tools/Products/Workflows: WAMs that model other agents’ trajectories and social norms to plan safe, socially compliant actions
- Assumptions/Dependencies: Datasets with dense human–robot interaction; normative policy design; privacy-sensitive perception
- Robust sim-to-real bridging with self-calibrating world models
- Sectors: all robotics domains
- Tools/Products/Workflows: Self-supervised adaptation where WAMs align simulated and real distributions online; uncertainty-aware planning that hedges model errors
- Assumptions/Dependencies: Continual learning without catastrophic forgetting; reliable uncertainty estimates; guardrails for distribution shifts
- Disaster response and field robotics with foresight under uncertainty
- Sectors: emergency response, mining, offshore
- Tools/Products/Workflows: WAMs for hypothesis rollouts in partially observed, rapidly changing scenes; policy selection by outcome plausibility
- Assumptions/Dependencies: Training on rare/chaotic dynamics; robust sensing in extreme conditions; high-tolerance hardware
- Virtual labs and education at scale
- Sectors: education, workforce development
- Tools/Products/Workflows: Classroom-accessible WAM simulators for experimentation with policy learning, reward modeling, and evaluation without physical robots
- Assumptions/Dependencies: Open, lightweight world-model checkpoints; cloud compute credits or on-prem clusters; curricular materials
- Game/animation engines with physics-aware character control
- Sectors: software, entertainment
- Tools/Products/Workflows: WAMs that predict plausible scene futures to drive interactive characters and scene physics in real time
- Assumptions/Dependencies: Latency-optimized inference; tools integration (engine plug-ins); acceptable trade-offs between realism and speed
These applications leverage the survey’s core insights: unify predictive world modeling with action generation (Cascaded or Joint), exploit varied data sources (teleoperation, portable human demos, simulation, internet-scale egocentric video), and evaluate along visual fidelity, physical commonsense, and action plausibility. Feasibility depends on data quality, compute budgets, safe deployment practices, and continued progress on long-horizon prediction, multimodal fusion, and standardized evaluation.
Glossary
- 3D point-flow representation: A representation that models scene states and robot actions as points with flows in 3D space for unified dynamics and control. "a 3D point-flow representation."
- Action Chunking: A technique in which the policy predicts a short sequence (chunk) of low-level actions at once rather than one action per step, which smooths execution and improves temporal consistency. "Action Chunking and Temporal Ensembling"
- Action-Conditioned World Models: World models that predict future observations conditioned on current state and agent actions. "Action-conditioned world models describe how an environment evolves in response to the agent's actions"
- Autoregressive Tokenization: Treating actions as discrete tokens generated sequentially with an autoregressive model. "Autoregressive Tokenization, which treats actions as discrete linguistic tokens generated sequentially"
- Autoregressive video world models: Generative world models that predict future video tokens frame by frame in an autoregressive manner. "Autoregressive video world models represent an important paradigm of generative world models."
- Cascaded WAM: A WAM architecture that first predicts future states and then derives actions aligned with those predictions. "Cascaded WAM explicitly factorizes the objective, formally p(o', a | o, l) = p(a | o',o,l)p(o' | o,l)"
- Cross-attention Mechanisms: Attention modules where one modality (e.g., language) attends to another (e.g., vision) to guide representation fusion. "Cross-attention Mechanisms, enabling dynamic interaction between task prompts and visual tokens"
- Dense optical flow: Per-pixel motion field estimation between frames to capture detailed visual dynamics. "dense optical flow"
- Diffusion-based Synthesis: Using diffusion models to generate continuous action distributions for control. "Diffusion-based Synthesis, which attaches a generative action expert to the VLM"
- Diffusion-based video world models: World models that use diffusion processes to model a distribution over possible future visual observations. "Diffusion-based video world models extend generative world models by explicitly modeling the distribution over possible future observations."
- Egocentric video: First-person viewpoint video captured from a camera worn by the actor. "internet-scale egocentric video"
- FiLM-based layers: Feature-wise Linear Modulation layers that condition visual features with language embeddings. "FiLM-based layers to condition visual features on language embeddings"
- Forward Predictive Modeling: The requirement that a model explicitly forecasts future environment states as part of its reasoning. "Forward Predictive Modeling: The model must forecast the physical evolution of the environment"
- Implicit Latent-space Dynamics Models: Models that learn and predict environment dynamics in a compact latent space rather than pixel space. "Implicit Latent-space Dynamics Models"
- JEPA (Joint-Embedding Predictive Architecture): A paradigm that predicts target embeddings from context embeddings to learn abstract, predictive representations. "JEPA [299], short for joint-embedding predictive architecture, provides a general paradigm for learning predictive representations in an abstract embedding space."
- Large Vision-Language Models (LVLMs): Large pretrained models that jointly process visual and textual data to provide strong semantic priors. "Large Vision-Language Models (LVLMs)"
- Latent action model: A learned module that infers action variables from video-only data without action labels. "latent action model to infer action variables from unlabeled video clips"
- Model-based reinforcement learning: An RL approach that uses a learned model of environment dynamics for planning or policy learning. "model-based reinforcement learning"
- Proprioceptive signals: Internal sensor readings (e.g., joint angles, velocities, forces) that describe the agent’s own state. "proprioceptive signals"
- Recurrent State-Space Model (RSSM): A latent dynamics architecture combining deterministic and stochastic components for sequence prediction and planning. "Recurrent State-Space Model (RSSM)"
- Sim-to-real transfer: Transferring policies trained in simulation to real-world robotic systems. "sim-to-real transfer"
- Transformer State-Space Model (TSSM): A transformer-based latent dynamics model for predicting state evolution over time. "Transformer State-Space Model (TSSM)"
- Unified Stream: A backbone design that processes multiple modalities or streams within a single unified architecture. "Unified Stream and Multi-Stream backbones."
- Variational Autoencoders (VAEs): Probabilistic autoencoders that learn compressed latent representations via variational inference. "variational autoencoders (VAEs) to compress images from pixel space into a latent space"
- Video Policies: Policies that inherit or leverage video-generation backbones to extract spatiotemporal representations for action prediction. "Video policies often refer to models defined by their structural heritage using generative video architectures (e.g., Diffusion Transformers) as a backbone"
- Vision-Language-Action (VLA) models: Embodied foundation models that map visual observations and language instructions to robot actions. "Vision-Language-Action (VLA) models are a class of embodied foundation models"
- World Action Models (WAMs): Embodied models that jointly model future states and actions by unifying world dynamics prediction with action generation. "World Action Models (WAMs)"
- World Models (WM): Predictive models that capture environment transition dynamics to simulate future observations under actions. "World Models (WM) are defined as predictive transition functions that internalize the causal dynamics of the physical environment."
- Zero-shot generalization: The ability to perform tasks or adapt to scenarios without task-specific training examples. "zero-shot generalization"
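The glossary entry on action chunking and temporal ensembling can be illustrated with a small sketch. It assumes the policy emits a chunk of future actions at every step, so several overlapping chunks contain a prediction for the same timestep, and those predictions are blended with a weighted average. The exponential weighting with the oldest prediction weighted most is one common choice and is assumed here rather than taken from the survey; the chunk values are made up.

```python
import math


def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction made for timestep t.

    chunks[s] is the list of actions predicted at step s for steps s, s+1, ...
    Weights are exp(-m * i), with i = 0 for the oldest prediction (an assumption).
    """
    predictions = []
    for s, chunk in enumerate(chunks):  # oldest chunk first
        offset = t - s
        if 0 <= offset < len(chunk):
            predictions.append(chunk[offset])
    if not predictions:
        raise ValueError(f"no chunk covers timestep {t}")
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    weighted_sum = sum(w * p for w, p in zip(weights, predictions))
    return weighted_sum / sum(weights)


# Hypothetical chunks of length 3 predicted at steps 0, 1, and 2.
chunks = [
    [0.0, 1.0, 2.0],  # predicted at step 0 for steps 0, 1, 2
    [1.2, 2.2, 3.2],  # predicted at step 1 for steps 1, 2, 3
    [2.4, 3.4, 4.4],  # predicted at step 2 for steps 2, 3, 4
]

print("plain average at t=2:   ", round(temporal_ensemble(chunks, 2, m=0.0), 3))
print("weighted average at t=2:", round(temporal_ensemble(chunks, 2, m=0.5), 3))
```

Setting `m` to zero gives a plain average of the overlapping predictions; larger values lean more heavily on the oldest one.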