Orca: The World is in Your Mind
Abstract: We introduce Orca, an initial instantiation of a general world foundation model. Orca learns a unified world latent space from multimodal world signals and exposes it through multimodal readout interfaces. Rather than optimizing isolated next-token, next-frame, or next-action prediction, Orca centers on Next-State-Prediction modeling, offering a unified state-transition modeling route toward understanding, predicting, and acting upon the world. Orca learns through two complementary paradigms: unconscious learning captures dense natural state transitions from continuous videos, and conscious learning models sparse meaningful state transitions through language-described events and VQA supervision. For pre-training, we construct a large-scale world-learning data inventory, including 125K hours of video data and 160M event annotations. After pre-training, Orca has learned a unified world latent space. To examine whether the learned latent supports downstream tasks, we evaluate it through three representative downstream readouts: text generation, image prediction, and embodied action generation. During this evaluation, Orca's backbone is frozen, and only the lightweight modality-specific decoders are trainable. Experiments show the scalability of the proposed paradigm and verify that a stronger world latent enables stronger downstream readouts. Orca outperforms similar-sized specialized baselines. These results show that Orca, as a general world foundation model, presents a promising approach to understanding, predicting, and acting upon the world. Finally, we discuss the current limitations, aiming to provide useful insights and inspiration for the community.
Explain it Like I'm 14
Overview
This paper introduces Orca, a new kind of AI model that tries to learn how the world works, not just how to finish sentences or make pictures. Think of Orca as building a hidden “mental map” of the world. This map helps it understand what’s happening now, predict what will happen next, and decide what to do. Orca learns from videos and language, then shows what it knows by:
- writing text (explaining what it sees),
- predicting images (what a scene will look like next),
- and generating robot actions (what to do).
What questions does the paper ask?
The paper asks two big, simple questions:
- If we give Orca more data and make it bigger, does it keep getting better at learning the world’s patterns?
- If Orca’s inner “world map” becomes stronger, does that help it do better on different tasks like writing, predicting images, and controlling robots?
How does Orca learn? (Methods in simple terms)
Orca learns in two complementary ways, a bit like how people do:
- Unconscious learning: like watching the world go by. Orca watches lots of videos and tries to predict what the next moment will look like. This teaches it natural changes over time—how objects move, how scenes change, and what’s physically likely.
- Conscious learning: like following instructions or reading explanations. Orca gets short text descriptions of key “events” in a video (for example, “the robot grasps the cup” or “the door opens”) and learns to connect those instructions with the changes it sees. It also answers questions about videos (Visual Question Answering, or VQA) to build common sense.
Here’s how the system is put together:
- Encoder (the “listener” and “thinker”): It takes in images/video frames and text, and builds the hidden world state—an internal representation, or “latent space,” that captures what’s going on and how it can change.
- Decoders (the “speakers”): Small, task-specific parts that turn the hidden state into outputs. There are different decoders for text, images, and actions.
Important detail: After Orca learns this inner world state, the researchers “freeze” it (they stop changing the encoder). Then they only train the small decoders. This tests whether the learned world state is truly general and useful for many tasks, not just one.
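The freeze-then-read-out protocol described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: `frozen_encoder` is a hypothetical stand-in for the pretrained backbone, and a linear readout stands in for a lightweight decoder. The point it shows is that only the readout parameters are ever updated.

```python
def frozen_encoder(x):
    """Hypothetical stand-in for the pretrained backbone.

    It has no trainable parameters here, which mimics a frozen encoder:
    the mapping from input to latent never changes during readout training.
    """
    return [x, x * x]  # toy 2-dimensional "latent"


def train_readout(data, lr=0.05, epochs=2000):
    """Fit a linear readout (w, b) on top of the frozen latent with plain SGD."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = frozen_encoder(x)               # latent from the frozen backbone
            pred = w[0] * z[0] + w[1] * z[1] + b
            err = pred - y                      # gradient of 0.5 * err**2 w.r.t. pred
            # Update readout parameters only; the encoder is untouched.
            w = [w[i] - lr * err * z[i] for i in range(2)]
            b -= lr * err
    return w, b


# Toy task: the target is a function the latent can express (y = 3x^2 - x + 0.5).
data = [(x, 3 * x * x - x + 0.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train_readout(data)
print("readout weights:", w, "bias:", b)
```

If the latent already carries the information a task needs, a small readout is enough to recover it. That is the logic behind using frozen-backbone performance as a test of the latent's quality.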
Data and training at a glance:
- Orca pre-trains on a huge collection of real-world videos (125,000 hours planned; this version uses about one-tenth), 160 million event annotations, and 11.5 million video–question pairs.
- It optimizes three goals at once: predict the next state from just video, predict event-guided next states with text, and answer questions about videos.
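As a rough sketch of how three goals can be optimized at once, the per-objective losses can be combined into one weighted sum. The weight names (`w_video`, `w_event`, `w_vqa`) and their default values are hypothetical, and the mean-squared-error latent loss is an assumption made only for illustration; this summary does not state which latent loss or weights the paper uses.

```python
def latent_mse(pred, target):
    """Mean squared error between a predicted and a target latent vector.

    MSE is an assumed choice for this sketch, not a detail taken from the paper.
    """
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(l_video, l_event, l_vqa, w_video=1.0, w_event=1.0, w_vqa=1.0):
    """Weighted sum of the three pre-training objectives.

    l_video: next-state loss from video only ("unconscious" learning)
    l_event: event-conditioned next-state loss ("conscious" learning)
    l_vqa:   text loss on video question answering
    """
    return w_video * l_video + w_event * l_event + w_vqa * l_vqa


# Example with made-up numbers: two latent losses plus a VQA text loss.
l_video = latent_mse([0.2, 0.4], [0.0, 0.4])
l_event = latent_mse([1.0, 1.0], [1.0, 0.0])
print(total_loss(l_video, l_event, l_vqa=0.7, w_event=0.5))
```

Changing the weights shifts how much each kind of supervision shapes the shared latent.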
What did they find? (Main results)
The authors report four key findings:
- Orca scales well
  - As they used more data and larger models, Orca’s learning kept improving (the training loss kept going down). This suggests the approach is solid and keeps benefiting from scale.
- A stronger “world state” boosts many tasks
  - When Orca’s core was better trained, its performance improved on all three outputs (text, images, and actions), even though the core was frozen and only tiny decoders were trained per task. This shows the inner world map is genuinely useful.
- Better text understanding and reasoning
  - On several tests that measure understanding of time, motion, spatial relations, and common sense, Orca matched or beat other models of similar size. It especially improved on:
    - Predicting how states change over time (state transitions),
    - Reasoning about cause and effect (commonsense and counterfactuals),
    - Keeping motion consistent over multiple steps.
- More grounded image prediction and robot control
  - Image prediction: On a real-world benchmark where the task is to predict what a scene will look like after an interaction, Orca made more realistic, instruction-following predictions than other image models (fewer hallucinations, better object consistency).
  - Robot actions: In real robot tests with new settings and new objects (out-of-domain), Orca helped produce action plans that made steady progress, got stuck less, and recovered better from mistakes than strong baselines, even though Orca’s pre-training did not include action labels. That’s a big deal: it hints that learning from videos can transfer to robot control and reduce the need for tons of costly robot data.
Why is this important? (Implications)
- One model, many skills: By learning a single inner “world map,” Orca can support very different tasks—writing, predicting images, and controlling robots—just by adding small decoders. This is a step toward general-purpose AI that understands and acts in the real world.
- Stronger foundations, easier adaptation: Because the core stays frozen and only tiny decoders are trained, it becomes easier and cheaper to adapt the system to new tasks.
- Fewer labels, more real-world learning: Orca learns a lot from unlabeled videos (just watching the world), and only uses language where it matters (events and Q&A). This could make building powerful AI systems more practical and less dependent on hand-labeled data.
- Toward safer, more reliable behavior: The ability to predict what comes next, follow instructions, maintain consistency, and recover from errors is exactly what we need for AI that operates in messy, real-world environments.
A quick note on limitations
- Early version: Orca currently focuses on vision and language; other signals (like audio, force, or touch) are future work.
- Not tuned for record-breaking scores: The decoders are intentionally lightweight to test transfer from the core, not to chase perfect benchmark results.
- More data ahead: The team has more data they haven’t used yet; performance may improve further with future iterations.
Bottom line
Orca is a first step toward a “world model” that learns a shared internal representation of how things change. By combining “just watching” with “learning from instructions,” it builds a hidden world state that transfers well to text, images, and robot actions. As the model and data scale up, all these abilities get better—suggesting a promising path toward AI that can understand, predict, and act in the real world.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of unresolved issues, uncertainties, and unexplored directions identified in the paper. Each point is phrased to be concrete and actionable for follow-up research.
- Modeling and objectives
- The state-transition formulation claims support for arbitrary temporal offsets, but training and evaluation appear to focus on adjacent frames or adjacent events; the model’s ability to predict long-horizon, non-adjacent future/past states remains untested.
- Uncertainty in state transitions is not explicitly modeled; the approach supervises deterministic latent targets without a clear mechanism for multi-modal futures or aleatoric/epistemic uncertainty.
- The latent “teacher-forcing” objective matches predicted vision latents to encoder latents, but the exact loss (e.g., L2, cosine, contrastive) and its implications for representation geometry, stability, and calibration are not specified.
- The role and capacity of the “two-layer MLP” predictor in the encoder are underspecified; it is unclear whether this bottleneck limits the richness of learned dynamics or long-range dependencies.
- The claimed causal competence (e.g., “task intentions,” “causal premises”) is not evaluated with causal identification/intervention benchmarks; no experiments disentangle correlation vs. causation or test counterfactual consistency beyond standard VQA/QA tasks.
- The interplay and relative importance of “unconscious” (video-only) vs. “conscious” (language-conditioned) learning are not quantified; ablations on loss weights are referenced but not presented.
- The model assumes a unified latent for both forward and backward transitions, but there is no analysis of time-reversal consistency or whether a single latent is sufficient for both directions.
- Representation properties and interpretability
- The structure of the learned world latent (dimensionality, invariances/equivariance, disentanglement, compositionality) is not characterized; there is no probing of whether the latent encodes objects, dynamics, forces, or contact states.
- No interpretability or diagnostic tools are applied to examine how specific latent dimensions relate to physical properties or task-relevant factors (e.g., contact, affordances, friction).
- Stability of the latent under viewpoint changes, lighting, occlusions, and visual perturbations is not systematically evaluated.
- Modalities and signals
- Despite framing as a “world model,” pre-training currently uses only vision and language; no integration of audio, force/tactile, proprioception, depth, or other physical signals is attempted or evaluated.
- The model’s ability to incorporate proprioception and action signals into the state (beyond using proprioception downstream in the action expert) is untested, limiting validation of embodied state estimation.
- Data and supervision
- Event segmentation and annotation quality are not described in detail (method, inter-annotator agreement, error rates); the model’s sensitivity to noisy or ambiguous event captions remains unknown.
- The data mixture, frame rates, resolutions, and domain balance are unspecified; there is no analysis of how each data subset (egocentric, exocentric, robot, natural dynamics) contributes to performance.
- Only one-tenth of the video corpus is used in this version; the scaling behavior with the full dataset (and potential saturation) remains open.
- Potential data leakage or overlap between pre-training videos and evaluation settings (especially PRICE-V0.1 tasks/contexts) is not ruled out.
- Training, compute, and scalability
- Scaling evidence is limited to two model sizes (0.8B and 4B) and partial data usage; the presence of classical scaling laws (and transition to compute/data-limited regimes) is not established.
- Training compute, wall-clock, and energy usage are not reported; the compute/data efficiency of the paradigm vs. alternatives (e.g., pixel-level or contrastive video objectives) is unclear.
- The impact of freezing the vision encoder (for latent supervision) on adaptivity and representation drift is not explored; end-to-end vs. frozen encoder trade-offs remain unstudied.
- Evaluation: text/readout
- Text evaluation aggregates heterogeneous benchmarks (MVBench, TemporalBench, 3DSRBench, SWITCH) without task-specific analyses of failure modes; adversarial or distribution-shifted prompts are not tested.
- The comparison set lacks ablations that control for backbone differences (e.g., same VLM backbone with/without Orca objectives) beyond a single Qwen3.5 baseline.
- Evaluation: image prediction (PRICE-V0.1)
- The image prediction evaluation relies primarily on LLM-as-a-judge scoring; sensitivity to judge model choice, prompt phrasing, and bias is not quantified (despite large score variance across judges).
- There is no evaluation with standard, objective visual forecasting metrics (e.g., FVD, LPIPS, temporal consistency) or task-specific affordance/contact consistency measures.
- Only single-frame prediction is discussed; multi-step rollouts, compounding error analysis, and stability over longer horizons are not evaluated.
- The degree of domain overlap between training videos and PRICE-V0.1 scenes/tasks is unclear; cross-environment generalization beyond the collected benchmark is untested.
- Evaluation: action generation
- Action results are limited to five tasks on a single robot platform; generalization across robots, grippers, control modalities, and sensor suites is unexplored.
- The approach relies on a separate DiT-based Action Expert trained from scratch with only 200 trajectories/task; the contribution of the world latent vs. the action model capacity is not isolated beyond a few baselines.
- Closed-loop deployment properties (latency, control frequency, robustness to perception errors, and failure recovery strategies) are not quantified; safety and intervention protocols are not described.
- Long-horizon task execution and hierarchical planning using the latent are not evaluated; it is unknown whether the latent supports planning beyond immediate next-state conditioning.
- Success rates remain relatively low even as “near-success” metrics improve; the failure cases are not categorized to identify systematic weaknesses (e.g., contact reasoning, grasp stability, trajectory smoothness).
- Fairness and baseline comparability
- Baseline parity is imperfect: some baselines (e.g., π0.5) are pre-trained on large robot datasets while Orca’s action expert is trained from scratch, so these comparisons, and any conclusions about latent quality drawn from them, may be confounded.
- Comparable ablations where all methods share identical decoders, data regimes, and parameter budgets (including larger backbone sizes) are missing.
- Robustness, safety, and ethics
- Safety considerations for real-world robot control using learned latents are not addressed (e.g., safety monitors, collision checks, fail-safes, human-in-the-loop).
- Biases in pre-training data and resulting downstream behaviors are not analyzed; the impact of biased language/event annotations on state transition predictions is unknown.
- Environmental and social costs (compute footprint, data governance, privacy of video sources) are not discussed.
- Reproducibility and release
- Many critical details are deferred to appendices (some absent in the provided text), and some formulae appear incomplete; end-to-end reproducibility (code, weights, data splits, PRICE-V0.1 release) is not fully documented within the paper.
- Hyperparameters for pre-training (sampling ratios, loss weights, query token initialization), event segmentation pipelines, and readout training schedules are insufficiently specified for replication.
- Future extensions and open research directions
- How to extend the state to include additional modalities (audio, tactile, force/torque, depth) and whether unified latent learning improves downstream embodied tasks remains an open question.
- Can the latent support explicit object-centric or physics-informed representations (e.g., contact graphs, dynamics parameters), and does this improve transfer to robotics/planning?
- What is the best way to incorporate temporal abstraction (options/events) and memory into the latent to support long-horizon reasoning and planning?
- Does joint training of encoder and readouts (vs. freezing) yield substantially better performance, and what are the trade-offs in generality and overfitting?
- Can the system perform counterfactual predictions and evaluate causal consequences of hypothetical actions in a grounded, measurable way?
- How stable is the learned latent under distribution shift (novel objects, textures, lighting, clutter) and adversarial perturbations, and how can robustness be improved?
Practical Applications
Immediate Applications
The following use cases can be prototyped or deployed today by leveraging Orca’s frozen world-encoder with lightweight readouts and the documented training/inference workflows.
- Predictive anomaly detection from video — Sectors: manufacturing, energy, logistics, security
- What it does: Use Orca’s latent to predict next-state visuals and compare against observed frames; large deviations flag equipment faults, process drifts, or safety risks.
- Potential tools/workflows: Frozen Orca encoder + image readout (MLP adaptor + SD3.5/other diffusion backends); residual scoring dashboards; camera-based monitoring.
- Assumptions/dependencies: Adequate camera coverage and synchronization; tolerance for LLM-judge biases if used for evaluation; compute for real-time inference; rights to process video data.
- Data-efficient robot skill learning — Sectors: robotics (manufacturing, warehousing, service)
- What it does: Train a DiT-based Action Expert on top of the frozen Orca latent using only 100–200 trajectories per task; exhibits stronger OOD progress and recovery than vision-language baselines.
- Potential tools/workflows: Orca encoder + MLP adaptor + DiT Action Expert (flow matching); ROS2 integration; small-scale teleop/kinesthetic demos; rule-based/PRM-as-a-Judge evaluation.
- Assumptions/dependencies: Reliable proprioception and time-aligned video; safety interlocks; domain-specific calibration; legal/safety reviews for production robots.
- Anticipatory AR guidance for procedures — Sectors: education, field service, consumer “how-to”
- What it does: Provide step-by-step visual or textual guidance by predicting the next state of an ongoing task and answering “what happens if” queries.
- Potential tools/workflows: Mobile AR app with on-device/offloaded Orca encoder; text readout via LM head; optional image readout for visual overlays.
- Assumptions/dependencies: Latency constraints for user experience; robust tracking; privacy compliance for user-captured video.
- Video-centric tutoring and training — Sectors: education, enterprise L&D, creator tools
- What it does: Explain causal chains and temporal steps in demonstrations (e.g., lab experiments, tool usage), answer VQA about processes, and visualize future states.
- Potential tools/workflows: Orca text readout for Q&A and summaries; image readout for “future-state” visualizations; PRICE-V0.1-style evaluation prompts for quality control.
- Assumptions/dependencies: Domain-specific evaluation and content QA; avoiding overreliance on automated judges; curation of representative training clips.
- Event-centric video indexing and search — Sectors: media, enterprise knowledge, surveillance
- What it does: Segment continuous footage into meaningful events, index by causal/temporal descriptors, and support queries like “find when the clamp disengaged before the jam.”
- Potential tools/workflows: Conscious learning head for event-conditioned latent extraction; embedding store over event latents; retrieval APIs.
- Assumptions/dependencies: Event annotation bootstrapping (semi-automatic); storage/compute for large video corpora; privacy and compliance.
- Predictive human–robot collaboration safety — Sectors: manufacturing, healthcare support, service robots
- What it does: Anticipate near-term human motion from video and adjust robot plans or slow zones accordingly to reduce close calls.
- Potential tools/workflows: Orca encoder + light action-readout controller integration with safety PLCs; conservative “predict-then-brake” logic.
- Assumptions/dependencies: Conservative thresholds to avoid nuisance stops; calibrated perception; adherence to ISO/ANSI robot safety standards.
- Simulation-lite pretraining for control — Sectors: robotics research, autonomy R&D
- What it does: Use video-only pretraining to bootstrap policy learning (IL/RL), reducing dependence on expensive simulators and action-labeled corpora.
- Potential tools/workflows: Frozen Orca encoder as feature extractor in RL/IL pipelines; adapters for policy heads; offline datasets of teleop videos.
- Assumptions/dependencies: Domain gap between pretraining and deployment environments; reward shaping or task-specific heads still required.
- Infrastructure acceleration for multimodal training — Sectors: AI/ML platforms, academia
- What it does: Adopt FlagScale-based FSDP2 sharding, chunked cross-entropy, recomputation, and comm prefetching to achieve ~4.4× throughput gains.
- Potential tools/workflows: Integrate Orca’s training optimizations into existing VLM/VLA training stacks.
- Assumptions/dependencies: Engineering effort for adoption; cluster networking performance; correctness and stability checks.
- Benchmarking next-state prediction — Sectors: academia, evaluation vendors, policy testing
- What it does: Use PRICE-V0.1 and the four-dimension capability breakdown (state transition, commonsense, spatial, dynamics) to evaluate models on real-world interaction prediction.
- Potential tools/workflows: Evaluation prompts; multi-judge aggregation (Gemini, GPT, Gemma, etc.); leaderboards and reproducible scripts.
- Assumptions/dependencies: LLM-judge variance and bias; need for periodic human audits; licensing for evaluator models.
- Visual forensics and continuity checking — Sectors: media integrity, compliance, insurance
- What it does: Detect unnatural or tampered transitions by comparing predicted versus observed latents/images across frames in high-value footage.
- Potential tools/workflows: Batch inference pipelines; anomaly scoring; human-in-the-loop review dashboards.
- Assumptions/dependencies: False positive management; controlled capture conditions improve reliability.
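For the predictive anomaly detection use case listed above, the residual scoring step might look like the following minimal sketch. It assumes predicted and observed latents are already available as plain vectors, and it uses a mean-plus-k-standard-deviations threshold calibrated on normal footage. The function names and the threshold rule are illustrative assumptions, not part of Orca.

```python
import math
import statistics


def residual(predicted, observed):
    """Euclidean distance between a predicted and an observed latent vector."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))


def flag_anomalies(residuals, calibration, k=3.0):
    """Return indices of frames whose residual exceeds mean + k * stdev.

    `calibration` holds residuals collected during known-normal operation
    (at least two values are needed to estimate a standard deviation).
    """
    mu = statistics.mean(calibration)
    sigma = statistics.stdev(calibration)
    threshold = mu + k * sigma
    return [i for i, r in enumerate(residuals) if r > threshold]


# Made-up numbers: residuals near 1.0 are normal, frame 2 deviates strongly.
calibration = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
live = [1.0, 1.02, 5.0, 0.98]
print(flag_anomalies(live, calibration))
```

A larger `k` reduces nuisance alerts at the cost of missing subtle drifts, which is the false-positive trade-off noted in several of the assumptions above.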
Long-Term Applications
These applications require additional research, broader modality coverage (e.g., tactile, force, audio), larger/cleaner datasets, tighter safety verification, or productization work.
- General-purpose household robots with OOD robustness — Sectors: consumer robotics, eldercare
- What it could do: Perform diverse chores with minimal per-task demonstrations, recover from errors, and adapt to new layouts/objects guided by language.
- Dependencies: Rich multimodal signals (vision, force, tactile), long-horizon planning in latent space, strong on-device inference, rigorous safety.
- Autonomous driving prediction and planning — Sectors: automotive, mobility
- What it could do: Unified next-state latent for forecasting agents, planning, and counterfactual “what-if” maneuvers under language-specified goals.
- Dependencies: Sensor fusion (LiDAR, radar), real-time guarantees, large-scale driving corpora, regulatory certification.
- Digital twins with counterfactual reasoning — Sectors: manufacturing, smart cities, energy
- What it could do: Maintain a live latent of system state across cameras and sensors; run language-conditioned “what-if” transitions for operational planning and hazard analysis.
- Dependencies: Multisensor integration, standardized interfaces, calibration across sites, governance for intervention decisions.
- Assistive AR with intent prediction and hazard warnings — Sectors: industrial safety, healthcare, construction
- What it could do: Predict user/task state and proactively warn of hazards or missteps; visualize safe next states.
- Dependencies: Ultra-low latency edge inference; robust user intent models; certification for safety-critical guidance.
- Scientific discovery via modality-extended world models — Sectors: materials, biology, astronomy, microscopy
- What it could do: Model state transitions in complex systems by ingesting non-visual modalities (spectra, force, fluorescence), enabling causal queries and hypothesis testing.
- Dependencies: Domain-specific sensors and labels; interpretable latent probes; collaboration with scientists for validation.
- Emergency response and resilience planning — Sectors: public safety, disaster management
- What it could do: Anticipate structural failures, fire spread, or crowd flows from video feeds and language-conditioned scenarios; propose actions.
- Dependencies: Physics-aware priors, calibrated uncertainty, integration with command-and-control systems, ethical safeguards.
- Agricultural automation with predictive handling — Sectors: agri-tech
- What it could do: Predict crop/fruit dynamics under manipulation; plan gentle grasping and harvesting strategies with few demonstrations.
- Dependencies: Seasonal/domain shifts, tactile integration, robust outdoor perception.
- Sports analytics and coaching — Sectors: sports tech, media
- What it could do: Forecast player motion/play evolution and evaluate counterfactual strategies; generate prescriptive feedback.
- Dependencies: High-quality tracking data; fairness and privacy considerations; latency for live use.
- Healthcare motion forecasting and assistive robotics — Sectors: healthcare, rehabilitation
- What it could do: Predict patient motion (e.g., fall risk), assist rehabilitation robots, and plan safer handoffs.
- Dependencies: Clinical validation, privacy, regulatory approvals, bias and robustness audits.
- Security and behavior anticipation with strict governance — Sectors: security, transportation, retail
- What it could do: Anticipate potentially risky behaviors in dense environments for early interventions.
- Dependencies: Strong policy and oversight due to bias risks; explainability; opt-in and privacy frameworks.
- Orca SDK and ecosystem — Sectors: software, robotics, developer tools
- What it could do: Provide a unified encoder API with plug-in readouts (text/image/action), event segmentation tools, and connectors (ROS2, Unity/Unreal).
- Dependencies: Stable APIs, licensing for third-party decoders (e.g., SD3.5), community datasets and benchmarks, documentation and support.
- Model-based planning and control in latent space — Sectors: robotics, industrial control
- What it could do: Multi-step rollouts under language/task conditions for MPC or model-based RL; closed-loop controllers that reason over next-state latents.
- Dependencies: Calibrated predictive uncertainty; long-horizon credit assignment; integration with safety shields.
- Policy and standards for next-state world models — Sectors: governance, standards bodies
- What it could do: Define evaluation/benchmarking protocols (e.g., next-state reliability, recovery metrics), data governance for massive video/event corpora, and guardrails for action-capable models.
- Dependencies: Multistakeholder coordination, public datasets with transparent provenance, periodic audits and red-teaming.
Notes on feasibility across applications:
- Data quality/coverage: Many applications rely on large, diverse, and compliant video datasets and high-quality event annotations.
- Multimodal expansion: Achieving robust performance in physical interaction often requires force/tactile/audio beyond vision/language.
- Safety and regulation: Action-generating systems must adhere to industry-specific safety standards and undergo rigorous validation.
- Compute and latency: Real-time deployments (AR, HRC, autonomy) demand optimized inference pipelines and potentially edge accelerators.
- Evaluation bias: Automated LLM-based judges are useful for rapid iteration but require human oversight, especially in safety-critical contexts.
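One simple way to keep judge bias visible is to aggregate scores across several judges and route high-disagreement samples to a human reviewer. The sketch below assumes each judge returns one numeric score per sample. The judge names, the score scale, and the spread threshold are hypothetical.

```python
import statistics


def aggregate_judges(scores_by_judge):
    """Combine per-sample scores from several judges.

    scores_by_judge maps a judge name to a list of scores, one per sample,
    with all lists in the same sample order. Returns one dict per sample
    holding the mean score and the spread (max minus min) across judges.
    """
    n_samples = len(next(iter(scores_by_judge.values())))
    per_sample = []
    for i in range(n_samples):
        vals = [scores[i] for scores in scores_by_judge.values()]
        per_sample.append({"mean": statistics.mean(vals),
                           "spread": max(vals) - min(vals)})
    return per_sample


def needs_human_audit(per_sample, max_spread):
    """Indices of samples where judges disagree by more than max_spread."""
    return [i for i, s in enumerate(per_sample) if s["spread"] > max_spread]


# Made-up scores on a 1-5 scale from three hypothetical judges.
scores = {"judge_a": [4, 2, 5], "judge_b": [4, 5, 5], "judge_c": [4, 1, 4]}
summary = aggregate_judges(scores)
print(summary)
print(needs_human_audit(summary, max_spread=2))
```

Reporting the spread next to the mean makes judge variance part of the result instead of hiding it inside an average.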
Glossary
- Action Expert: A learned action-generation module conditioned on latent states, used to produce robot control trajectories. "The Action Expert is a DiT-based model with flow-matching loss, and it is trained from scratch."
- activation recomputation: A memory-saving training technique that recomputes intermediate activations during backpropagation instead of storing them. "further apply activation recomputation to trade moderate computation overhead for substantial memory savings"
- all-gather communication: A distributed training operation that gathers shards of parameters or activations across devices. "overlap FSDP all-gather communication with computation"
- Chunked Cross-Entropy Loss: A memory-efficient loss computation that avoids materializing full logits by processing them in chunks. "We adopt Chunked Cross-Entropy Loss to avoid materializing full logits during loss computation"
- conscious learning: A paradigm that learns sparse, meaningful state transitions guided by explicit language conditions or instructions. "Conscious learning aims to learn meaningful and sparse state transitions under the constraints of instructions."
- counterfactual reasoning: Inferring outcomes under hypothetical or alternative conditions, used to assess causal understanding. "Orca achieves more reliable common-sense reasoning and counterfactual reasoning through causal alignment of conscious learning."
- DiT-based: Refers to models based on Diffusion Transformers for generative processes. "The Action Expert is a DiT-based model with flow-matching loss"
- ego-centric interaction: First-person viewpoint data capturing interactions from the actor’s perspective. "Ego-centric interaction captures first-views experience during physical interaction"
- embodied action: Physical actions executed by a robot or agent within the environment. "records embodied action in robotic environments"
- event-conditioned state transition: Predicting state changes conditioned on a language-described event or instruction. "2) event-conditioned state transition"
- exo-centric manipulation: Third-person viewpoint data focusing on object-centered changes during manipulation. "exo-centric manipulation provides third-views of object-centered changes"
- FlagScale: A distributed training framework used to scale and optimize model training. "We use FlagScale and rebuild the Orca training with FSDP2"
- flow-matching loss: A training objective for generative models that learns to match probability flows, often used in diffusion-like models. "The Action Expert is a DiT-based model with flow-matching loss"
- forward/backward pre-fetching: A scheduling strategy that overlaps communication and computation by pre-loading needed data for forward and backward passes. "We introduce forward/backward pre-fetching to overlap FSDP all-gather communication with computation"
- FSDP2: A version of Fully Sharded Data Parallel that shards model parameters for memory-efficient distributed training. "with FSDP2, enabling more flexible parameter sharding"
- Gaussian noise: Random noise sampled from a normal distribution, commonly added during denoising-based image training. "The ground truth image with Gaussian noise is fed into another path of SD3.5 through a frozen VAE."
- learnable query vectors: Trainable tokens inserted into model inputs to extract or predict specific latent representations. "implemented through a set of learnable query vectors"
- LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that updates low-rank adapters instead of full model weights. "During this module training, only the MLP adaptor and the LoRA parameters are trainable."
- Milestone25%: A trajectory-level metric indicating the proportion of executions that reach 25% task progress. "M25 and M50 are Milestone25% and Milestone50%."
- Milestone50%: A trajectory-level metric indicating the proportion of executions that reach 50% task progress. "M25 and M50 are Milestone25% and Milestone50%."
- multi-level event segmentation: Dividing videos into hierarchical event segments (coarse to fine) for structured annotation of transitions. "Event data is derived from A. Video Data through multi-level event segmentation and language annotation."
- multi-step denoising: Iterative refinement in generative models to transform noisy inputs into clean outputs. "The final predicted image is obtained through multi-step denoising."
- Next-State-Prediction modeling: A unified modeling approach centered on predicting future (or past) world states rather than just tokens or frames. "grounded in Next-State-Prediction modeling"
- parameter sharding: Splitting model parameters across devices to reduce per-device memory usage in distributed training. "enabling more flexible parameter sharding"
- proprioception: Internal sensing of a robot’s own state (e.g., joint angles, velocities) used as input for control. "receives the latent, robot proprioception state, and noisy action"
- state abstraction: Compressing raw multimodal inputs into compact, informative latent representations of world state. "learns a unified world latent space for state abstraction and state transition."
- state-transition modeling: Learning how states evolve over time or under conditions, serving as a unified paradigm across domains. "such a model should use state-transition modeling as a unified paradigm"
- teacher forcing: Training technique where ground-truth targets are fed into the model to guide next-step predictions. "to perform teacher forcing on the predicted latent"
- Unconscious learning: A paradigm that learns dense, natural state transitions purely from observation without explicit labels. "Unconscious learning aims to learn natural and dense state transitions from continuous video."
- VAE: Variational Autoencoder; a generative model that encodes data into a latent distribution and decodes it back. "through a frozen VAE"
- VLM: Vision-Language Model; a model jointly trained on visual and textual data to align both modalities. "uses a native pre-trained VLM"
- VQA: Visual Question Answering; a task where models answer questions about visual content. "VQA response generation"
- world latent space: A shared latent representation capturing the underlying state of the world across modalities. "Orca learns a world latent space"
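To make the Chunked Cross-Entropy Loss entry more concrete, here is a minimal pure-Python sketch of the general idea. It assumes the chunking runs over tokens, so only a small block of logits exists at any time; the paper's actual implementation is not described in this summary, and all names below are illustrative. The check at the end shows that the chunked loss equals the loss computed from the full logit matrix.

```python
import math


def logits_for(hidden, weight):
    """Logits for one token: project a hidden vector onto every vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def token_nll(logits, target):
    """Negative log-likelihood of `target` via a numerically stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]


def full_cross_entropy(hiddens, weight, targets):
    """Baseline: build the whole (tokens x vocab) logit matrix, then average."""
    all_logits = [logits_for(h, weight) for h in hiddens]
    return sum(token_nll(l, t) for l, t in zip(all_logits, targets)) / len(targets)


def chunked_cross_entropy(hiddens, weight, targets, chunk_size=2):
    """Same loss, but logits are built and discarded one token chunk at a time."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        chunk_h = hiddens[start:start + chunk_size]
        chunk_t = targets[start:start + chunk_size]
        chunk_logits = [logits_for(h, weight) for h in chunk_h]  # chunk_size x vocab
        total += sum(token_nll(l, t) for l, t in zip(chunk_logits, chunk_t))
    return total / len(targets)


# Toy sizes: 5 tokens, hidden size 2, vocabulary of 3.
hiddens = [[0.1, 0.2], [0.5, -0.3], [1.0, 0.0], [-0.4, 0.9], [0.3, 0.3]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
targets = [0, 2, 1, 1, 0]
full = full_cross_entropy(hiddens, weight, targets)
chunked = chunked_cross_entropy(hiddens, weight, targets)
print(abs(full - chunked) < 1e-12)
```

With a real vocabulary, the full logit matrix is often the largest tensor in the loss computation, which is why processing it in chunks saves memory.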