SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Published 17 Jun 2026 in cs.RO and cs.CV | (2606.18610v1)

Abstract: Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.


Authors (12)

Summary

The paper introduces SC3-Eval, a framework that combines forward-inverse dynamics and cross-view consistency for more faithful robot policy evaluation.
The methodology mitigates autoregressive rollout errors by using an uncertainty-driven early termination mechanism based on inverse dynamics.
Experimental results show high Pearson correlation and low ranking errors, validating SC3-Eval’s diagnostic precision for both in-distribution and out-of-distribution evaluations.

SC3-Eval: Self-Consistent Video-Based Policy Evaluation for Robot Foundation Models

Overview

This work presents SC3-Eval, a systematic framework for evaluating generalist robot manipulation policies via action-conditioned, self-consistent video generation. SC3-Eval addresses critical issues in model-based policy evaluation, notably compounding error in autoregressive rollouts, multi-view coherence, and generalization to out-of-distribution policies. By enforcing forward-inverse dynamics and cross-view consistency during training, and leveraging inverse dynamics for test-time uncertainty estimation and early termination, SC3-Eval substantially improves the faithfulness and diagnostic utility of video-world-model-based policy evaluators.

Technical Contributions

Self-Consistent Training

Central to SC3-Eval is the adaptation of a unified video-action Transformer backbone (UVA; Li et al., 28 Feb 2025; Zhu et al., 3 Apr 2025), fine-tuned with three mutually supporting objectives:

Forward-Inverse Dynamics Consistency: The model is jointly trained for both forward (predicting future frames from actions) and inverse (recovering actions from frames) dynamics. All parameters are shared; this regularization forces the forward model to generate rollouts from which the intended actions can be decoded, tightly anchoring rollouts to the physical action manifold and robustly mitigating drift endemic in forward-only models.
Cross-View Consistency: For real-world settings using multiple synchronized cameras (e.g., third-person and wrist-mounted views), the model is tasked to inpaint held-out views from the remainder and actions, explicitly enforcing geometric and temporal coherence across views. This mechanism is more sample-efficient and targeted than concatenation-based spatial self-attention.
Test-Time Consistency via Inverse Dynamics: At inference, the inverse dynamics head is reused to estimate per-chunk consistency: after generating frames from an action chunk, the model infers which actions would have produced those frames, and a high $\ell_2$ discrepancy between the requested and recovered actions triggers early termination. This uncertainty-driven truncation requires no change to the training procedure and is crucial for suppressing trajectory drift, especially under distributional shift.
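
The test-time check described above can be sketched as a closed-loop rollout that compares requested and recovered actions after every chunk. This is a minimal illustration, not the authors' implementation: `policy`, `forward_model`, and `inverse_model` are hypothetical callables, the mean per-step $\ell_2$ distance is one plausible way to aggregate the discrepancy, and discarding the chunk that tripped the threshold is an assumption.

```python
import numpy as np

def rollout_with_consistency_check(policy, forward_model, inverse_model,
                                   obs, max_chunks, threshold):
    """Closed-loop rollout that stops once recovered actions drift from requested ones.

    All three callables are hypothetical stand-ins:
      policy(obs) -> action chunk of shape (T, action_dim)
      forward_model(obs, actions) -> generated frames for that chunk
      inverse_model(obs, frames) -> actions decoded from the generated frames
    """
    kept = [obs]
    for chunk_idx in range(max_chunks):
        actions = np.asarray(policy(obs), dtype=float)
        frames = forward_model(obs, actions)
        recovered = np.asarray(inverse_model(obs, frames), dtype=float)
        # Mean per-step L2 distance between requested and recovered actions.
        discrepancy = float(np.linalg.norm(actions - recovered, axis=-1).mean())
        if discrepancy > threshold:
            # Assumption: the drifting chunk is dropped and the rollout ends here.
            return kept, chunk_idx, True
        kept.append(frames)
        obs = frames[-1]  # the last generated frame becomes the next observation
    return kept, max_chunks, False
```

The function returns the trusted frames, the number of chunks kept, and a flag saying whether the rollout was cut short, so a downstream scorer can treat truncated rollouts separately.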

The system is instantiated with Cosmos3-Nano (Aditi et al., 1 Jun 2026) pretraining, trained on a 381-hour, three-view table-bussing dataset, and subjected to extensive evaluation with vision-language-action policies.

Experimental Results

Faithful Closed-Loop Policy Evaluation

On both in-distribution (table bussing) and out-of-distribution (reverse mapping, unseen at train time) settings, SC3-Eval achieves robust closed-loop policy evaluation:

Pearson Correlation: Achieves a closed-loop $r$ of 0.929, exceeding three prior video-model baselines, Ctrl-World (Guo et al., 11 Oct 2025), IRASim (Zhu et al., 2024), and Cosmos-Predict 2.5 (NVIDIA et al., 28 Oct 2025), by a substantial margin in three of four settings.
Mean Maximum Rank Violation (MMRV): Achieves 0.119, lowest in all evaluation regimes, reflecting excellent pairwise policy ranking for checkpoint selection.
Outcome-Level Fidelity: Rather than merely predicting aggregate success rates, SC3-Eval accurately reproduces per-trajectory failure modes (e.g., language-following, lifting, or placing errors), with the highest reproduction rate across all failure categories. This property renders SC3-Eval superior for fine-grained diagnostics relative to aggregate-only baselines.
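
For readers who want to reproduce the two headline metrics on their own data, the sketch below computes Pearson $r$ and MMRV from paired per-policy success rates. The MMRV formula follows the definition commonly used in prior simulated-evaluation work (for each policy, the largest real-world gap to any policy the evaluator ranks in the wrong order, averaged over policies); the summary does not spell out the formula, so treat that as an assumption. The function names are illustrative.

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between real and evaluator-predicted success rates."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    return float(np.corrcoef(real, sim)[0, 1])

def mmrv(real, sim):
    """Mean Maximum Rank Violation (lower is better)."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    # violation[i, j] is True when the evaluator orders policies i and j
    # differently from the real-world results.
    violation = (sim[:, None] < sim[None, :]) != (real[:, None] < real[None, :])
    # The size of each violation is the real-world performance gap.
    gap = np.abs(real[:, None] - real[None, :])
    return float((gap * violation).max(axis=1).mean())
```

An evaluator that preserves the real ordering scores an MMRV of 0 even if its absolute success rates are off, which is why the two metrics are reported together.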

Effectiveness of Consistency Mechanisms

Ablation studies confirm that both the inverse dynamics and cross-view inpainting losses are essential and complementary: removal of either increases policy ranking error and degrades out-of-distribution generalization. The uncertainty-driven early termination mechanism is shown to be particularly effective in suppressing error accumulation on OOD splits, outperforming static truncation or pure autoregressive rollouts.
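
One way to picture the training modes ablated above is as different noising masks over a shared sequence of video and action tokens: the noised groups are what the model must reconstruct, and the clean groups are its conditioning. The sketch below is an assumption-laden illustration, not the paper's code. The mode probabilities are made up, history frames are omitted, and the cross-view branch keeps one randomly chosen view clean while the others are inpainted.

```python
import numpy as np

MODES = ("forward", "inverse", "cross_view")

def sample_noise_mask(rng, num_views, mode_probs=(0.4, 0.3, 0.3)):
    """Pick a training mode and mark which token groups get noised.

    Returns (mode, noise_views, noise_actions), where noise_views is a boolean
    array with one entry per camera view. mode_probs is a hypothetical mixture.
    """
    mode = MODES[rng.choice(len(MODES), p=mode_probs)]
    noise_views = np.zeros(num_views, dtype=bool)
    noise_actions = False
    if mode == "forward":
        noise_views[:] = True      # reconstruct video tokens from clean actions
    elif mode == "inverse":
        noise_actions = True       # reconstruct action tokens from clean video
    else:
        noise_views[:] = True      # inpaint every view except one clean view
        noise_views[rng.integers(num_views)] = False
    return mode, noise_views, noise_actions
```

Dropping the "inverse" or "cross_view" entry from the mixture corresponds to the kind of ablation the paragraph above describes.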

Robustness to Distribution Shift

SC3-Eval generalizes effectively to the reverse table bussing variant, which shares low-level skill primitives with the training data but presents unseen object-to-destination associations. The strong performance on this OOD setting attests to both the robustness of the unified backbone and the anchoring effect of the consistency objectives.

Limitations and Future Work

Despite its fidelity, SC3-Eval’s inference speed (2.3 seconds per rollout chunk on a high-end GPU) is orders of magnitude below that of fast analytical simulators, limiting throughput during iterative policy development. The reported evaluation scenario focuses on short-horizon (20 s) manipulation; expected failure modes for longer-horizon tasks include cumulative drift and degradation of visual coherence. Strategies for addressing these include leveraging longer pretraining trajectories, exploration of hierarchical rollout decomposition, and architectural innovations for long-term memory (e.g., Xiao et al., 16 Apr 2025).

Implications

SC3-Eval demonstrates that forward-inverse consistent, autoregressive video world models can serve as highly reliable policy evaluators, closely tracking real-world policy rankings and providing transparent, interpretable rollouts. The method’s diagnostic granularity supports not only checkpoint selection but also targeted improvement of generalist policies. By avoiding reliance on physics-based simulators or per-scene geometric reconstruction, SC3-Eval establishes a practical path for rapid, scalable evaluation of generalist robot policies in diverse table-top manipulation domains.

The test-time uncertainty signal derived from the inverse dynamics head is immediately applicable for rollout reliability assessment and for automated scoring or active data curation. The framework is modular and amenable to transfer across domains and policy families, given an appropriate pretraining prior and sufficient action-conditioned observation data.

Conclusion

SC3-Eval introduces a principled and empirically validated foundation for video-model-based evaluation of robot manipulation policies. By enforcing forward-inverse and cross-view consistency in training, and leveraging inverse dynamics for test-time uncertainty estimation, SC3-Eval delivers state-of-the-art closed-loop faithfulness and diagnostic transparency. This architectural and methodological template sets a new standard for scalable, high-fidelity policy evaluation in robot learning. Future work may focus on acceleration, scaling to longer horizons, and generalization to multi-task, multi-domain scenarios.

Explain it Like I'm 14

Overview: What this paper is about

This paper is about a new way to test robot “brains” (policies) without always using a real robot, which is slow, costly, and hard to scale. The authors build SC3‑Eval, a system that imagines what a robot’s cameras would see in the future if the robot took certain actions. Think of it like a very smart, physics‑aware movie generator that lets us practice and grade robot behavior in a virtual world.

Key questions the paper asks

Can we predict how well a real robot policy will do by “rolling it out” inside a video simulator instead of on a physical robot?
How do we stop small prediction mistakes from snowballing into big errors during long imagined rollouts?
If the robot uses multiple cameras (like two room views and a wrist camera), how do we keep all those views consistent with each other?
How can we tell, during testing, that the simulation has drifted off course and should be stopped?

How SC3‑Eval works (in simple terms)

Imagine a robot policy as a driver and the simulator as the road and scenery generator. The driver (policy) presses the pedals and turns the wheel (sends actions). The simulator shows what the world would look like next (future video). SC3‑Eval strengthens this simulator with three “consistency checks” so it stays realistic and helpful.

Here are the three big ideas, with everyday analogies:

Forward–inverse dynamics consistency:
- Forward is “actions → video” (if I move like this, what will I see?).
- Inverse is “video → actions” (given what I saw, what action must have happened?).
- SC3‑Eval trains both together in one model. This is like practicing both speaking and listening in a new language; doing both keeps you honest. If the forward mode makes weird frames, the inverse mode can’t recover the right actions, and the model learns to avoid drifting into impossible futures.
Cross‑view consistency (multi‑camera agreement):
- The robot watches the scene from several cameras at once. SC3‑Eval trains the model to fill in a missing camera view from the others, like asking one eyewitness to sketch the scene based on another eyewitness’s description. This keeps all views coherent over time.
Test‑time consistency (an on‑the‑fly “reality check”):
- During rollout, SC3‑Eval uses the inverse mode to “read back” the actions from the generated frames and compares them to the actions the policy actually sent. If they disagree too much, that’s a sign the simulation is drifting. The system then stops early instead of piling on more error. Think of it as a built‑in lie detector that says, “We’re off track—let’s not trust the rest of this rollout.”

A few more practical choices make it robust:

It predicts a slightly longer future chunk than the policy will actually use, then only keeps the first part. Training on longer snippets teaches better motion and contact dynamics, while still giving the policy short, fresh context at each step.
It’s built on a unified “video+action” backbone (a transformer that handles both), so the same model can do forward prediction, inverse action recovery, and view inpainting.

What the researchers did to test it

They trained SC3‑Eval on 381 hours of real robot videos doing “table bussing” (moving dishes and trash to the right places) using three synchronized cameras (two external and one wrist). Then they evaluated seven different real robot policies in two ways:

Offline (open‑loop): feed the real actions into the simulator and see if the generated video matches reality.
Online (closed‑loop): let the policy act on the generated frames, just like it would on a real robot, and score how well the simulator predicts the policy’s real‑world success.
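
For the offline check, a common way to score how closely generated frames match the real ones is PSNR (peak signal-to-noise ratio, higher is better). The sketch below assumes frames are arrays with pixel values in the range 0 to 255; the function name and that value range are illustrative choices, not details taken from the paper.

```python
import numpy as np

def psnr(real, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frame arrays."""
    real = np.asarray(real, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((real - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

PSNR only measures pixel-level agreement, so it says little about whether the robot's success or failure was reproduced; that is what the online evaluation is for.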

They also tested an out‑of‑distribution (OOD) variant called “reverse table bussing” (swap where objects should go) to check generalization.

Main findings and why they matter

Strong match to real‑world performance:
- SC3‑Eval’s predicted policy success closely matches real success, with a high closed‑loop Pearson correlation around 0.93 (1.0 would be perfect). It also gets the best “ranking” score (MMRV ≈ 0.12; lower is better), meaning it orders policy checkpoints similarly to real‑world testing.
- In many cases, online (closed‑loop) evaluation is just as faithful as offline. That’s important because real policies will act on whatever images they see, including small simulation quirks.
Better than strong baselines:
- It outperforms three prior video‑model evaluators (Ctrl‑World, IRASim, and Cosmos‑Predict 2.5) across multiple settings.
Reproduces specific failure types, not just pass/fail:
- Beyond overall success rates, SC3‑Eval captures why runs fail (e.g., misunderstanding a command, failing to lift, or failing to place). This makes it useful for debugging, not just scoring.
Generalizes to a new task variant:
- Even when the destination mapping was swapped (OOD), the system stayed reliable, though—as expected—performance dipped somewhat compared to in‑distribution tasks.

Why this matters: Reliable and detailed evaluation in simulation lets researchers compare and improve robot policies much faster and more cheaply than doing every test on hardware.

Implications and future impact

Faster, cheaper robot development: Teams can screen many policy versions virtually, saving scarce robot time for the most promising candidates.
Better debugging: Because it reproduces specific failure modes, SC3‑Eval can guide fixes (e.g., is it a grasping issue or a misinterpreted instruction?).
Broader reach: The same ideas—forward/inverse training, multi‑view inpainting, and test‑time early stop—could help other video world models stay stable and trustworthy.

Limitations and what’s next:

It’s not yet real‑time; physics engines can still run faster, so speeding up the video generator is a key next step.
It’s validated on short tasks (~20 seconds). Longer tasks may need stronger long‑term memory or hierarchical goals to avoid drift over time.

In short: SC3‑Eval is a smarter, self‑checking “future video” simulator for robots. By keeping actions and images in sync, matching multiple cameras, and stopping when things go off‑track, it predicts real robot performance well and even mirrors the same kinds of mistakes—making it a practical tool for building better robot policies.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to guide follow-up research:

Generalization beyond one scene: The model is trained/evaluated in a single workspace with fixed camera placements; robustness to new scenes, backgrounds, lighting, and camera extrinsics/intrinsics is untested.
New robot embodiments and action spaces: Only delta end-effector (7D) actions are considered; transfer to other robots, control spaces (joint/torque), grippers, or bimanual setups is unknown.
Unseen objects and categories: Evaluation uses 12 known categories; performance with novel objects, materials, and clutter regimes is not assessed.
Broader OOD shifts: The out-of-distribution task only swaps destination semantics; robustness to pixel-level shifts (novel views, textures, occluders), dynamics shifts, and distractors remains unmeasured.
Camera robustness: Tolerance to missing views, desynchronization, variable frame rates, and moving cameras is not evaluated; cross-view inpainting may fail without fixed, calibrated geometry.
Lack of 3D/geometry grounding: Cross-view consistency uses inpainting without explicit 3D constraints; benefits of geometric priors (e.g., pose, depth, epipolar constraints, NeRF/splat fields) are unexplored.
Physics and contact fidelity: The evaluator relies on video likelihoods; incorporating physics priors or contact-aware dynamics for better manipulation realism is left open.
Long-horizon performance: Experiments are short (~20 s); failure modes and mitigation strategies for minute-scale or task-chained evaluations are not established.
Early-termination bias: How truncating rollouts affects success scoring, fairness across policies, and comparability to full-horizon real rollouts is not analyzed.
Uncertainty calibration: The inverse-dynamics discrepancy is used as an ad hoc reliability signal; its calibration to real error, sensitivity to the threshold T, and transferability across tasks/data shifts are not studied.
Ambiguity in inverse dynamics: Multiple action sequences can produce similar frames; how non-identifiability affects the discrepancy signal and early termination is unclear.
Alternative uncertainty schemes: Combining the inverse-discrepancy signal with ensembles, epistemic/aleatoric modeling, or predictive variance is not explored.
Closed-loop interaction effects: How policies adapt differently to model frames vs. real frames (e.g., perception biases, latency) lacks systematic behavioral analysis beyond correlation metrics.
Termination scheduling: The choice of chunk size, horizon decoupling (l′ vs. l), and adaptive schedules has not been optimized or theoretically grounded.
Metric breadth and calibration: Only Pearson r and MMRV are reported; confidence intervals, calibration curves, and error decompositions (by object, stage, or view) are missing.
Small policy sample: Only seven VLA checkpoints of one architecture are used; generalization to other policy families (RL, classical pipelines, visuomotor transformers) and larger policy sets is unknown.
Failure-mode taxonomy depth: The three-category taxonomy (language/lift/place) is coarse; finer-grained causal attribution (perception vs. planning vs. control) is not provided.
PSNR as a proxy: Offline PSNR may not reflect manipulation-relevant fidelity; perceptual/action-consistent metrics (e.g., LPIPS, task-conditioned scores) are not evaluated.
Scalability and efficiency: Inference is slow (2.3 s per chunk) and training uses 32 GB200 GPUs; empirical speed-accuracy trade-offs and system-level accelerations (sampler, caching) need quantification.
Data efficiency and scaling laws: The dependence of evaluator fidelity on dataset size, diversity, and model capacity (Cosmos3 variants) is not characterized.
Ablation depth: Mode-mixture probabilities, multi-FPS mixing, and pseudo-action augmentation lack systematic sensitivity analyses or principled selection criteria.
Robustness to policy outliers: Behavior when policies produce highly off-manifold actions (e.g., random/noisy) is not stress-tested; failure detection and safe termination policies under strong shift remain open.
View/occlusion handling: Re-entry success is shown qualitatively for the wrist view; quantitative robustness to long occlusions and persistent state-tracking without explicit memory is not measured.
Memory mechanisms: The approach eschews explicit long-term memory; comparisons to memory-augmented models (e.g., WorldMem) and hybrids are missing.
Scoring truncated rollouts: The exact treatment of early-terminated trajectories in success computation (e.g., penalty schemes, imputation) is not specified or validated for fairness.
Annotation automation: VLM-based auto-annotation using the discrepancy signal is mentioned but not benchmarked for accuracy, bias, or failure cases.
Safety and failure containment: How evaluator errors could mislead policy selection (e.g., optimistic bias on rare failures) and mitigation strategies are not analyzed.
Reproducibility and assets: Availability of code, trained weights, and the 381-hour dataset (or equivalents) is unclear; reproducibility across hardware/software stacks is not documented.
Theoretical grounding: There is no formal analysis of why forward–inverse sharing reduces drift or bounds its error; developing guarantees or conditions for consistency would strengthen the approach.


Practical Applications

Immediate Applications

Below are concrete ways SC3-Eval’s methods can be used today, along with likely sectors, enabling tools/workflows, and feasibility notes.

Policy screening, ranking, and gating without robot time
- Sectors: Robotics platforms; warehousing/logistics; manufacturing; consumer robotics
- What to deploy:
  - Integrate SC3-Eval as a “pre-deployment evaluator” in CI/CD for robot policies (e.g., a ROS node or cloud microservice that runs closed-loop rollouts and returns Pearson/MMRV scores and failure breakdowns)
  - Use uncertainty-driven early termination to stop unreliable imagined rollouts and avoid misleading scores
- Assumptions/dependencies:
  - Evaluator must be trained/fine-tuned on data similar to the target domain (objects, camera placements, action representation)
  - Requires synchronized multi-view video or retraining if only single view is available
  - GPU availability; current throughput is slower than real time (≈2.3 s per 24-frame chunk on a GB200)
Failure-mode diagnostics for rapid iteration
- Sectors: R&D, QA, and DevOps for robotics
- What to deploy:
  - “Virtual failures lab” that runs matched imagined rollouts against real rollouts to reproduce and classify failure types (language-following vs. lifting vs. placing), surfacing root causes earlier
  - Dashboards that visualize per-trajectory divergence and per-chunk inverse-consistency scores
- Assumptions/dependencies:
  - Requires a simple failure taxonomy and annotation guidelines (or adoption of the paper’s three criteria)
  - Quality of diagnostics improves with coverage of relevant objects and interactions in training data
Regression testing and checkpoint selection for VLA policies
- Sectors: Software for robot learning; robotics research labs; integrators
- What to deploy:
  - Standardized test suites that track MMRV and Pearson r across policy checkpoints
  - Fixed seed, matched initial-condition sets for reproducible, closed-loop evaluator runs
- Assumptions/dependencies:
  - Reliability evaluated at short-horizon manipulation (≈20 s episodes in the paper); longer tasks may require further tuning/data
  - Thresholds for early termination need light calibration on held-out trajectories
Cost-efficient QA by replacing a portion of real robot runs
- Sectors: Operations and fleet management
- What to deploy:
  - Shift a fraction of routine QA to SC3-Eval rollouts, reserving robot time for spot checks and outliers flagged by high uncertainty
- Assumptions/dependencies:
  - Correlation with real performance must be validated for each new task family or environment
  - Compute scheduling to amortize offline evaluation in batch overnight workflows
Consistent multi-view video synthesis for perception debugging
- Sectors: Perception/vision within robotics stacks
- What to deploy:
  - Use cross-view inpainting to generate coherent multi-camera scenes to probe camera placement, occlusion handling, and re-entry behavior of wrist or auxiliary views
- Assumptions/dependencies:
  - Works best where the evaluator was trained on similar multi-view rigs; may need re-training for different camera geometries
  - Not a substitute for diverse real imagery; treat as a complementary tool for controlled probes
Semi-automated policy outcome labeling
- Sectors: Data operations; dataset curation
- What to deploy:
  - Combine SC3-Eval’s per-chunk inverse-consistency signal with VLM-based scoring to pre-label success/failure and prioritize human review
- Assumptions/dependencies:
  - VLM quality and task-specific prompts must be validated; human spot checks remain necessary
  - Inverse-consistency is an empirical reliability indicator, not a calibrated probability
Methodological “recipes” for practitioners
- Sectors: Industry and academia
- What to deploy:
  - Adopt prediction–execution horizon decoupling (train with longer l′, execute with shorter l) for better generation fidelity during evaluation
  - Apply multi-FPS training and pseudo-action augmentation when policy action logs are sparse or noisy
- Assumptions/dependencies:
  - Benefit depends on the backbone’s pretraining horizon and the availability of synchronized pose/action logs
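
Several items above note that the early-termination threshold needs light calibration on held-out trajectories. One simple way to do that, offered here as an assumption rather than the paper's procedure, is to run the inverse-consistency check on held-out real trajectories and set the threshold at a high quantile of the observed discrepancies, so that only unusually inconsistent chunks trigger termination. The function name and the default quantile are illustrative.

```python
import numpy as np

def calibrate_threshold(heldout_discrepancies, quantile=0.95):
    """Choose a termination threshold as a high quantile of held-out discrepancies.

    heldout_discrepancies: per-chunk discrepancy scores measured on trajectories
    where the generated frames are known to be trustworthy (hypothetical input).
    """
    scores = np.asarray(heldout_discrepancies, dtype=float)
    if scores.size == 0:
        raise ValueError("need at least one held-out discrepancy score")
    return float(np.quantile(scores, quantile))
```

Because the discrepancy is not a calibrated probability, the chosen quantile is a tuning knob: a lower value terminates more rollouts, and a higher value tolerates more drift.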

Long-Term Applications

The following opportunities require additional research, engineering, scaling, or validation before broad deployment.

Cross-domain, generalist evaluator-as-a-service
- Sectors: Robotics across warehousing, manufacturing, service/home, agriculture, energy
- What could emerge:
  - A cloud service hosting a large, multi-task SC3-Eval variant to benchmark diverse manipulation policies without per-scene asset building
- Assumptions/dependencies:
  - Requires broad, curated multi-task/multi-embodiment datasets and substantial compute
  - Robustness to larger distribution shifts (materials, lighting, embodiments) must be demonstrated
Real-time safety monitor using forward–inverse consistency
- Sectors: Safety-critical robotics; healthcare; cobotics; standards/certification
- What could emerge:
  - A runtime “off-manifold detector” that compares commanded vs. inverse-recovered actions on real camera streams to pause/abort when behavior deviates
- Assumptions/dependencies:
  - Significant latency reduction (via model distillation, caching, or specialized accelerators) and rigorous threshold calibration
  - Certification evidence for conservative behavior under uncertainty
Training-time integration: conservative model-based policy optimization
- Sectors: Robot learning research; autonomous systems
- What could emerge:
  - Offline/online RL algorithms that use the inverse-consistency signal to terminate synthetic rollouts (MOReL-style) or to penalize rewards (MOPO-style), improving policy safety and sample efficiency
- Assumptions/dependencies:
  - Algorithmic work to stabilize policy learning with non-ensemble uncertainty
  - Larger-scale studies showing transfer from simulated rollouts to real robots
Long-horizon task evaluation with hierarchical structure
- Sectors: Service robotics; field robotics; home/assistive robots
- What could emerge:
  - Hierarchical evaluators that stitch subgoal-conditioned SC3-Eval segments, enabling minute-long evaluations of household chores, inspection, or maintenance tasks
- Assumptions/dependencies:
  - Longer training trajectories and/or subgoal supervision; improved temporal coherence in the backbone
Multi-camera system design and validation
- Sectors: Systems engineering; facility integration
- What could emerge:
  - Use cross-view inpainting metrics to score camera placements, redundancy, and calibration quality before physical install
- Assumptions/dependencies:
  - Extensions to arbitrary camera counts and geometry; robust calibration pipelines
Synthetic data generation for rare failures and stress testing
- Sectors: QA; safety; reliability engineering
- What could emerge:
  - Controlled generation of long-tail failure scenarios (e.g., transparent/reflective objects, clutter dynamics) for training perception and recovery policies
- Assumptions/dependencies:
  - Conditioning mechanisms to steer to specific edge cases; validation that synthesized failures mirror real ones
Standards and pre-certification protocols for generalist robot policies
- Sectors: Policy/regulation; consortia; insurers
- What could emerge:
  - Simulation-lite conformance tests that use video world models to assess policy robustness, failure modes, and guardrails prior to on-site trials
- Assumptions/dependencies:
  - Independent validation across vendors and tasks; published confidence intervals and known failure bounds
Cross-sector transfer: inspection/maintenance and surgical-assist evaluation
- Sectors: Energy/utilities (valve turning, panel ops); healthcare (assistive manipulation)
- What could emerge:
  - Task-specific evaluators that model multi-view camera rigs (borescopes, endoscopes, pole cams) for pre-clinical/field validation of manipulation policies
- Assumptions/dependencies:
  - Domain-specific data collection (materials, tools, anatomy/fixtures); stringent safety oversight
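
The training-time integration item above mentions two ways an uncertainty signal can make synthetic rollouts conservative: subtracting it from the reward (MOPO-style) or cutting the rollout off where it becomes unreliable (MOReL-style). The sketch below illustrates both ideas with the inverse-consistency discrepancy standing in for the uncertainty estimate. The function names, the penalty weight `lam`, and the `halt_penalty` value are hypothetical, and nothing here has been validated in the paper.

```python
import numpy as np

def penalized_rewards(rewards, discrepancies, lam=1.0):
    """MOPO-style: subtract a scaled uncertainty estimate from each step reward."""
    return np.asarray(rewards, float) - lam * np.asarray(discrepancies, float)

def truncate_at_threshold(rewards, discrepancies, threshold, halt_penalty=-1.0):
    """MOReL-style: stop at the first unreliable step and append a fixed penalty."""
    rewards = list(rewards)
    for t, u in enumerate(discrepancies):
        if u > threshold:
            # Steps before t are kept; the unreliable step is replaced by the penalty.
            return rewards[:t] + [halt_penalty]
    return rewards
```

The first variant keeps every step but discounts the untrustworthy ones, while the second discards everything after the first step whose discrepancy exceeds the threshold.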

Key cross-cutting dependencies to consider

Data relevance and coverage: Evaluator fidelity depends on training data closely matching the target distribution (objects, textures, kinematics, views).
Compute and latency: Current throughput is suitable for offline evaluation but not for tight real-time loops; acceleration and distillation are active needs.
Horizon limits: Short-horizon performance is strong; long-horizon stability will require hierarchical or memory-augmented extensions.
Calibration and thresholds: Early-termination thresholds and reliability signals must be tuned per task; they are not calibrated probabilities.
Multi-view synchronization: Benefits hinge on synchronized, consistent camera streams; miscalibration reduces cross-view gains.


Glossary

Action-conditioned video world model: A generative video model that predicts future frames conditioned on a sequence of robot actions, enabling simulated policy rollouts instead of real execution. "Action-conditioned video world models offer a scalable alternative by simulating policy rollouts."
Autoregressive drift: The gradual deviation that accumulates when a model repeatedly feeds its own predictions back as inputs over time. "to identify the inverse dynamics objective as an implicit grounding mechanism that mitigates autoregressive drift."
Autoregressive rollout: A rollout where future predictions are generated step-by-step using the model’s previous outputs, making errors compound over time. "Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution."
Cross-view consistency: A training or inference constraint ensuring that multiple synchronized camera views remain physically and visually coherent with each other. "Cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism."
Cross-view inpainting: Predicting a held-out camera view from the remaining views (and actions) to enforce multi-view coherence. "We add a cross-view inpainting mode that randomly selects one view and asks the model to inpaint the other views, so that each view is supervised to stay consistent with the rest."
Delta end-effector pose (delta-EE): A robot action representation as the change in the end-effector’s pose (translation, rotation) and gripper state over a timestep. "Each action is recorded as a delta end-effector pose (delta-EE) and represented as a $7$-dimensional vector with three components for translation, three for axis-angle rotation, and one for gripper width."
Distribution shift: A mismatch between the training data distribution and the conditions encountered during evaluation or deployment. "Distribution shift between training data and the rollouts a learned dynamics model is queried at is a long-standing challenge in model-based reinforcement learning (RL)"
Early termination (uncertainty-driven): A strategy to stop simulated rollouts when a model-derived uncertainty or inconsistency signal indicates the rollout has become unreliable. "an uncertainty-driven early-termination criterion derived without modification to the training procedure from the inverse dynamics mode"
Ensemble disagreement: An uncertainty estimation technique that measures how much multiple models differ in their predictions. "Classical responses estimate the model's predictive uncertainty, often through ensemble disagreement~\citep{pets}"
Flow matching: A generative modeling objective that learns a continuous flow transporting noise to data, here used to train video-action token denoising. "the per-instance loss is a flow matching objective~\citep{lipman2023flow,liu2023flow} on the noised tokens"
Forward dynamics: Predicting future observations (e.g., video frames) given past observations and the action sequence. "We exploit the model's shared token space and jointly train a forward dynamics mode that reconstructs noised video tokens from the action stream, and an inverse dynamics mode that reconstructs noised action tokens from the video."
Forward-inverse dynamics consistency: Joint training that enforces agreement between forward predictions (frames from actions) and inverse predictions (actions from frames), grounding generation in plausible actions. "Forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold"
Gaussian splatting: A 3D scene reconstruction/rendering technique that represents scenes as collections of Gaussian primitives for fast, photorealistic rendering. "Real-to-sim approaches such as PolaRiS~\citep{polaris} reconstruct real scenes via Gaussian splatting but require per-scene reconstruction."
Inverse dynamics: Inferring the actions that produced observed state transitions or video frames. "and an inverse dynamics mode that reconstructs noised action tokens from the video."
Mean Maximum Rank Violation (MMRV): A metric that quantifies how much a predicted ranking violates the true ranking, focusing on worst-case pairwise errors. "The Mean Maximum Rank Violation (MMRV)~\citep{li2024simpler} captures the consistency of pairwise policy rankings, the relative ordering property most directly relevant to checkpoint selection in practice."
Off-manifold: Being outside the set of physically plausible states or action-state combinations learned during training. "terminates the rollout once it drifts off-manifold."
Offline RL (offline reinforcement learning): Learning policies from a fixed dataset without interacting with the environment during training. "rather than during offline RL training, where uncertainty must be folded into rewards or used to construct a pessimistic MDP."
Open-loop: Executing or evaluating a fixed action sequence without using feedback from generated or observed states. "offline (open-loop) rollouts that condition the world model on the real-world action sequence (isolating video fidelity from policy interaction)"
Out-of-distribution: Inputs, tasks, or behaviors that differ significantly from those seen during training. "It generalizes to a held-out task variant absent from the training data, and beyond predicting aggregate success rates, it reproduces the specific failure modes that policies exhibit in real-world rollouts. Qualitative rollout videos are available on the project website... generalizes to an out-of-distribution task semantic,"
Pearson correlation coefficient: A statistic measuring linear correlation between two sets of values (here, predicted vs. real success rates). "The Pearson correlation coefficient $r(R, R_{\mathcal{W}})$ captures linear agreement between predicted and real-world success rates"
Pessimistic MDP: A modified Markov Decision Process that integrates uncertainty estimates to penalize or terminate trajectories in uncertain regions. "MOReL~\citep{morel} converts the uncertainty signal into a pessimistic MDP that terminates rollouts once a threshold is crossed"
Receding-horizon schedule: A control or evaluation scheme that plans over a longer horizon but executes only the first part before replanning with new observations. "We follow the policy's own receding-horizon schedule."
Rectified-flow: A flow-based generative modeling approach with straightened paths that can improve sampling and training stability. "inheriting the rectified-flow formulation of the Cosmos3 backbone."
Sim-to-real gap: The discrepancy between behaviors in simulation and real-world performance due to modeling and fidelity limitations. "leaving residual sim-to-real gaps."
Unified dynamics model: A single transformer-based model that, via masked token training, can perform forward prediction, inverse action inference, and view inpainting. "SC3-Eval builds on the unified dynamics model architecture, such as UVA~\citep{li2025uva,zhu2025uwm}, a single transformer that operates jointly on video and action tokens."
Video foundation model: A large-scale pre-trained video model that serves as a general-purpose backbone adaptable to new tasks. "a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator"
Vision-language-action (VLA) policy: A robot policy that conditions on visual inputs and natural language instructions to produce action commands. "Seven VLA policy checkpoints with the $\pi_{0.5}$~\citep{pi05} architecture are evaluated in our experiments."
World simulator: A learned or modeled environment that predicts the consequences of actions, enabling policy evaluation without real-world execution. "Our goal is to construct a world simulator $\mathcal{W}$ such that the policy performances $R_{\mathcal{W}, i}$ obtained by rolling out each $\pi_i$ inside $\mathcal{W}$ correlate strongly with the real-world performances $R_i$."
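
The two agreement metrics defined in this glossary, the Pearson correlation coefficient and MMRV, can be illustrated with a short sketch. It is not taken from the paper's code. The MMRV formula below follows the definition commonly attributed to the cited SIMPLER work (mean over policies of the largest real-performance gap among pairs whose ordering the simulator gets wrong), and that reading is an assumption here. The function names `pearson_r` and `mmrv` and the success rates in the usage note are hypothetical.

```python
import numpy as np


def pearson_r(real, sim):
    """Linear agreement between real and simulated success rates."""
    real = np.asarray(real, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return float(np.corrcoef(real, sim)[0, 1])


def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed SIMPLER-style definition).

    For each pair (i, j), the violation is |real_i - real_j| when the
    simulator orders the pair differently from the real world, else 0.
    MMRV averages, over policies i, the largest violation involving i.
    """
    real = np.asarray(real, dtype=float)
    sim = np.asarray(sim, dtype=float)
    real_less = real[:, None] < real[None, :]   # is real_i < real_j ?
    sim_less = sim[:, None] < sim[None, :]      # is sim_i  < sim_j  ?
    gap = np.abs(real[:, None] - real[None, :])
    violation = gap * (real_less != sim_less)   # zero where orderings agree
    return float(violation.max(axis=1).mean())
```

For example, with hypothetical real success rates `[0.2, 0.5, 0.9]`, a simulator that reports `[0.3, 0.4, 0.8]` preserves every pairwise ordering and so has an MMRV of zero, while one that reports `[0.5, 0.3, 0.8]` swaps the first two policies and is penalized by their real-world gap. Lower MMRV and higher Pearson correlation both indicate a more faithful evaluator.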

