Distinguish whether VLA task successes reflect competence or chance

Determine whether successful task executions by Visual Language Action (VLA) models on robotic manipulation benchmarks (e.g., VLATest on SimplerEnv scenarios) stem from learned competence or from stochastic chance, by establishing execution-level criteria or diagnostics that can draw this distinction for each individual run.

Background

The paper argues that the binary success/failure metrics used in current evaluations of VLA models do not capture execution quality or model confidence. In reviewing results from VLATest, the authors observed that many nominally successful runs appeared to be of low quality, which raised the concern that some successes were accidental rather than driven by competence.

This motivates the need to ascertain the underlying cause of each recorded success, separating genuine policy capability from luck or incidental factors. Current evaluation practice does not reveal that cause.
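To make the competence-versus-chance framing concrete, the sketch below shows one simple aggregate-level check. It is not a method from the paper and does not solve the per-run problem stated above. It assumes the same scene can be rolled out several times and that a chance-level success rate is available, for example from a random-action baseline. The function name `binomial_tail` and the parameter `chance_rate` are hypothetical, and the numbers in the usage lines are purely illustrative.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance_rate: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, chance_rate).

    A small value means the observed success count would be unlikely
    if every rollout succeeded only with the assumed chance-level rate.
    """
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    if not 0.0 <= chance_rate <= 1.0:
        raise ValueError("chance_rate must lie in [0, 1]")
    # Sum the binomial probability mass from `successes` up to `trials`.
    return sum(
        comb(trials, k) * chance_rate**k * (1.0 - chance_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Illustrative values only: 7 successes in 10 repeated rollouts of one
# scene, with an assumed chance-level success rate of 0.2.
p_value = binomial_tail(successes=7, trials=10, chance_rate=0.2)
print(f"P(at least 7 of 10 successes by chance) = {p_value:.6f}")
```

A check of this kind only says whether a batch of successes is compatible with chance. It cannot label any single run as competent or lucky, which is exactly the gap that per-run, execution-level diagnostics are meant to close.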

References

Moreover, it was often unclear whether the task was completed successfully due to model competence or merely by chance.

— Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots (2507.17049 - Valle et al., 22 Jul 2025) in Introduction (Section 1)
