Simulation-to-real validation of paraphrase-robustness vulnerabilities

Determine whether the paraphrase-robustness vulnerabilities observed for vision–language–action (VLA) models when evaluated with LIBERO-Para in the LIBERO simulation environment persist when these models are deployed on physical robotic platforms.

Background

All experiments and analyses in the paper are conducted in the LIBERO simulation environment using the proposed LIBERO-Para benchmark and the PRIDE metric. The authors find substantial performance degradation under paraphrased instructions and attribute failures largely to planning-level errors driven by object-referent variation.
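The note names the PRIDE metric but does not give its formula. A minimal sketch follows, assuming (hypothetically) that PRIDE is computed as the relative drop in task success rate when original instructions are replaced by paraphrases; the paper's actual definition may differ.

```python
# Hypothetical PRIDE-style score: this note does not state the paper's exact
# formula, so we ASSUME it measures the relative drop in task success rate
# under paraphrased instructions (0.0 = fully robust, 1.0 = all success lost).

def pride_score(orig_success: list[bool], para_success: list[bool]) -> float:
    """Relative success-rate degradation under paraphrases (assumed form)."""
    orig_rate = sum(orig_success) / len(orig_success)
    para_rate = sum(para_success) / len(para_success)
    if orig_rate == 0:
        return 0.0  # no baseline competence to degrade
    return max(0.0, (orig_rate - para_rate) / orig_rate)


if __name__ == "__main__":
    # Toy rollouts: 8/10 successes on original phrasing, 5/10 on paraphrases.
    orig = [True] * 8 + [False] * 2
    para = [True] * 5 + [False] * 5
    print(round(pride_score(orig, para), 3))  # 0.375
```

Under this assumed definition, a sim-to-real study would compare the score computed from simulated rollouts against the same score computed from physical-robot trials on matched tasks.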

However, simulations may not fully reflect real-world factors such as sensor noise, physics modeling, and rendering fidelity. The authors explicitly note that further validation is needed to determine whether the paraphrase-robustness vulnerabilities documented in simulation also occur on physical robots.

References

This study evaluates VLA models within the LIBERO simulation environment. As simulations differ from real-world settings in rendering fidelity, physics modeling, and sensor noise, further validation is required to determine whether the observed vulnerabilities in paraphrase robustness persist on physical robotic platforms.

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models (2603.28301 - Kim et al., 30 Mar 2026) in Limitations