Automated, reliable task-success evaluation for robotic manipulation
Develop a fully automated and reliable methodology for assessing task success in real-world robotic manipulation videos. The methodology should be robust to diverse failure modes and ambiguous semantics, and should eliminate the reliance on proxy metrics and partial human validation in evaluation suites such as the Embodied World Model Benchmark (EWMBench).
References
Fully automated and reliable assessment of task success, particularly under diverse failure modes and ambiguous semantics, remains an open challenge.
— Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
(2508.05635 - Liao et al., 7 Aug 2025) in Limitations, bullet "Evaluation Methodology"