Reproducibility of UI-TARS 1.5 and similar GUI agents on AndroidWorld

Ascertain whether the reported performance of GUI interaction models such as UI-TARS 1.5 can be reproduced on the AndroidWorld benchmark under standardized evaluation conditions, and identify factors that prevent replication of the published results.

Background

While evaluating Android GUI agents, the authors of the cited paper (Yu et al., 2025) set out to use models such as Qwen2.5-VL and UI-TARS. They report that they had difficulty finding evaluation scripts compatible with AndroidWorld and could not fully replicate the published performance, particularly for UI-TARS 1.5. Similar reproducibility concerns have been raised publicly by others.

Establishing reproducibility on AndroidWorld is important for fair comparison and for validating claims about agent capabilities in realistic device-control tasks. Standardized evaluation procedures would help determine whether reported metrics are achievable and under what conditions.
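One way to make "achievable" concrete is to ask whether a published success rate falls inside a confidence interval around the rate measured in a replication run. The sketch below is only an illustration of that idea, not a procedure taken from the cited paper or from AndroidWorld itself. It assumes each task attempt is scored as a binary success or failure, and it uses a Wilson score interval. The function names `wilson_interval` and `is_consistent` are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def is_consistent(reported_rate: float, successes: int, trials: int, z: float = 1.96) -> bool:
    """Return True if the reported rate lies inside the replication's interval."""
    low, high = wilson_interval(successes, trials, z)
    return low <= reported_rate <= high
```

A check of this kind only addresses sampling noise. It does not account for differences in prompts, action spaces, emulator configuration, or step limits, which are the sort of conditions a standardized evaluation procedure would need to pin down.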

References

"However, we were unable to fully reproduce the reported performance, especially for UI-Tars 1.5."

— Dyna-Mind: Learning to Simulate from Experience for Better AI Agents (2510.09577 - Yu et al., 10 Oct 2025) in Appendix, Additional Details on AndroidWorld — Other Implementation/Evaluation Details
