Reproducibility of UI-TARS 1.5 and similar GUI agents on AndroidWorld
Determine whether the reported performance of GUI interaction models such as UI-TARS 1.5 can be reproduced on the AndroidWorld benchmark under standardized evaluation conditions, and identify the factors that prevent replication of the published results.
References
However, we were unable to fully reproduce the reported performance, especially for UI-Tars 1.5.
— Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
(2510.09577 - Yu et al., 10 Oct 2025) in Appendix, Additional Details on AndroidWorld — Other Implementation/Evaluation Details