Evaluating Agents That Autonomously Generate Their Own Tasks
Establish reliable evaluation protocols and criteria for large language model (LLM) agents that autonomously generate their own tasks in open-ended environments. Such protocols should enable systematic assessment of agent capabilities, behaviors, and progress without relying solely on performance on predefined, single-run tasks.
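To make the problem concrete, below is a minimal sketch of one possible multi-run evaluation harness. Everything in it is an assumption for illustration: the `Agent` interface, the lexical diversity proxy, and the reported metrics are hypothetical, not a protocol from the cited paper.

```python
"""Minimal sketch of a multi-run evaluation harness for agents that
generate their own tasks. All interfaces here (the Agent protocol, the
diversity measure, the metric names) are illustrative assumptions."""

from dataclasses import dataclass, field
from typing import Protocol
import statistics


class Agent(Protocol):
    """Hypothetical interface: the agent proposes a task, then attempts it."""
    def propose_task(self) -> str: ...
    def attempt(self, task: str) -> bool: ...


@dataclass
class RunRecord:
    tasks: list[str] = field(default_factory=list)
    successes: list[bool] = field(default_factory=list)


def task_diversity(tasks: list[str]) -> float:
    """Crude lexical diversity proxy: fraction of unique task strings.
    A real protocol might use embedding distances instead."""
    return len(set(tasks)) / len(tasks) if tasks else 0.0


def evaluate(agent: Agent, n_runs: int = 5, tasks_per_run: int = 10) -> dict:
    """Aggregate behavior over several independent runs, rather than a
    single predefined task, so variance and progress become visible."""
    records: list[RunRecord] = []
    for _ in range(n_runs):
        record = RunRecord()
        for _ in range(tasks_per_run):
            task = agent.propose_task()
            record.tasks.append(task)
            record.successes.append(agent.attempt(task))
        records.append(record)

    success_rates = [statistics.mean(r.successes) for r in records]
    return {
        "mean_success_rate": statistics.mean(success_rates),
        "success_rate_stdev": statistics.pstdev(success_rates),
        "mean_task_diversity": statistics.mean(
            task_diversity(r.tasks) for r in records
        ),
    }
```

The design choice worth noting is the aggregation across independent runs: reporting dispersion alongside the mean exposes instability that a single predefined-task score would hide.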
References
"Evaluating agents that generate their own tasks remains an open challenge; here we summarize qualitative observations of our system."
— Nachkov et al., "LLM Agents Beyond Utility: An Open-Ended Perspective" (arXiv:2510.14548, 16 Oct 2025), Section 3 (Qualitative Results), opening paragraph