Evaluating Agents That Autonomously Generate Their Own Tasks

Establish reliable evaluation protocols and criteria for large language model agents that autonomously generate their own tasks in open-ended environments, enabling systematic assessment of their capabilities, behaviors, and progress without relying solely on predefined, single-run task performance.

Background

The paper studies an LLM-based ReAct agent augmented with goal generation and persistent memory to operate in open-ended settings. Unlike traditional single-run task solvers, such agents must choose their own tasks, interact over extended horizons, and leave persistent artifacts, making conventional benchmarks inadequate.
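For concreteness, the sketch below illustrates one way such an agent loop could be structured: a ReAct-style step loop wrapped in an outer loop that generates goals and writes to a memory that persists across episodes. The names (`PersistentMemory`, `open_ended_agent`, the `llm` callable, `execute`) are illustrative assumptions, not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PersistentMemory:
    """Append-only store that survives across episodes (illustrative only)."""
    entries: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.entries.append(note)

    def recall(self, k: int = 5) -> List[str]:
        return self.entries[-k:]


def execute(action: str) -> str:
    """Placeholder environment step; a real setting would return a true observation."""
    return f"result of {action}"


def open_ended_agent(llm: Callable[[str], str], memory: PersistentMemory,
                     episodes: int = 3, max_steps: int = 10) -> None:
    """ReAct-style loop in which the agent also proposes its own goal each episode."""
    for _ in range(episodes):
        # Goal generation: the agent sets its own task, conditioned on past memory.
        goal = llm("Given past notes:\n" + "\n".join(memory.recall())
                   + "\nPropose the next goal to pursue.")
        observation = "environment reset"
        for _ in range(max_steps):
            # Reason-act step: the model produces a thought and an action.
            response = llm(f"Goal: {goal}\nObservation: {observation}\n"
                           "Think step by step, then output ACTION: <action> or DONE.")
            if "DONE" in response:
                break
            observation = execute(response)
        # Persist what was learned so later episodes can build on it.
        memory.write(f"Pursued goal: {goal}; final observation: {observation}")
```

Because each run of this outer loop produces different goals and different persistent artifacts, there is no fixed success criterion to score against, which is exactly the evaluation gap the open problem targets.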

In the qualitative evaluation, the authors note that assessing such agents is inherently difficult because they do not operate under fixed objectives or horizons. This motivates the need for new evaluation methodologies tailored to agents that set and pursue their own goals.

References

"Evaluating agents that generate their own tasks remains an open challenge; here we summarize qualitative observations of our system."

LLM Agents Beyond Utility: An Open-Ended Perspective (2510.14548 - Nachkov et al., 16 Oct 2025) in Section 3 (Qualitative Results), opening paragraph