LifelongAgentBench: Unified LLM Benchmark
- LifelongAgentBench is a unified benchmark that assesses LLM agents' ability to acquire, transfer, and retain skills over sequential, interdependent tasks.
- It employs Dockerized interactive environments and automatic label verification to rigorously measure metrics like catastrophic forgetting and transfer efficiency.
- The framework features a modular design with human-like task dependencies, enabling precise evaluation of memory retention and adaptive skill composition.
LifelongAgentBench is a unified, reproducible benchmark framework specifically devised to evaluate the lifelong learning ability of LLM agents. In contrast to existing evaluation protocols that measure static reasoning over isolated tasks, LifelongAgentBench systematically tests agents for their capacity to accumulate, retain, and transfer skills across a dependency-rich sequence of skill-grounded, interactive tasks. It centers on three domains—Database (DB), Operating System (OS), and Knowledge Graph (KG)—and incorporates fully automated label verification, modular extensibility, and protocols for quantitative assessment of catastrophic forgetting, transfer efficiency, and retention in LLM-based agents (Zheng et al., 17 May 2025).
1. Benchmark Motivation and Design Principles
LifelongAgentBench addresses a foundational limitation observed in LLM-based agents: the inability to encode, recall, and transfer acquired knowledge across temporally spaced, interdependent tasks. Existing benchmarks such as WebArena and AgentBench focus on single-turn, parallel, or isolated tasks, failing to capture knowledge accumulation or reuse, and lacking rigorous verification mechanisms. LifelongAgentBench is constructed to directly evaluate whether an agent can: (a) acquire atomic skills, (b) transfer these skills to new, dependent tasks, and (c) maintain competence over arbitrarily long, non-i.i.d. task sequences.
Inter-task dependencies are intrinsic to all benchmark environments. The goal is to closely approximate the conditions under which human-like lifelong learning is required: each new task’s objective and available state are conditionally sampled from the agent’s previous interactions, enforcing both memory retrieval and compositional reasoning.
2. Interactive Environments and Skill Taxonomy
LifelongAgentBench offers three Dockerized interactive environments, each with a defined atomic skill set and deterministic, automatable interfaces for label verification:
| Environment | Implementation | Skills / Primitives |
|---|---|---|
| Database (DB) | MySQL container | 22 SQL primitives (SELECT, WHERE, GROUP BY, subqueries, etc.) |
| Operating System (OS) | Ubuntu container | 29 Bash commands (cp, mv, grep, sed, mkdir, etc.) |
| Knowledge Graph (KG) | in-memory SPARQL engine | 7 graph ops (get_relations, argmax, intersection, etc.) |
Each environment is engineered so that the task construction, initial state, and goals of each new task depend directly and explicitly on the agent's previously completed actions. For example, in the DB domain, SQL queries are constructed from sampled subsets of skills, and inter-task skill overlap is quantitatively measured as

$$\mathrm{overlap}(t) = \frac{|S_t \cap S_{t-1}|}{|S_t|},$$

where $S_t$ is the skill set of task $t$ and $\mathrm{overlap}(t)$ is the proportion of skills in task $t$ that also appear in task $t-1$. OS sequences involve persistent modifications to user/group state and file-system artifacts, requiring agents to recall and manipulate historical artifacts. In KG tasks, variable bindings and intermediate results persist and must be referenced in subsequent queries.
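To make the overlap measure concrete, the minimal Python sketch below computes the proportion of a task's skills shared with its predecessor. The skill names and task sets are invented for illustration, not drawn from the benchmark's actual task pool.

```python
def skill_overlap(current_skills: set[str], previous_skills: set[str]) -> float:
    """Proportion of the current task's skills that also appear in the previous task."""
    if not current_skills:
        return 0.0
    return len(current_skills & previous_skills) / len(current_skills)

# Illustrative DB example: the new task reuses SELECT and WHERE from its predecessor.
prev_task = {"SELECT", "WHERE", "ORDER BY"}
curr_task = {"SELECT", "WHERE", "GROUP BY"}
print(skill_overlap(curr_task, prev_task))  # 2/3 ≈ 0.67
```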
3. Automatic Label Verification
A central design feature is automatic, environment-anchored label verification, enabling large-scale, reproducible evaluation without manual inspection.
- DB: For SELECT, returned tuples are checked against ground truth. For mutation operations, the post-action table’s MD5 hash is compared to canonical answers.
- OS: Each Bash primitive must return exit code 0. File system and group state transitions are verified through checksums or inode comparisons.
- KG: Results of each SPARQL query are matched to the expected output by executing the reference action plan.
To capture edge-case inconsistency, 10% of all tasks in every environment are subject to human audit.
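As an illustration of the DB mutation check described above, the sketch below hashes a canonical serialization of the post-action table and compares it to the stored answer. The serialization scheme and function names are assumptions for the example, not the benchmark's exact verifier.

```python
import hashlib


def table_md5(rows: list[tuple]) -> str:
    """Hash a deterministic serialization of the post-action table state."""
    canonical = "\n".join(",".join(map(str, row)) for row in sorted(rows))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def verify_mutation(post_action_rows: list[tuple], expected_md5: str) -> bool:
    """Environment-anchored check: the mutated table must match the canonical answer."""
    return table_md5(post_action_rows) == expected_md5
```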
4. Evaluation Metrics and Protocols
LifelongAgentBench models agent interaction as a sequence of goal-conditioned POMDPs, and formalizes several quantitative metrics:
- Task Success Rate: Let $s_t \in \{0, 1\}$ indicate success on task $t$. Cumulative success over $N$ tasks is $\mathrm{SR} = \frac{1}{N}\sum_{t=1}^{N} s_t$.
- Cumulative Reward: With the agent receiving reward $r_t$ on task $t$, the total reward is $R = \sum_{t=1}^{N} r_t$.
- Knowledge Retention Rate: $\mathrm{KRR} = \frac{1}{M}\sum_{t=1}^{M} \tilde{s}_t$, where $\tilde{s}_t$ indicates success on task $t$ when it is revisited. Here, after the completion of all $N$ tasks, the first $M$ tasks are revisited to assess retention.
- Transfer Efficiency: $\mathrm{TE}_t = s_t - s_t^{\mathrm{iso}}$, where $s_t^{\mathrm{iso}}$ indicates success on task $t$ attempted in isolation. This measures the performance benefit on task $t$ due to prior cumulative experience.
Strict sequential execution is enforced; no parallelism or shuffling is permitted.
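For concreteness, the short sketch below computes these sequential-evaluation quantities from per-task outcomes. The variable names (successes, rewards, revisit_successes) are invented for the example; the benchmark's evaluation harness may expose them differently.

```python
def success_rate(successes: list[int]) -> float:
    """SR: mean of the per-task success indicators s_t."""
    return sum(successes) / len(successes)


def cumulative_reward(rewards: list[float]) -> float:
    """R: total reward accumulated over the sequence."""
    return sum(rewards)


def knowledge_retention(revisit_successes: list[int]) -> float:
    """KRR: success on the first M tasks when revisited after the full sequence."""
    return sum(revisit_successes) / len(revisit_successes)


def transfer_efficiency(seq_success: int, isolated_success: int) -> int:
    """TE_t: benefit from prior cumulative experience vs. attempting the task cold."""
    return seq_success - isolated_success


successes = [1, 0, 1, 1]        # s_t over a 4-task sequence
rewards = [1.0, 0.2, 0.9, 1.0]  # r_t
print(success_rate(successes))      # 0.75
print(cumulative_reward(rewards))   # 3.1
```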
5. Empirical Findings: Limitations of Experience Replay
Standard continual learning often relies on experience replay to preserve old knowledge. However, for LLM-based agents, naïvely concatenating large replay buffers rapidly saturates the context window, introducing irrelevant or low-quality historical content. Empirical evaluations with Llama-3.1-8B highlight this:
- In DB: Success increases from 0.19 to 0.78 as replay grows to 64 examples.
- In OS: Success peaks at 4–16 replayed examples (0.43→0.50) but then degrades.
- In KG: Small replay buffers help moderately, but large buffers cause out-of-memory errors.
Reasoning-optimized models such as DeepSeek-R1, which produce long chain-of-thought traces, exacerbate this due to excessive context length, ultimately decreasing downstream execution fidelity.
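A minimal sketch of the failure mode, assuming an invented serialization and a crude whitespace token proxy: under naive replay, the prompt grows roughly linearly with the number of replayed examples, which is what eventually saturates the context window.

```python
def build_replay_prompt(task_prompt: str, replay_buffer: list[str]) -> str:
    """Naive strategy: concatenate every past interaction verbatim before the task."""
    history = "\n\n".join(replay_buffer)
    return f"{history}\n\n{task_prompt}"


def rough_token_count(text: str) -> int:
    """Crude whitespace proxy; real tokenizers differ."""
    return len(text.split())


buffer = [f"Example {i}: <past interaction trace>" for i in range(64)]
prompt = build_replay_prompt("Solve the current DB task.", buffer)
print(rough_token_count(prompt))  # grows with every replayed example
```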
6. Group Self-Consistency Mechanism
LifelongAgentBench introduces a group self-consistency mechanism to address context-window overflow and stabilize predictions under replay. The method partitions retrieved experience examples into groups, performs independent inference for each, and selects the final action by majority voting:
```
Inputs: current_task_prompt P, replay_buffer B with K examples, group_count G

Split B into G disjoint subsets B_1, …, B_G (size K/G each)
for g in 1..G:
    prompt_g = P + serialize(B_g)
    answer_g = LLM(prompt_g)
final_answer = majority_vote(answer_1, …, answer_G)
return final_answer
```
In controlled experiments with Llama-3.1-8B on DB with $K = 16$ replayed examples:
- Without grouping: accuracy = 0.61, average tokens 17,874.
- With 16 groups: accuracy = 0.75, average tokens 2,888.
In KG, token count drops from ~56k to ~11k with negligible accuracy loss. This mechanism delivers both context compression and output stabilization.
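The Python sketch below mirrors the pseudocode above. It is a minimal, hedged rendering: `llm_call` stands in for whatever model backend is used, and the prompt serialization is an assumption rather than the benchmark's exact format.

```python
from collections import Counter
from typing import Callable


def group_self_consistency(
    task_prompt: str,
    replay_buffer: list[str],
    group_count: int,
    llm_call: Callable[[str], str],
) -> str:
    """Split replayed examples into disjoint groups, query the model once per group,
    and majority-vote the per-group answers."""
    # Disjoint strided partition keeps each group (and thus each prompt) small.
    groups = [replay_buffer[i::group_count] for i in range(group_count)]
    answers = []
    for group in groups:
        prompt = "\n\n".join(group) + "\n\n" + task_prompt
        answers.append(llm_call(prompt))
    # Majority vote over the per-group predictions stabilizes the final action.
    return Counter(answers).most_common(1)[0][0]
```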
7. Reproducibility, Extensibility, and Best Practices
LifelongAgentBench is architected for modular extensibility. The platform is organized around six components—ModelPool, Agent, Environment, HistoryFactory, Controller, Callbacks—communicating via a thin RPC protocol. New environments are integrated by subclassing the abstract Environment interface and providing deterministic verification for domain-specific tasks. Benchmark reproducibility is ensured through container snapshots, fixed random seeds, and rigorous code scrutiny.
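A hypothetical sketch of the extension point described above: a new domain is added by subclassing an abstract Environment interface and supplying a deterministic verifier. The method names and signatures here are assumptions for illustration; the released codebase's actual API may differ.

```python
from abc import ABC, abstractmethod


class Environment(ABC):
    @abstractmethod
    def reset(self, task_id: str) -> str:
        """Prepare the (containerized) state for a task; return the initial observation."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, bool]:
        """Execute one agent action; return (observation, done)."""

    @abstractmethod
    def verify(self, task_id: str) -> bool:
        """Deterministically compare the final state to the ground-truth label."""


class CustomEnvironment(Environment):
    """Stub for a new domain; each method would wrap the domain's own tooling."""

    def reset(self, task_id: str) -> str:
        return "initial observation"

    def step(self, action: str) -> tuple[str, bool]:
        return "observation after action", True

    def verify(self, task_id: str) -> bool:
        return True
```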
Best practices distilled from the benchmark include:
- Constructing fine-grained skill taxonomies to ensure systematic coverage of atomic capabilities.
- Investing in memory-efficient retrieval and experience compression, especially as task sequences scale.
- Leveraging modular callback interfaces to experiment with new lifelong skill acquisition or retrieval strategies.
- Ensuring all labels and outcomes are validated via environment-anchored, deterministic verifiers.
LifelongAgentBench thus provides the methodological backbone for studying knowledge accumulation, catastrophic forgetting, and information transfer in LLM-based agents, directly facilitating innovation in adaptive, memory-capable agent design (Zheng et al., 17 May 2025).