
LifelongAgentBench: Unified LLM Benchmark

Updated 13 December 2025
  • LifelongAgentBench is a unified benchmark that assesses LLM agents' ability to acquire, transfer, and retain skills over sequential, interdependent tasks.
  • It employs Dockerized interactive environments and automatic label verification to rigorously measure metrics like catastrophic forgetting and transfer efficiency.
  • The framework features modular design with human-like task dependencies, enabling precise evaluation of memory retention and adaptive skill composition.

LifelongAgentBench is a unified, reproducible benchmark framework specifically devised to evaluate the lifelong learning ability of LLM agents. In contrast to existing evaluation protocols that measure static reasoning over isolated tasks, LifelongAgentBench systematically tests agents for their capacity to accumulate, retain, and transfer skills across a dependency-rich sequence of skill-grounded, interactive tasks. It centers on three domains—Database (DB), Operating System (OS), and Knowledge Graph (KG)—and incorporates fully automated label verification, modular extensibility, and protocols for quantitative assessment of catastrophic forgetting, transfer efficiency, and retention in LLM-based agents (Zheng et al., 17 May 2025).

1. Benchmark Motivation and Design Principles

LifelongAgentBench addresses a foundational limitation observed in LLM-based agents: the inability to encode, recall, and transfer acquired knowledge across temporally spaced, interdependent tasks. Existing benchmarks such as WebArena and AgentBench focus on single-turn, parallel, or isolated tasks, failing to capture knowledge accumulation or reuse, and lacking rigorous verification mechanisms. LifelongAgentBench is constructed to directly evaluate whether an agent can: (a) acquire atomic skills, (b) transfer these skills to new, dependent tasks, and (c) maintain competence over arbitrarily long, non-i.i.d. task sequences.

Inter-task dependencies are intrinsic to all benchmark environments. The goal is to closely approximate the conditions under which human-like lifelong learning is required: each new task’s objective and available state are conditionally sampled from the agent’s previous interactions, enforcing both memory retrieval and compositional reasoning.

2. Interactive Environments and Skill Taxonomy

LifelongAgentBench offers three Dockerized interactive environments, each with a defined atomic skill set and deterministic, automatable interfaces for label verification:

| Environment | Implementation | Skills / Primitives |
|---|---|---|
| Database (DB) | MySQL container | 22 SQL primitives (SELECT, WHERE, GROUP BY, subqueries, etc.) |
| Operating System (OS) | Ubuntu container | 29 Bash commands (cp, mv, grep, sed, mkdir, etc.) |
| Knowledge Graph (KG) | In-memory SPARQL engine | 7 graph operations (get_relations, argmax, intersection, etc.) |

Each environment is engineered so that the task construction, initial state, and goals of each new task depend directly and explicitly on the agent’s previously completed actions. For example, in the DB domain, SQL queries are constructed with sampled subsets of skills, and inter-task skill overlap is quantitatively measured as:

\mathrm{Overlap}(m, n) = \frac{2\,a^{(m)}\,a^{(n)}}{a^{(m)} + a^{(n)}}

where $a^{(m)}$ is the proportion of skills in task $m$ that overlap with task $n$. OS sequences involve persistent modifications to user/group state and file-system artifacts, requiring agents to recall and manipulate historical artifacts. In KG tasks, variable bindings and intermediate results persist and must be referenced in subsequent queries.
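
The overlap score above is a symmetric combination of the two per-task skill proportions. A minimal Python sketch, assuming each task exposes its skill set as a plain set of primitive names (the `skill_overlap` function and its signature are illustrative, not part of the benchmark's API):

```python
from typing import Set

def skill_overlap(skills_m: Set[str], skills_n: Set[str]) -> float:
    """Harmonic-mean-style overlap between two tasks' skill sets (illustrative)."""
    shared = skills_m & skills_n
    if not shared:
        return 0.0
    a_m = len(shared) / len(skills_m)  # proportion of task m's skills shared with n
    a_n = len(shared) / len(skills_n)  # proportion of task n's skills shared with m
    return 2 * a_m * a_n / (a_m + a_n)

# Example: two DB tasks sharing SELECT and WHERE out of three skills each.
print(skill_overlap({"SELECT", "WHERE", "GROUP BY"}, {"SELECT", "WHERE", "JOIN"}))  # 0.666...
```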

3. Automatic Label Verification

A central design feature is automatic, environment-anchored label verification, enabling large-scale, reproducible evaluation without manual inspection.

  • DB: For SELECT, returned tuples are checked against ground truth. For mutation operations, the post-action table’s MD5 hash is compared to canonical answers.
  • OS: Each Bash primitive must return exit code 0. File system and group state transitions are verified through checksums or inode comparisons.
  • KG: Results of each SPARQL query are matched to the expected output by executing the reference action plan.

To catch edge-case inconsistencies, 10% of all tasks in every environment are additionally subjected to human audit.
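
As a concrete illustration of the DB rule, the sketch below hashes a deterministic serialization of the post-action table and compares it to a canonical answer. The use of Python's sqlite3 (standing in for the MySQL container), the function names, and the serialization scheme are assumptions for the example, not the benchmark's actual verifier.

```python
import hashlib
import sqlite3

def table_md5(conn: sqlite3.Connection, table: str) -> str:
    """Hash a deterministic serialization of the table's rows (illustrative scheme)."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    return hashlib.md5(repr(rows).encode("utf-8")).hexdigest()

def verify_mutation_task(conn: sqlite3.Connection, table: str, canonical_md5: str) -> bool:
    """A mutation task passes iff the post-action table hash matches the canonical answer."""
    return table_md5(conn, table) == canonical_md5
```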

4. Evaluation Metrics and Protocols

LifelongAgentBench models agent interaction as a sequence of goal-conditioned POMDPs, and formalizes several quantitative metrics:

  • Task Success Rate: Let $s_i \in \{0, 1\}$ indicate success on task $i$. Cumulative success over $T$ tasks is

S(T) = \sum_{i=1}^{T} s_i

  • Cumulative Reward: As each agent receives reward $R_i := s_i$, the total reward over $N$ tasks is

R_{\text{cum}} = \sum_{i=1}^{N} R_i

  • Knowledge Retention Rate:

\mathrm{KR} = \frac{1}{M} \sum_{i=1}^{M} s_i^{\text{after}}

Here, after the completion of all $N$ tasks, the first $M$ tasks are revisited to assess retention.

  • Transfer Efficiency:

\mathrm{TE} = \frac{1}{N-1} \sum_{i=2}^{N} \bigl(s_i - s_i^{\text{baseline}}\bigr)

This measures the performance benefit on task $i$ due to prior cumulative experience.

Strict sequential execution is enforced; no parallelism or shuffling is permitted.
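
Each metric reduces to a simple aggregation over per-task success indicators. A minimal sketch, assuming success flags are available as Python lists of 0/1 values (the function names are illustrative):

```python
def task_success(successes: list[int]) -> int:
    """S(T): cumulative number of solved tasks; with R_i := s_i this also equals R_cum."""
    return sum(successes)

def knowledge_retention(revisited_successes: list[int]) -> float:
    """KR: mean success when the first M tasks are re-run after the full sequence."""
    return sum(revisited_successes) / len(revisited_successes)

def transfer_efficiency(successes: list[int], baseline_successes: list[int]) -> float:
    """TE: average gain on tasks 2..N relative to a baseline without accumulated experience."""
    gains = [s - b for s, b in zip(successes[1:], baseline_successes[1:])]
    return sum(gains) / len(gains)
```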

5. Empirical Findings: Limitations of Experience Replay

Standard continual learning often relies on experience replay to preserve old knowledge. However, for LLM-based agents, naïvely concatenating large replay buffers rapidly saturates the context window, introducing irrelevant or low-quality historical content. Empirical evaluations with Llama-3.1-8B highlight this:

  • In DB: Success increases from 0.19 to 0.78 as replay grows to 64 examples.
  • In OS: Success peaks at 4–16 replayed examples (0.43→0.50) but then degrades.
  • In KG: Small replay buffers help moderately, but large buffers cause out-of-memory errors.

Reasoning-optimized models such as DeepSeek-R1, which produce long chain-of-thought traces, exacerbate this due to excessive context length, ultimately decreasing downstream execution fidelity.

6. Group Self-Consistency Mechanism

LifelongAgentBench introduces a group self-consistency mechanism to address context-window overflow and stabilize predictions under replay. The method partitions retrieved experience examples into $G$ groups, performs independent inference for each, and selects the final action by majority voting:

Inputs: current_task_prompt P, replay_buffer B with K examples, group_count G
Split B into G disjoint subsets B_1,…,B_G (size K/G each)
for g in 1..G:
    prompt_g = P + serialize(B_g)
    answer_g = LLM(prompt_g)
final_answer = majority_vote(answer_1,…,answer_G)
return final_answer
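
In runnable form, the same procedure might look like the sketch below; the llm and serialize callables are placeholders for a model client and an environment-specific prompt formatter, not components of the released framework.

```python
from collections import Counter
from typing import Callable, Sequence

def group_self_consistency(
    llm: Callable[[str], str],             # placeholder model client: prompt -> answer
    serialize: Callable[[Sequence], str],  # placeholder replay-example formatter
    current_task_prompt: str,
    replay_buffer: Sequence,
    group_count: int,
) -> str:
    # Split the replay buffer into G disjoint, roughly equal subsets.
    groups = [replay_buffer[g::group_count] for g in range(group_count)]
    # One independent inference per group keeps each prompt short.
    answers = [llm(current_task_prompt + serialize(group)) for group in groups]
    # Majority vote over the G answers selects the final action.
    return Counter(answers).most_common(1)[0][0]
```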

In controlled experiments with Llama-3.1-8B on DB with $K = 16$:

  • Without grouping: accuracy = 0.61, average tokens ≈ 17,874.
  • With 16 groups: accuracy = 0.75, average tokens ≈ 2,888.

In KG, token count drops from ~56k to ~11k with negligible accuracy loss. This mechanism delivers both context compression and output stabilization.

7. Reproducibility, Extensibility, and Best Practices

LifelongAgentBench is architected for modular extensibility. The platform is organized around six components—ModelPool, Agent, Environment, HistoryFactory, Controller, Callbacks—communicating via a thin RPC protocol. New environments are integrated by subclassing the abstract Environment interface and providing deterministic verification for domain-specific tasks. Benchmark reproducibility is ensured through container snapshots, fixed random seeds, and rigorous code scrutiny.
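
As an illustration of the extension path, a toy environment might look like the sketch below. The Environment base class shown here, its method names, and the EchoEnvironment example are hypothetical stand-ins for the framework's actual abstract interface, which is defined in the released codebase.

```python
from abc import ABC, abstractmethod

class Environment(ABC):
    """Hypothetical stand-in for the framework's abstract environment interface."""

    @abstractmethod
    def reset(self, task: dict) -> str:
        """Prepare the task's initial state and return the first observation."""

    @abstractmethod
    def step(self, action: str) -> str:
        """Apply the agent's action and return the resulting observation."""

    @abstractmethod
    def verify(self) -> bool:
        """Deterministically check the final state against the task's label."""

class EchoEnvironment(Environment):
    """Toy environment: the task is solved if the last action equals the target string."""

    def reset(self, task: dict) -> str:
        self.target, self.last_action = task["target"], None
        return "Echo the target string."

    def step(self, action: str) -> str:
        self.last_action = action
        return "ok"

    def verify(self) -> bool:
        return self.last_action == self.target
```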

Best practices distilled from the benchmark include:

  • Constructing fine-grained skill taxonomies to ensure systematic coverage of atomic capabilities.
  • Investing in memory-efficient retrieval and experience compression, especially as task sequences scale.
  • Leveraging modular callback interfaces to experiment with new lifelong skill acquisition or retrieval strategies.
  • Ensuring all labels and outcomes are validated via environment-anchored, deterministic verifiers.

LifelongAgentBench thus provides the methodological backbone for studying knowledge accumulation, catastrophic forgetting, and information transfer in LLM-based agents, directly facilitating innovation in adaptive, memory-capable agent design (Zheng et al., 17 May 2025).
