APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Published 31 Mar 2026 in cs.CL, cs.AI, and cs.IR | (2603.29093v2)

Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces APEX-EM, a non-parametric online learning framework that uses structured procedural-episodic memory to significantly improve autonomous agents’ performance.
It employs a novel Procedural Knowledge Graph (PKG) and a hybrid retrieval mechanism for iterative self-correction, achieving notable gains on benchmarks like KGQAGen-10k and BigCodeBench.
The system enables robust cross-domain transfer and reduces iteration counts, facilitating both positive and negative learning without updating model parameters.

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Motivation and Context

LLM-based autonomous agents, despite improvements in multi-step reasoning, code generation, and complex tool use, exhibit a fundamental deficiency in procedural memory. Unlike RL agents with experience replay buffers that enable knowledge accumulation and reinforcement through transitions, LLM agents re-derive solutions ab initio for each new task, regardless of prior structurally analogous experiences. Existing memory augmentations—narrative logs, code snippet repositories, entity graphs, and episodic buffers—are unstructured or domain-fragmented, limiting deep procedural reuse. Even recent works using value-based episodic memory (e.g., MemRL) rely on flat, text-based summaries and scalar Q-values, lacking dual-outcome learning and robust cross-domain procedural transfer.

APEX-EM directly addresses these limitations by instituting a non-parametric online learning framework centered on structured experience replay and hybrid retrieval, enabling genuine procedural-episodic memory for LLM-based agents. The system accumulates, retrieves, and reuses complete procedural plans, supporting both positive and negative transfer via layered, type-rich memory structures and compositional retrieval mechanisms.

System Architecture and Methodology

Structured Experience Representation

Central to APEX-EM is the Procedural Knowledge Graph (PKG). The PKG is a typed, relational ontology capturing entities, experiences, operations, sub-tasks, and task topics interconnected through semantically meaningful edges (e.g., FOLLOWED_BY, USES, structurally_similar_to, supersedes). Each experience encodes:

Goal Reflection: Task definition, constraints, and verification contracts.
Procedure Reflection: Parameterized, versioned procedural templates.
Evidence Reflection: Provenance-tagged support artifacts.
Execution Trace: Iterative attempts with results and feedback.
Error Registry: Structured error types, root causes, and recovery protocols.
Patch Reflection: Instrumented corrections and rationale.

Successful experiences serve as positive samples, whereas failed executions—with error decomposition—are retained for explicit negative guidance, directly analogous to hindsight experience replay in RL.

PRGII Workflow

Task processing is governed by the Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow (Figure 1), which systematically decomposes each task as follows:

Figure 1: The PRGII workflow. A task $t$ flows through Plan, Retrieve, Generate, Iterate, and Ingest phases, with PKG mediating both structured retrieval and entity resolution.

Plan: Extracts intent, entity, and schema hypotheses, computes a structural signature (abstracted operation sequences).
Retrieve: Hybrid retrieval queries memory by semantic similarity (embedding space), structural signature match (LCS-based), and plan graph traversal, yielding dual-tracks $P^+$ (successes) and $P^-$ (failures).
Generate: LLM prompts are conditioned on both the structured plan and contextually relevant retrieved experiences.
Iterate: Multi-step self-correction using rigorous validators and feedback loops (including judge and oracle signals).
Ingest: Evaluation and quality gating; integration of successful/failed traces into PKG, updating entity/procedure versions as required.

Hybrid Retrieval and Cross-Domain Structural Transfer

APEX-EM's hybrid retrieval mechanism integrates:

Semantic search (embeddings on Goal Reflection fields)
Structural signature matching (LCS/Jaccard over operation sequences, yielding robust cross-domain transfer)
PKG traversal (guided multi-hop graph walks)

Unlike prior works tied to surface-level text similarity, this architecture enables retrieval and adaptation of structurally analogous procedural traces across semantically distinct task domains.

Empirical Analysis and Results

Evaluation Benchmarks

Comprehensive evaluation spans three primary domains:

BigCodeBench (code generation)—high-complexity API usage and code synthesis tasks.
KGQAGen-10k (multi-hop structured query generation on Wikidata).
Humanity's Last Exam (HLE) (multi-domain expert-level reasoning).

Comparisons are made under frozen-backbone conditions (Claude Sonnet/Opus 4.5) against MemRL and a spectrum of RAG/episodic baselines.

Main Results

Numerical Gains

KGQAGen-10k: APEX-EM attains 89.6% accuracy (95.3% CSR), a +48.3 pp improvement over the baseline and exceeding the oracle retrieval upper bound (GPT-4o + SP at 84.9%).
BigCodeBench: Achieves 83.3% SR/84.0% CSR from 53.9%, markedly surpassing MemRL's +11.0pp on comparable splits.
HLE: Reaches 48% SR from 25.2%, with entity-graph retrieval, outperforming all baseline strategies.

Figure 2: Epoch-by-epoch success rates for ablation configurations across benchmarks, demonstrating strong early-phase improvements and eventual plateauing characteristic of non-parametric online updates.

Ablation Analysis

Component contribution is task-dependent:

Rich judge feedback (qualitative reward decomposition) is critical for structured query generation (+10.3pp on KGQA) but marginal for code generation, where pass/fail is sufficient.
Iterative self-correction can partially substitute for finer feedback quality, particularly in domains where pattern adaptation is easier.

Iteration Efficiency

Memory accumulation reduces average iteration count per task (key for computational cost):

On BCB, 13%–17% fewer iterations; on KGQA, ~40% reduction.
Figure 3: Iteration distribution shift from E1 to E10, with a significant increase in first-attempt solutions and reduction in maximum-iteration edge cases.

Cross-Domain Transfer

APEX-EM's structural signatures enable robust transfer even when task embeddings are disjoint. Unlike MemRL, which relies on semantic retrieval, APEX-EM maintains effectiveness on benchmarks (such as HLE) where intra-set embedding similarity is intentionally low.

Qualitative Dynamics

Cold-start is mitigated via curated seed plans and offline-to-online bootstrapping.
PKG maintenance (compaction, entity swaps, patching) prevents memory bloat and guarantees up-to-date procedural generalization.

Theoretical and Practical Implications

The APEX-EM framework firmly establishes non-parametric online learning as a tractable, model-agnostic pathway for agent knowledge accumulation without model weight updates. By promoting structured, type-rich procedural traces and dual-outcome learning, it closes the procedural memory gap for LLM agents and offers full auditability and modular repair that flat/episodic memories cannot.

Practically, APEX-EM provides increased sample efficiency (fewer inferences per solved task), robust negative transfer avoidance, and directly supports continuous deployment in dynamic environments (e.g., evolving APIs or KBs via PKG entity versioning).

Figure 4: BCB iteration reduction across epochs, demonstrating consolidation toward first-attempt correctness due to memory adaptation.

Figure 5: KGQA iteration reduction, with a pronounced reduction in multi-attempt solutions over online experience accumulation.

Future Directions

Evaluating on additional embodied and tool-use agent benchmarks (e.g., ALFWorld, Lifelong Agent Bench).
Automated extraction and optimization of structural signatures for finer-grained procedural induction.
Integrating Q-value estimation (MemRL-style) with PKG-based structure for composite utility-aware retrieval.
Plan/prompt compression for improved context usage without degradations in procedural fidelity.

Conclusion

APEX-EM advances the state of the art in agentic memory by unifying non-parametric online learning, structured procedural-episodic replay, and hybrid compositional retrieval. The paradigm shift from unstructured episodic memory to layered, type-rich, and negative-indexed experience enables LLM-based agents to realize persistent, generalizable learning at deploy time, without parameter updates. This positions APEX-EM as a foundational methodology for future research in memory-augmented and self-evolving autonomous systems (2603.29093).