
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Published 6 Jan 2026 in cs.CL | (2601.03192v1)

Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While LLMs possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.

Summary

  • The paper presents a framework that uses non-parametric reinforcement learning to update episodic memory via a value-based retrieval mechanism.
  • It demonstrates significant performance gains over baselines in tasks such as code generation and embodied navigation.
  • The approach effectively prevents catastrophic forgetting while enabling safe, continual adaptation without modifying the core LLM.

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Motivation and Background

LLMs have demonstrated strong few-shot reasoning and in-context learning capabilities, yet they lack the distinctive human ability to perform constructive episodic simulation: flexibly synthesizing solutions for novel tasks by leveraging and adapting past experiences. Standard approaches—such as fine-tuning or parameter-efficient continual learning—are often computationally expensive and susceptible to catastrophic forgetting. Non-parametric methods, most notably Retrieval-Augmented Generation (RAG), passively retrieve information based on semantic similarity without any consideration of the historical utility of retrieved knowledge, causing persistent issues when encountering similar but non-transferable or noisy memories.

This paper introduces MEMRL, a principled framework that enables self-evolving, memory-augmented LLM agents through non-parametric reinforcement learning (RL) directly on episodic memory. The central tenet is a strict decoupling of the LLM’s frozen, stable reasoning function from a plastic, utility-optimized episodic memory system. MEMRL formalizes memory-augmented decision-making as a value-based memory retrieval problem and updates utility estimates online using environmental feedback—thereby reconciling the stability-plasticity dilemma in agentic continual learning.

MEMRL: Framework and Operational Principle

Memory Structure: Intent-Experience-Utility Triplet

The episodic memory bank in MEMRL is architected as triplets (z, e, Q), where z denotes the intent embedding (semantic representation of the user query), e encodes the experience (solution trace or trajectory), and Q captures the learned utility (expected return of retrieving e under similar intents). This triplet structure enables each memory item to be contextually and functionally indexed.
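A minimal sketch of such a triplet store might look like the following. The class and field names (`MemoryItem`, `MemoryBank`, `write`) are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    intent: list          # z: embedding of the user query
    experience: str       # e: solution trace or trajectory summary
    q_value: float = 0.0  # Q: learned utility (expected return of reuse)

@dataclass
class MemoryBank:
    items: list = field(default_factory=list)

    def write(self, intent, experience, q_init=0.0):
        # New experiences enter the bank with an initial utility estimate.
        item = MemoryItem(intent, experience, q_init)
        self.items.append(item)
        return item
```

Keeping the utility alongside the embedding is what lets retrieval rank by learned value rather than by similarity alone.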

Two-Phase Retrieval: Semantic and Utility-Driven

Retrieval proceeds in two distinct phases:

  1. Phase A (Semantic Recall): For the current intent s, a candidate pool C(s) is constructed from memories whose cosine similarity to s exceeds a threshold.
  2. Phase B (Value-Aware Selection): Candidates from C(s) are scored via a weighted combination of semantic similarity and normalized Q-value:

score(s, z, e) = (1 − λ) · sim(s, z) + λ · Q(z, e)

where λ modulates the trade-off between immediate semantic fit and empirically learned utility. The top-k memories are composed into the retrieval context.
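The two phases can be sketched as below. This is a simplified illustration: the `Mem` tuple, the threshold `tau`, and the min-max normalization of Q within the pool are assumptions for readability (the paper's ablations mention z-score normalization for Q):

```python
import math
from collections import namedtuple

Mem = namedtuple("Mem", "intent experience q_value")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_z, bank, tau=0.5, lam=0.5, k=3):
    # Phase A: semantic recall -- keep candidates above similarity threshold tau.
    pool = [(cosine(query_z, m.intent), m) for m in bank]
    pool = [(s, m) for s, m in pool if s >= tau]
    if not pool:
        return []
    # Normalize Q within the candidate pool so it is on a scale comparable
    # to cosine similarity (min-max here; the paper uses z-scores).
    qs = [m.q_value for _, m in pool]
    q_min, span = min(qs), (max(qs) - min(qs)) or 1.0
    # Phase B: value-aware selection via the weighted score
    # score = (1 - lam) * sim + lam * Q_norm.
    scored = [((1 - lam) * s + lam * (m.q_value - q_min) / span, m) for s, m in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [m for _, m in scored[:k]]
```

With λ = 0 this degenerates to plain similarity-based RAG; with λ = 1 it ranks purely by learned utility among semantically admissible candidates.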

Non-Parametric RL: Utility Estimation and Memory Update

Upon completion of an action (via the frozen LLM), the agent receives a scalar reward. For memories used in the context, Q-values are updated using a temporal-difference rule:

Q ← Q + α (r − Q)

This non-parametric, Monte Carlo-style update converges to the expected utility under stationary conditions, without modifying the LLM's weights or any other parametric component.
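A minimal sketch of the update, under the assumption (stated in the paper) that every memory injected into the context shares the episode's scalar reward; the helper names are illustrative:

```python
def update_q(q, reward, alpha=0.1):
    # Exponential-moving-average / TD-style step toward the observed return:
    # Q <- Q + alpha * (r - Q)
    return q + alpha * (reward - q)

def update_used_memories(q_values, used_ids, reward, alpha=0.1):
    # After an episode, nudge the utility of every memory that was
    # retrieved into the context, in place.
    for i in used_ids:
        q_values[i] = update_q(q_values[i], reward, alpha)
    return q_values
```

Because only these scalars change, the "learning" is entirely confined to the memory bank.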

Theoretical Analysis: Stability, Convergence, and Catastrophic Forgetting

  • Convergence: The exponential moving average update of Q converges in expectation to the true mean reward for each (intent, experience) pair, as shown through direct analysis of the error dynamics.
  • Stability: Under stationary task and frozen model assumptions, the variance of Q-values remains bounded even in the presence of reward noise, avoiding unbounded oscillations in memory utility estimation.
  • Global Stability: By modeling the interaction of retrieval policy and value estimation as a Generalized Expectation-Maximization process, the framework ensures that policy and memory utility estimates jointly converge to stationary points. This effectively prevents catastrophic forgetting, observed empirically as a lower regression rate from previously solved to failed tasks compared to other baselines.
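The convergence claim is easy to check empirically. The sketch below assumes a stationary Bernoulli reward with mean 0.7 (a toy stand-in for a fixed task success rate) and shows the EMA estimate settling near the true mean, with residual fluctuation bounded by the step size:

```python
import random

def simulate_convergence(true_mean=0.7, alpha=0.02, steps=5000, seed=0):
    # Under stationary Bernoulli rewards, the update Q <- Q + alpha*(r - Q)
    # tracks the true mean reward; the steady-state variance scales with alpha.
    rng = random.Random(seed)
    q = 0.0
    for _ in range(steps):
        r = 1.0 if rng.random() < true_mean else 0.0
        q += alpha * (r - q)
    return q
```

Larger α adapts faster but oscillates more, which is the practical face of the stability result above.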

Empirical Evaluation and Ablation

Benchmarks and Baselines

MEMRL was evaluated on diverse and challenging benchmarks:

  • BigCodeBench (code generation)
  • ALFWorld (embodied navigation and multi-step reasoning)
  • Lifelong Agent Bench (OS and DB interaction)
  • Humanity's Last Exam (HLE, complex multidisciplinary problem solving)

Baselines included standard RAG, reflection-augmented memory, procedural memory (MemP), and retrieval-based critiques (Self-RAG).

Key Results

MEMRL dominates all baselines across domains in both runtime learning and transfer scenarios. On ALFWorld, for example, MEMRL posts a last-epoch accuracy of 0.507, providing a 56% relative improvement over MemP and an 82% gain relative to agents without memory. Crucially, in sequential, exploration-heavy environments, MEMRL's value-aware retrieval acts as a trajectory verifier, with strong correlation (r = 0.861) between Q-estimates and actual task success rates. Even in low-similarity environments (HLE), MEMRL demonstrates significant performance gains by runtime memorization of specific solutions. Ablations confirm that optimal performance arises from a balanced trade-off between semantic matching and utility weighting (λ = 0.5); memory overcapacity or ignoring semantic similarity impairs robustness and stability.

Stability and Forgetting

MEMRL achieves the lowest empirical forgetting rate (fraction of tasks regressing from success to failure), demonstrating its theoretical guarantees in practice. Removing core design elements (Q-value normalization, similarity gate) leads to instability and increased forgetting.

Implications and Future Directions

MEMRL operationalizes value-based credit assignment within the retrieval process, transforming episodic memory from a passive, similarity-based knowledge base into an active, utility-aware substrate for agentic RL. The approach supports:

  • Continual, safe adaptation: Agents improve at deployment without risk to the integrity of their highly-tuned backbone via non-invasive memory updates.
  • Trajectory-verifying behavior: Particularly in sequential or multi-step domains, MEMRL's utility mechanism filters brittle or near-miss strategies, promoting transfer of robust procedures and corrective heuristics over superficially similar failures.
  • Task-structure awareness: Gains scale with structural repetitiveness, but MEMRL is also effective in low-similarity domains due to its capacity for on-the-fly memorization.
  • Generalizability: Learned utility mappings foster improved transfer to held-out tasks and domains without additional parameter updates.

Future research may explore hierarchical memory architectures, more sophisticated utility models (e.g., context-sensitive or compositional Q-functions), and integration with broader metacognitive and planning components. There is potential for MEMRL to underpin safe and interpretable test-time learning regimes and to make open-ended autonomous agents more robust.

Conclusion

MEMRL establishes a robust, theoretically-sound paradigm for self-evolving LLM agents by combining a utility-driven episodic memory system with a frozen, stable LLM core. It resolves core challenges in runtime continual learning and demonstrates marked, quantifiable gains over state-of-the-art baselines across a spectrum of complex tasks. The framework lays a foundation for further advances in non-parametric agentic learning, bridging RL and memory-augmented reasoning at scale (2601.03192).


Explain it Like I'm 14

Simple Explanation of the Paper: MEMRL — Teaching AI to Learn From Its Own Memories

What is this paper about?

This paper introduces a way for AI assistants to get better over time by learning from their past experiences—without changing their core “brain.” The method is called MEMRL. It lets an AI store what it tried before, how well it worked, and then use that information to make smarter choices next time.

Think of it like a student who keeps a notebook of solved problems with sticky notes that say “This trick works great!” or “This didn’t help.” The student’s basic knowledge stays the same, but their notebook—and the usefulness ratings on each page—keeps improving.

What questions are the researchers trying to answer?

The paper focuses on three big questions:

  • How can an AI improve as it’s being used (after it’s deployed), like a person learning from daily experiences?
  • How can it do that without constantly retraining its big LLM (which is slow, expensive, and can cause it to forget old skills)?
  • How can it avoid grabbing “similar but useless” memories and instead pick the ones that are actually helpful?

How does MEMRL work? (Methods in everyday language)

The core idea is to separate “stable thinking” from “changeable memory.”

  • Stable thinking: The big LLM itself stays frozen—its internal weights don’t change.
  • Changeable memory: The AI keeps a growing “memory bank” of past attempts, each stored with: 1) Intent: what the user wanted (a compact representation of the question/task), 2) Experience: what the AI tried (like a solution trail or steps taken), 3) Utility: a usefulness score (a number that acts like a star rating, often called a Q-value in reinforcement learning).

Here’s how the AI answers a new task:

  1. Two-Phase Retrieval:
    • Phase A: Find memories that are semantically similar (like searching for related notes).
    • Phase B: Re-rank those candidates by their learned usefulness (pick the ones with the best “star ratings,” not just the closest match in meaning).
  2. Generate an answer using the chosen memory context plus the frozen LLM.
  3. Learn from feedback:
    • If the answer works (e.g., test passes, task succeeds), increase the usefulness scores of the memories it used.
    • If it fails, decrease them.
    • It may also write a new summary of what happened and add it to memory with an initial score.

That “adjust the score based on outcome” step is reinforcement learning (RL). It’s “non-parametric,” which just means the AI learns by updating the memory’s usefulness scores, not by changing the LLM’s internal weights.
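The whole loop can be captured in a toy "notebook with star ratings," in the spirit of the analogy above. Everything here (the notebook format, the strategy names) is made up for illustration:

```python
# Each notebook entry maps a strategy to its current star rating.
notebook = {"use a loop": 0.0, "use recursion": 0.0}

def pick_strategy(notebook):
    # Prefer the strategy with the best rating so far.
    return max(notebook, key=notebook.get)

def after_attempt(notebook, strategy, success, alpha=0.3):
    # Nudge the rating up on success, down on failure --
    # the same Q <- Q + alpha*(r - Q) rule in miniature.
    reward = 1.0 if success else 0.0
    notebook[strategy] += alpha * (reward - notebook[strategy])
```

Run this for a while and the notebook, not the student, is what gets smarter.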

A quick analogy:

  • You’re choosing a strategy from your notebook to solve a new puzzle.
  • First you pick pages that look related (similar topic).
  • Then you prefer the ones you previously marked as most helpful (highest star rating).
  • After you try it, you adjust the stars—up for success, down for failure.
  • Over time, your notebook becomes a powerful guide, even though your basic knowledge doesn’t change.

What did they find?

Across several tough benchmarks, MEMRL beat other methods that either:

  • don’t use memory,
  • only use similarity-based retrieval (RAG), or
  • use memory with hand-crafted rules.

Key results (high level):

  • In exploration-heavy, multi-step tasks (like navigating virtual environments in ALFWorld), MEMRL’s advantage is the biggest. It not only solves more tasks at the end, but also solves more unique tasks at least once during training (it explores better and remembers what works).
  • In single-step tasks (like many code problems), it still helps, but the gap is smaller because there’s less complex structure to reuse.
  • The usefulness scores (Q-values) truly predict success: memories with higher Q-values led to better outcomes, showing the AI is learning to rank strategies by real utility, not just “looks similar.”
  • MEMRL reduces “forgetting”: as it learns new things, it doesn’t lose old effective strategies. That’s because it doesn’t change the LLM’s weights—only the memory scores—so the core skills stay stable.

Why is this important?

  • It learns at runtime: The AI improves while you use it—like learning from practice—without costly retraining.
  • It’s safer for long-term skills: Because the model’s core is frozen, it avoids “catastrophic forgetting” (where learning something new breaks old skills).
  • It’s smarter about memory: Instead of grabbing the most similar past example (which might be misleading), it picks what has proven useful in the past.
  • It transfers good habits: Useful strategies learned in one task can help in related tasks later, especially in multi-step problems.

What could this change in the future?

  • More reliable personal assistants and agents that get better with you: They can adapt to your tasks and tools over time without losing their base knowledge.
  • Lower costs to improve AI after deployment: No need for constant fine-tuning cycles.
  • Better performance in complex, step-by-step jobs: From software automation to virtual robotics, agents can learn “what actually works” and reuse it.

In short, MEMRL gives AI a practical way to self-improve like a person: keep a diary of experiences, rate which ones work, and use the best ones next time—all without rewriting its brain.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of unresolved issues that future work could address to strengthen, extend, or validate MEMRL.

  • Non-stationarity vs. theory: Stability and convergence analyses rely on a frozen inference policy and a stationary task/reward distribution; real-world deployments face drifting tools, evaluators, and user intents. How does MEMRL behave under non-stationary rewards, shifting task mixes, or periodic model/evaluator updates?
  • Reward design and noise: The framework primarily uses binary success/failure signals; it does not study graded, delayed, or noisy rewards. What reward shaping, denoising, or aggregation strategies yield reliable Q estimates across heterogeneous tasks and evaluators?
  • Credit assignment granularity: Utilities are updated at the level of whole retrieved memories. How to attribute utility to sub-trajectories, steps, or subskills (e.g., via eligibility traces, segment-level Q, or hierarchical credit assignment) to improve learning in long-horizon tasks?
  • Long-horizon and delayed returns: The runtime update uses a Monte Carlo terminal-style rule with no multi-step TD or discounting. Can multi-step returns, eligibility traces, or n-step TD improve sample efficiency and stability for lengthy, partially successful trajectories?
  • Exploration strategy for retrieval: Two-Phase Retrieval trades off similarity and Q but lacks explicit exploration (e.g., epsilon-greedy, UCB). What exploration policies over memory selection accelerate discovery while preventing premature exploitation?
  • Cold-start behavior: When the memory is empty or sparse, performance relies on the frozen LLM without explicit bootstrapping. What initialization schemes (synthetic experiences, seeding from offline logs) or exploration policies reduce cold-start inefficiency?
  • Memory growth, pruning, and deduplication: The paper appends summarized experiences to memory but does not specify scalable curation policies (eviction, merging, deduplication). How should Q-guided pruning, diversity constraints, or compression be designed to control memory bloat and keep retrieval latency bounded?
  • Scalability of retrieval: The approach assumes TopK recall with thresholds but does not analyze time/space complexity or latency on very large memories (e.g., millions of items). What indexing, ANN search, sharding, or multi-stage filtering is needed to scale?
  • Cross-task utility calibration: Q-values are compared within candidate pools via z-score normalization; cross-task comparability and calibration are not addressed. How to normalize or calibrate utilities so that Q is meaningfully comparable across intents and domains with different reward scales?
  • Negative transfer and interference: Similarity gating reduces, but does not eliminate, cross-domain contamination (e.g., high-Q memories retrieved across semantically close but incompatible tasks). How to detect and mitigate negative transfer (e.g., via task tags, causal gating, counterfactual tests)?
  • Robustness to poisoned or adversarial memories: There is no defense against malicious or spurious high-Q experiences (e.g., reward hacking, corrupted evaluators). What robust estimation, outlier detection, or trust scoring prevents memory poisoning?
  • Sensitivity to hyperparameters: Limited ablations cover λ (Q weight), k1/k2, and similarity thresholds; learning rate α, Q initialization, and thresholding strategies are not systematically studied. How sensitive is performance to these choices, and can adaptive/auto-tuning methods stabilize them?
  • Q initialization and bias: Qinit and its effect on early retrieval decisions are unspecified. How do different priors (e.g., optimism in the face of uncertainty) affect exploration-exploitation dynamics and convergence?
  • Context budget and composition: Injecting multiple memories competes with the token budget; ordering and formatting of retrieved items may affect LLM reasoning. What policies optimize selection, ordering, and formatting under tight context constraints?
  • Evaluation scope and baselines: Comparisons omit learned retrievers/indices and bandit/RL-for-retrieval baselines (e.g., contextual bandits, kNN-Q, offline RL rankers). How does MEMRL compare against strong learning-to-retrieve methods and parametric retrievers trained with RL?
  • Out-of-distribution (OOD) generalization: Transfer tests use held-out splits within the same benchmarks. Does a trained memory bank generalize to new domains/tools (cross-benchmark), and what mechanisms maintain performance under domain shift?
  • Sample efficiency and compute cost: The paper does not quantify wall-clock overheads (retrieval latency, summarization calls, memory updates) or sample efficiency vs. baselines. What is the runtime/compute trade-off to achieve the reported gains?
  • Summarization quality and provenance: Memory writebacks rely on LLM summaries without verification of factual or procedural fidelity. How to enforce correctness (e.g., verifiers, execution traces, provenance metadata) and prevent compounding hallucinations?
  • Partial observability and tool feedback: For tool-using agents, rewards and signals can be sparse or misleading (e.g., silent failures). How to incorporate richer signals (logs, intermediate tool states) into memory and utility updates?
  • Safety, privacy, and compliance: Storing episodic user interactions raises PII, consent, and retention concerns. What privacy-preserving memory mechanisms (redaction, differential privacy, encryption, per-user silos) can be integrated without degrading performance?
  • Multi-user and personalization: The framework does not address conflicts between heterogeneous user preferences or shared vs. personalized memories. How to partition or personalize memories and Q-values while enabling safe cross-user transfer?
  • Multi-modal and non-text settings: MEMRL is instantiated for textual intents; applicability to multi-modal inputs, physical robots, or continuous control remains untested. What adaptations (state embeddings, action abstractions) are required for such settings?
  • Stability claims beyond stationarity: The GEM-style monotonicity and Bellman-based arguments are outlined under idealized assumptions. Can formal guarantees be extended to the full two-phase retrieval loop with normalizations, changing candidate sets, and non-stationary retrieval distributions?
  • Handling empty candidate sets: When C(s) is empty, the agent falls back to the frozen LLM without mechanisms to create candidate experiences for similar future intents. How to opportunistically write experiences and bootstrap new intent clusters when recall fails?
  • Measuring catastrophic forgetting: The “forgetting rate” proxy is dataset-specific; broader measures (e.g., backward/forward transfer, retention under domain rotation) are not reported. Can comprehensive continual learning metrics validate stability across longer horizons and harder drifts?
  • Combining with lightweight parameter updates: The paper freezes the backbone. Could small, controlled parameter updates (e.g., LoRA adapters) complement MEMRL without incurring catastrophic forgetting, and under what regimes do hybrids outperform purely non-parametric methods?
  • Incomplete experiments and significance: Some results are reported before 10 epochs and without statistical significance tests. Do gains persist with full training, multiple seeds, and rigorous statistical analysis?
  • Failure analysis taxonomy: While near-miss “high-Q failures” are noted, a systematic taxonomy of failure types and their utility is missing. Can automatic detection of corrective heuristics in failures further improve retrieval and Q updates?

Practical Applications

Below is a concise mapping from the paper’s core ideas—runtime, non-parametric reinforcement learning on episodic memory with value-aware retrieval—to practical applications. Each item names sectors, potential tools/workflows, and key assumptions or dependencies that affect feasibility.

Immediate Applications

These can be piloted or deployed today using a frozen LLM, a vector store, and basic reward instrumentation.

  • Value-aware code assistant inside IDEs
    • Sectors: Software, Developer Tools
    • What it is: An IDE plugin that stores “intent–experience–utility” triplets for bug fixes, refactors, and internal-library usage, then retrieves past solutions by learned Q-value (test success, build pass, PR acceptance) rather than similarity alone.
    • Tools/products/workflows: “Value-aware RAG” extension for VS Code/JetBrains; CI/CD hooks to convert pass/fail and latency into rewards; memory dashboards to promote/demote snippets.
    • Assumptions/dependencies: Reliable test signals; access to repo/CI logs; privacy/governance for code artifacts; sensible λ, k1/k2 settings and similarity gating to prevent noisy retrieval.
  • Runbook copilot for IT/Ops and SRE
    • Sectors: IT Operations, DevOps/SRE, Databases
    • What it is: Terminal/ChatOps agent that retrieves OS/DB command sequences by utility (historical incident resolution, MTTR reduction), not just keyword match.
    • Tools/products/workflows: Slack/Teams bot; shell wrapper that logs exit codes, error rates, rollback events as rewards; utility-weighted runbook retrieval.
    • Assumptions/dependencies: Strong guardrails for destructive commands; incident tagging; stable reward shaping (e.g., success, time, error count).
  • Contact-center resolver that learns which answers work
    • Sectors: Customer Support, CX
    • What it is: Agent that ranks KB articles, macros, and workflows by learned utility (first-contact resolution, CSAT) across intents (issue types).
    • Tools/products/workflows: CRM integration (Zendesk, Salesforce); auto-logging ticket outcomes as rewards; continuous promotion of high-Q “playbooks.”
    • Assumptions/dependencies: Clean outcome labels (FCR, reopen rate); deflection accuracy; PII handling and retention policies.
  • SQL/BI assistant that optimizes by success metrics
    • Sectors: Data/Analytics, BI
    • What it is: Query copilot that retrieves example queries and data wrangling steps based on utility (valid result, performance, approval by reviewer).
    • Tools/products/workflows: BI plugins (Mode, Looker, PowerBI); reward from runtime success, execution time, cost; value-aware retrieval of prior queries/patterns.
    • Assumptions/dependencies: Safe sandboxes; performance telemetry; schema-change detection to avoid staleness.
  • RPA/playbook optimizer for repeated business processes
    • Sectors: Enterprise Automation, Back Office
    • What it is: Bot that chooses prior “near-miss” or successful steps by Q-values (success rate, exception volume), reducing brittle heuristics.
    • Tools/products/workflows: RPA suites (UiPath, Power Automate) with MEMRL memory; exception handling as negative reward; promotion of high-Q steps.
    • Assumptions/dependencies: Instrumented outcomes; process change alerts; oversight for compliance.
  • Personalized tutoring with value-aware example retrieval
    • Sectors: Education, EdTech
    • What it is: Tutor that prioritizes examples/explanations with high utility for a learner profile (measured by correctness on follow-up problems).
    • Tools/products/workflows: LMS plugin; rewards from mastery checks; student-level episodic memory with Q-values.
    • Assumptions/dependencies: Fairness/consent; privacy of student data; offline evaluation for curriculum drift.
  • Enterprise knowledge assistant with utility curation
    • Sectors: Knowledge Management, HR/Legal/Finance Ops
    • What it is: Assistant that re-ranks policies, SOPs, and internal memos by learned utility (document leads to correct completion with fewer iterations).
    • Tools/products/workflows: Confluence/SharePoint connectors; rewards from task completion and reviewer acceptance; “trajectory verifier” that penalizes superficially correct but brittle guidance.
    • Assumptions/dependencies: Clear completion criteria; versioning to prevent outdated high-Q items; audit trails.
  • SecOps playbook selector
    • Sectors: Cybersecurity
    • What it is: Triage assistant that retrieves incident-response steps by utility (containment success, dwell-time reduction) across similar alerts.
    • Tools/products/workflows: SOAR/SIEM integration; rewards from incident outcomes; similarity gating to avoid context detachment.
    • Assumptions/dependencies: Safety constraints; red-teaming; noisy labels and base-rate fallacies must be mitigated.
  • Personal productivity agent for repetitive tasks
    • Sectors: Daily Life, Productivity
    • What it is: Agent that learns which email drafts, calendar workflows, or text templates achieve higher response/acceptance rates for an individual.
    • Tools/products/workflows: Email/calendar integration; reward from reply rate/meeting acceptance; memory pruning and normalization.
    • Assumptions/dependencies: Consent and privacy; careful reward design to avoid manipulative content.
  • Research tool for runtime learning on memory
    • Sectors: Academia, Open-Source
    • What it is: A reference implementation/library (“MEMRL-lite”) to study stability-plasticity with frozen LLMs, shareable benchmarks and ablations.
    • Tools/products/workflows: Python SDK over vector DBs; plug-in verifiers; logging and evaluation suites; experiment tracking.
    • Assumptions/dependencies: Reproducible rewards/evaluators; dataset licenses; strong baselines for comparison.

Long-Term Applications

These require additional research, scaling, verification, and/or regulatory alignment before broad deployment.

  • Cross-domain, self-evolving enterprise agents
    • Sectors: Horizontal enterprise platform
    • What it is: Unified MEMRL layer that accumulates and shares high-utility experiences across departments (IT, Finance, Legal, Sales) with role-based access and safety filters.
    • Tools/products/workflows: Federated memory stores; utility-aware governance; cross-domain reward schemas.
    • Assumptions/dependencies: Robust privacy/tenancy; domain drift detection; policy-compliant sharing.
  • Real-world robotics with value-aware skill libraries
    • Sectors: Robotics, Manufacturing, Warehousing, Home
    • What it is: Robots that retrieve manipulation/navigation routines by Q-values learned from execution success across environments.
    • Tools/products/workflows: Skill memory with sensory embeddings; verifiers for safety and success; sim-to-real transfer pipelines.
    • Assumptions/dependencies: High-fidelity reward/evaluation; safety certification; handling non-stationary dynamics.
  • Clinical documentation and decision support with utility curation
    • Sectors: Healthcare
    • What it is: Assistants that rank templates, order sets, and care-path prompts by utility (reduced corrections, improved adherence/outcomes).
    • Tools/products/workflows: EHR integration; prospective trials; clinician-in-the-loop rewards.
    • Assumptions/dependencies: Regulatory approval; bias/harms analysis; rigorous validation and auditability.
  • Autonomous codebase maintenance/refactoring agents
    • Sectors: Software, DevTools
    • What it is: Long-running agents that learn high-utility upgrade/refactor patterns (e.g., framework migration) without finetuning the model.
    • Tools/products/workflows: Repo-scale memory with Q-values; rollout with canary branches; verifier and rollback tooling.
    • Assumptions/dependencies: Strong static/dynamic analyzers; safety checks; governance for large-scale changes.
  • Financial analysis and strategy copilot
    • Sectors: Finance
    • What it is: Utility-driven retrieval of analyses and playbooks based on backtest/live performance, risk-adjusted returns, and compliance outcomes.
    • Tools/products/workflows: Backtesting harness as reward engine; compliance guardrails; value-aware RAG over research notes.
    • Assumptions/dependencies: Market non-stationarity; strict regulation; causal validation to avoid overfitting to noise.
  • Grid and industrial operations assistants
    • Sectors: Energy, Industrial Control Systems
    • What it is: Utility-weighted procedures for grid balancing, maintenance, or plant operations that reflect historical success and safety.
    • Tools/products/workflows: Digital twins for safe reward learning; ICS connectors; anomaly and drift detection.
    • Assumptions/dependencies: Safety-critical verification; certification regimes; robust fallback to human control.
  • Government digital service chatbots that improve safely at runtime
    • Sectors: Public Sector, Policy
    • What it is: Citizen-facing agents that refine retrieval of forms and procedures by utility (issue resolution, reduced callbacks) while keeping a frozen backbone for stability and audit.
    • Tools/products/workflows: Auditable memory updates; public-interest KPIs as rewards; retention/minimization policies.
    • Assumptions/dependencies: Accessibility, fairness constraints; legal mandates for transparency and appeal.
  • Lifelong personal assistants with privacy-preserving episodic memory
    • Sectors: Consumer tech
    • What it is: Devices/services that accumulate a user’s episodic memories with Q-values across contexts (home, car, mobile), learning what works for that user.
    • Tools/products/workflows: On-device or federated learning over memory-only updates; encrypted vector stores; opt-in consent flows.
    • Assumptions/dependencies: Strong privacy; cross-device identity; drift and aging of memories.
  • Scientific workflow and experiment planning copilot
    • Sectors: R&D, Pharma, Materials
    • What it is: Assistant that ranks protocols and analysis pipelines by utility (replication success, yield) across labs and instruments.
    • Tools/products/workflows: ELN/LIMS integration; instrument telemetry as rewards; provenance and versioning in memory triplets.
    • Assumptions/dependencies: Standardized metadata; reproducibility; IP governance across collaborators.
  • Multi-agent memory federation and marketplaces
    • Sectors: Platform ecosystems
    • What it is: Systems where agents share or trade high-utility memories across tasks/organizations under policy constraints.
    • Tools/products/workflows: Utility calibration across domains; memory privacy budgets; reputation systems for shared memories.
    • Assumptions/dependencies: Interop standards; risk of leakage or misgeneralization; incentive design.

Notes on feasibility across applications:

  • Reward signal design is pivotal: stable, meaningful, and low-latency feedback is required to avoid reinforcing spurious strategies. In safety-critical domains, verifiers and human oversight are non-negotiable.
  • Non-stationarity (concept drift) must be detected; otherwise Q-values can become stale or harmful. Incorporate drift detectors, time decay, and revalidation.
  • The base LLM must be sufficiently capable; MEMRL optimizes retrieval, not generation quality. Poor base models limit upside.
  • Memory governance matters: privacy (PII/PHI), access controls, retention, and audit logs are essential, especially when failures may still earn high utility as “near-miss” lessons.
  • Stability enhancements from the paper (z-score normalization, similarity gating, balanced λ, compact k1/k2) should be treated as product defaults to minimize forgetting and noise.
  • MLOps/LLMOps integration (observability, A/B tests, rollbacks) is required to safely operate runtime-learning systems in production.
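The retrieval defaults listed above (similarity gating, z-score normalization, a balanced λ blend, compact k1/k2 candidate sets) can be sketched as a single routine. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, parameter names (`lam`, `delta`, `k1`, `k2`), and the use of cosine similarity are all choices made here for clarity.

```python
# Hypothetical sketch of value-aware two-phase retrieval with the
# "product default" stabilizers: a similarity gate (delta), z-score
# normalization within the candidate pool, and a balanced lambda blend
# of similarity and learned Q-values over compact k1/k2 sets.
import numpy as np

def zscore(x):
    """Standardize values within the candidate pool (zero mean, unit variance)."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def two_phase_retrieve(query_emb, mem_embs, q_values, k1=20, k2=3,
                       lam=0.5, delta=0.3):
    # Phase A: semantic recall -- cosine similarity, top-k1, gated by delta.
    sims = mem_embs @ query_emb / (
        np.linalg.norm(mem_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    cand = np.argsort(sims)[::-1][:k1]
    cand = cand[sims[cand] >= delta]          # similarity gating
    if cand.size == 0:
        return cand
    # Phase B: value-aware re-ranking -- blend normalized similarity and Q.
    score = lam * zscore(sims[cand]) + (1 - lam) * zscore(q_values[cand])
    return cand[np.argsort(score)[::-1][:k2]]
```

With λ below 0.5 the blend favors learned utility, so of two near-identical memories the one with the higher Q-value is surfaced; raising δ shrinks the candidate pool, trading recall for noise suppression.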

Glossary

  • Agentic memory: A class of external memory systems that enable agents to store, organize, and govern experiences for future use during interaction. "Early agentic memory introduced reflection mechanisms and hierarchical management to handle long context"
  • Analogical transfer: The process of applying knowledge from semantically similar past situations to a new problem. "Phase-A operationalizes analogical transfer (Gick & Holyoak, 1983) by recalling semantically similar past events."
  • Bellman backup: A value update operation that adjusts an estimate toward a target based on reward and discounted future value. "Utility-Driven Update refines these Q-values via environmental feedback and Bellman backup (Bellman, 1966)."
  • Bellman contraction: The property that value iteration shrinks the error between estimated and optimal values under the Bellman operator. "theoretically substantiate its stability via Bellman contraction, exploring how utility-driven updates minimize catastrophic forgetting while maximizing positive transfer."
  • Catastrophic forgetting: The tendency of a model to lose previously learned knowledge when updated on new data. "without the computational cost or catastrophic forgetting risks associated with weight updates."
  • Constructive Episodic Simulation: A cognitive mechanism where past experiences are recombined to imagine or plan solutions for new tasks. "a mechanism known as Constructive Episodic Simulation that allows adaptation without rewiring neural circuitry"
  • Coordinate ascent: An optimization method that alternately optimizes subsets of variables to increase an objective. "From a variational perspective, the system performs coordinate ascent on a global objective function J(Q, p) (the variational lower bound of expected reward)"
  • Cumulative Success Rate (CSR): The proportion of tasks solved at least once across the entire training process. "CSR indicates the percentage of tasks solved at least once during the training process."
  • Exponential moving average: A running average that weights recent observations more than older ones. "MEMRL updates its utility using the exponential moving average rule as formulated in Eq. 8"
  • Generalized Expectation-Maximization (GEM): A variant of EM that ensures monotonic improvement without fully maximizing each step. "we analyze MEMRL as a Generalized Expectation-Maximization (GEM) process (Dempster et al., 1977; Neal & Hinton, 1998)."
  • Intent-Experience-Utility triplet: A memory structure consisting of an intent embedding, the associated experience, and its learned utility score. "MEMRL organizes memory into a structured Intent-Experience-Utility triplet."
  • Markov Decision Process (MDP): A formal framework to model decision-making with states, actions, transitions, and rewards under the Markov property. "we formalize the interaction between the frozen LLM and external memory as a Markov Decision Process (MDP) (Puterman, 2014)."
  • Memory-Based Markov Decision Process (M-MDP): An MDP formulation that explicitly includes an evolving external memory component. "We adopt the formulation of Memory-Based Markov Decision Process (M-MDP) (Zhou et al., 2025)"
  • Memory reconsolidation: The process by which retrieved memories are updated and re-stored based on new outcomes. "Eq. 8 implements a form of memory reconsolidation (Haubrich & Nader, 2016)"
  • Monte Carlo integrator: An estimator that approximates an expectation by averaging sampled outcomes. "Q(m) acts as a Monte Carlo integrator striving to converge to:"
  • Monte Carlo style rule: A value update using sample returns without bootstrapping from future estimates. "with a Monte Carlo style rule (Metropolis & Ulam, 1949)"
  • Non-parametric reinforcement learning: RL methods that learn policies or values without updating model parameters (e.g., by updating external memory). "We introduce MEMRL, a non-parametric reinforcement learning algorithm"
  • Positive transfer: Improvement on new tasks due to beneficial reuse of knowledge learned from related tasks. "In high-similarity benchmarks, MEMRL succeeds via Positive Transfer-generalizing shared patterns to new instances."
  • Q-value: The expected return (utility) of taking a given action (here, retrieving a memory) in a particular state (intent). "selects them based on learned Q-values (utility)."
  • Retrieval-Augmented Generation (RAG): A method that augments model inputs with retrieved documents or memories based on semantic similarity. "Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) offers a non-parametric alternative"
  • Runtime Continuous Learning: The capability of an agent to improve performance post-deployment through ongoing interaction, without modifying core weights. "referred to as Runtime Continuous Learning (Javed et al., 2023; Silver & Sutton, 2025; Parisi et al., 2019; Wu et al., 2024)"
  • Sparsity threshold: A cutoff on similarity used to filter candidate memories to a compact set. "where δ is a sparsity threshold."
  • Stability-plasticity dilemma: The challenge of balancing retention of prior knowledge (stability) with the ability to learn new information (plasticity). "Continual learning addresses the stability-plasticity dilemma, aiming to acquire new knowledge sequentially without suffering from catastrophic forgetting."
  • Stationary distribution: A time-invariant probability distribution over tasks or states assumed during analysis. "Tasks s are drawn from a stationary distribution over a fixed dataset."
  • Temporal-Difference (TD) error: The difference between current value estimates and target estimates incorporating immediate reward and next-step value. "using a Temporal-Difference (TD) error (Sutton, 1988)"
  • Trajectory Verifier: A role of the system in evaluating the overall validity of multi-step solutions, not just initial matches. "This analysis indicates that MEMRL transcends the role of a simple retrieval enhancer to function as a Trajectory Verifier."
  • Two-Phase Retrieval: A retrieval strategy that first recalls candidates by semantic similarity, then re-ranks them by learned utility. "MEMRL employs a Two-Phase Retrieval mechanism"
  • Utility-Driven Update: A memory update procedure that adjusts stored utility values based on observed rewards. "Utility-Driven Update refines these Q-values via environmental feedback and Bellman backup"
  • Value-Aware Retrieval: Selecting memories for context based on their estimated utility rather than similarity alone. "Value-Aware Retrieval selects experiences based on their learned Q-values, reflecting expected utility, rather than semantic similarity alone"
  • Variational lower bound: A tractable bound optimized to indirectly maximize an intractable objective (here, expected reward). "a global objective function J(Q, p) (the variational lower bound of expected reward)"
  • z-score normalization: Standardization that rescales values to zero mean and unit variance within a set. "where ~ denotes z-score normalization within the candidate pool,"
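Several of the entries above (exponential moving average, Monte Carlo style rule, memory reconsolidation, utility-driven update) describe the same mechanic: nudging a stored Q-value toward observed reward. A minimal sketch, assuming a scalar reward and a fixed learning rate `alpha`; the paper's Eq. 8 may parameterize this differently.

```python
# Illustrative EMA / Monte Carlo style utility update: move the stored
# Q-value a step of size alpha toward the observed reward, so recent
# outcomes weigh more than older ones. alpha is an assumed constant.
def update_utility(q_old, reward, alpha=0.1):
    """Return the reconsolidated utility after one environmental feedback."""
    return q_old + alpha * (reward - q_old)
```

Repeated successes drive a memory's utility toward 1, while a run of failures decays it, which is how the retrieval phase learns to separate high-value strategies from semantically similar noise.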

Open Problems

We found no open problems mentioned in this paper.
