MemSkill: Evolving Memory Management in LLM Agents

Updated 7 February 2026
  • MemSkill is a self-evolving, skill-based paradigm that replaces rigid memory procedures with learnable, reusable routines in LLM agents.
  • It deploys a modular architecture featuring a controller, executor, and designer, enabling dynamic skill selection and continuous skill refinement.
  • Experimental evaluations show significant improvements over baselines in long-context and embodied reasoning tasks, affirming its practical impact.

MemSkill defines a self-evolving, skill-based paradigm for memory management in LLM agents, replacing rigid, hand-coded memory procedures with a bank of structured, learnable, and evolvable memory skills. Each memory skill encapsulates a reusable behavioral routine that the agent can invoke to extract, consolidate, or prune information from its interaction trace. MemSkill’s architecture integrates a controller that dynamically selects a small, relevant subset of skills for each context, a skill-conditioned executor that carries out memory edits, and a designer that continually refines and evolves the skill repository in response to hard cases. This adaptive framework enables robust, flexible, and portable memory workflows across diverse interaction patterns and downstream tasks, demonstrated by systematic improvements over baseline methods on long-context and embodied reasoning benchmarks (Zhang et al., 2 Feb 2026).

1. Definition and Motivation

A memory skill in the MemSkill framework is a parameterized, compositional, and reusable routine guiding LLM agents in memory extraction, consolidation, or pruning. Each skill features:

  • A concise textual description for indexing and retrieval
  • A structured instruction template specifying “Purpose,” “When to use,” “How to apply,” and “Constraints”
  • Typed actions: INSERT, UPDATE, DELETE, or NOOP

Traditional LLM memory approaches rely on static, hand-designed pipelines or heuristics, constraining them to fixed behaviors that degrade under diverse or evolving task requirements. MemSkill reframes memory actions as parameterized routines learnable from data, supporting span-level flexibility (arbitrary granularity) and compositional memory construction (joint skill selection per span), with the goal of minimizing human priors and enabling broad generalization across settings (Zhang et al., 2 Feb 2026).
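The skill schema described above can be sketched as a small data structure. This is a minimal Python sketch; the field names and the `to_prompt` helper are illustrative choices, not the paper's API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MemoryAction(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


@dataclass
class MemorySkill:
    name: str
    description: str    # concise text used for indexing and retrieval
    purpose: str        # "Purpose" field of the instruction template
    when_to_use: str    # "When to use"
    how_to_apply: str   # "How to apply"
    constraints: str    # "Constraints"
    allowed_actions: List[MemoryAction]

    def to_prompt(self) -> str:
        """Render the structured instruction template as prompt text."""
        return (
            f"Skill: {self.name}\n"
            f"Purpose: {self.purpose}\n"
            f"When to use: {self.when_to_use}\n"
            f"How to apply: {self.how_to_apply}\n"
            f"Constraints: {self.constraints}\n"
        )
```

A bank of such objects can then be embedded by their `description` fields for retrieval, with `to_prompt` output passed to the executor.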

2. System Architecture

MemSkill’s agent memory pipeline is decomposed into three coordinated modules: a controller, an executor, and a designer.

  1. Controller: Receives the current span of interaction, contextually embeds it with retrieved memory items, and selects a subset of K skills from the current skill bank via an embedding-based Top-K policy. Given span x_t, memory items M_t, and skill bank S_t, the context and skill descriptions are encoded by a shared encoder; a softmax over compatibility scores defines the skill-selection distribution, and Top-K sampling is performed without replacement using the Gumbel-Top-k trick [(Zhang et al., 2 Feb 2026), §2.1].
  2. Executor: Provided with the span, pertinent memories, and selected skills, prompts the LLM to generate structured action blocks specifying INSERT, UPDATE, or DELETE operations. The executor’s output directly mutates the agent’s memory bank, enabling dynamic, skill-conditioned memory management [(Zhang et al., 2 Feb 2026), §2.2].
  3. Designer: At fixed intervals, evaluates system failures accumulated in a sliding buffer, clusters them by similarity, and prompts an LLM to analyze root causes (storage, quality, retrieval errors). Based on this analysis, the designer proposes refinements to existing skills or the addition of new skill templates, rolling back changes if tail reward over the most recent training steps fails to improve performance [(Zhang et al., 2 Feb 2026), §2.3].
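The controller's sampling step can be illustrated with the standard Gumbel-Top-k trick. The NumPy sketch below is illustrative only: the dot-product compatibility score and the function names are assumptions, not the paper's implementation.

```python
import numpy as np


def gumbel_top_k(scores: np.ndarray, k: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Sample k distinct indices without replacement.

    Adding i.i.d. Gumbel(0, 1) noise to the scores and taking the
    top-k indices is equivalent to sequentially sampling without
    replacement from softmax(scores) (the Gumbel-Top-k trick).
    """
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return np.argsort(scores + gumbel)[::-1][:k]


def select_skills(span_emb: np.ndarray, skill_embs: np.ndarray,
                  k: int, rng: np.random.Generator) -> np.ndarray:
    """Compatibility scores as dot products between the contextual span
    embedding and each skill-description embedding, then Gumbel-Top-k."""
    scores = skill_embs @ span_emb
    return gumbel_top_k(scores, k, rng)
```

Because the noise is added once and a single sort suffices, the selection remains cheap even as the skill bank grows.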

The table below summarizes the three modules and their primary inputs/outputs:

| Module | Input(s) | Output(s) |
| --- | --- | --- |
| Controller | span, memories, skill bank | K selected skills |
| Executor | span, memories, selected skills | concrete memory actions |
| Designer | failure buffer, skill bank | skill refinements/additions |

3. Learning and Skill Evolution

MemSkill employs a closed-loop, two-phase optimization process. The agent alternates between controller optimization (via PPO on the current skill bank) and skill bank evolution via the designer.

During controller optimization, each episode processes interaction spans, selects skills per span, updates memory, and accumulates a reward (e.g., F1, success rate) at episode conclusion. Failures are logged for designer analysis. The controller objective applies PPO with a clipped loss, utilizing the Top-K selection’s joint log-probability [(Zhang et al., 2 Feb 2026), §3.2].
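The clipped surrogate over the Top-K selection's joint log-probability can be sketched as follows. This is illustrative only; the advantage estimator and the clipping range are assumptions, not values from the paper:

```python
import numpy as np


def ppo_clip_loss(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for one Top-K skill selection.

    logp_new / logp_old: joint log-probability of the selected skill
    subset under the current / behavior controller.
    advantage: episode reward minus a baseline.
    Returns the scalar loss to minimize.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (min) bound limits how far one update can move the policy.
    return -np.minimum(ratio * advantage, clipped * advantage)
```

When `logp_new == logp_old` the ratio is 1 and the loss reduces to the negative advantage; large ratio excursions are capped at `1 ± eps`.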

During evolution, the designer clusters failure cases (based on a difficulty score incorporating reward and failure frequency), then prompts an LLM to recommend new or refined skill templates. If new skills do not yield an improved stabilized cycle reward R̄_tail, the system reverts the skill bank to the most successful previous snapshot [(Zhang et al., 2 Feb 2026), §3.1–3.3].
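The snapshot-and-rollback step can be sketched as a small bookkeeping class. The `SkillBank` interface below is hypothetical; the paper does not specify this API:

```python
from typing import List


class SkillBank:
    """Holds skill templates; reverts designer updates that do not
    improve the stabilized cycle (tail) reward."""

    def __init__(self, skills: List[str]):
        self.skills = list(skills)
        self._snapshot = list(skills)
        self._best_tail = float("-inf")

    def apply_designer_update(self, new_skills: List[str]) -> None:
        """Snapshot the current bank, then install the proposal."""
        self._snapshot = list(self.skills)
        self.skills = list(new_skills)

    def evaluate_cycle(self, tail_reward: float) -> bool:
        """Keep the new bank if the tail reward improved on the best
        previous cycle; otherwise roll back to the snapshot."""
        if tail_reward > self._best_tail:
            self._best_tail = tail_reward
            return True
        self.skills = list(self._snapshot)
        return False
```

Keeping only the best snapshot makes the evolution loop monotone in tail reward: a bad proposal costs one cycle but cannot permanently degrade the bank.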

This ongoing process yields a skill bank that adapts to encountered edge cases, operational shifts, and broader context diversity, providing resistance to overfitting to hand-crafted routines.

4. Memory Management Workflow

Formally, the memory bank M_t after processing span t is updated via a deterministic function of the prior memory, selected skills, and executor output: M_{t+1} = f(M_t, A_t, o_t), where A_t is the ordered set of skills selected by the controller and o_t is the executor’s structured memory actions (INSERT/UPDATE/DELETE).

A representative workflow, as detailed in [(Zhang et al., 2 Feb 2026), §4.2], might involve a sequence such as:

  • Input: “On 2026-03-04, Alice scheduled a meeting at 2 pm.” — Skills selected: {Capture Temporal Context, INSERT}; memory: add “Alice meeting: 2026-03-04 14:00.”
  • Input: “She later moved it to 3 pm.” — Skills: {Refine Temporal Details, UPDATE}; memory: update to “Alice meeting: 2026-03-04 15:00.”
  • Input: “The meeting was canceled.” — Skills: {Delete Invalid Memory}; memory: delete the meeting record.

This example illustrates compositional multi-skill selection per span and structured, interpretable memory evolution.
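Assuming a key-value memory bank and executor actions rendered as dictionaries (both illustrative simplifications, not the paper's data format), the update M_{t+1} = f(M_t, A_t, o_t) and the workflow above can be sketched as:

```python
from typing import Dict, List


def apply_actions(memory: Dict[str, str],
                  actions: List[dict]) -> Dict[str, str]:
    """Deterministic update M_{t+1} = f(M_t, A_t, o_t): applies the
    executor's structured action blocks to a key -> value memory bank."""
    mem = dict(memory)  # keep the prior memory immutable
    for act in actions:
        op = act["op"]
        if op in ("INSERT", "UPDATE"):
            mem[act["key"]] = act["value"]
        elif op == "DELETE":
            mem.pop(act["key"], None)
        # "NOOP": leave the memory unchanged
    return mem


# The three-span meeting example, with executor outputs hand-written:
mem = apply_actions({}, [{"op": "INSERT", "key": "alice_meeting",
                          "value": "2026-03-04 14:00"}])
mem = apply_actions(mem, [{"op": "UPDATE", "key": "alice_meeting",
                           "value": "2026-03-04 15:00"}])
mem = apply_actions(mem, [{"op": "DELETE", "key": "alice_meeting"}])
assert mem == {}
```

Each span thus produces an auditable diff against the memory bank, which is what makes the resulting memory evolution interpretable.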

5. Experimental Evaluation and Results

MemSkill was evaluated on four benchmarks: LoCoMo (long dialogues), LongMemEval (100K-token chats), HotpotQA (documents), and ALFWorld (embodied tasks). Metrics included F1, LLM-judge (L-J), and (for embodied agents) success rate (SR).

Key findings [(Zhang et al., 2 Feb 2026), Table 1]:

  • On the LLaMA-3.3B backbone:
    • LoCoMo L-J: MemSkill 50.96 vs. best baseline 46.34 (+4.6)
    • LongMemEval L-J: 59.41 vs. 45.54 (+13.9)
    • ALFWorld Seen SR: 47.86 vs. 40.51 (+7.4)
    • ALFWorld Unseen SR: 47.01 vs. 38.83 (+8.2)
  • Skill portability is demonstrated via Qwen-80B transfer: MemSkill retains gains without retraining.

Ablation studies confirm substantial performance drops without the controller (random skill selection: 5–11 point L-J drop) or without the designer (static skills: additional 6–17 point drop). Restricting evolution to refinement-only (no new skills) yields partial improvement but remains 4–6 points below the default [(Zhang et al., 2 Feb 2026), Table 2].

6. Analysis, Skill Evolution, and Generalization

Skill evolution in MemSkill reveals domain-specialization. For LoCoMo (dialogue), skills emphasize temporal and entity structuring (e.g., “Capture Temporal Context”), while for ALFWorld (embodied), skills address action constraints and object tracking (e.g., “Capture Action Constraints”). These emergent skills demonstrate broad generalization:

  • Portability across base LLMs (LLaMA/Qwen)
  • Robustness from dialogue to document and embodied task settings
  • Zero-shot transfer between benchmarks (e.g., LoCoMo→LongMemEval: MemSkill L-J 59.41 vs. baseline 45.54)

Limitations include reliance on LLM prompting for designer skill generation (potential for misdiagnosis), and training overhead from repeated LLM calls. Future research directions include skill merging, differential weighting, and hierarchical controller schemes [(Zhang et al., 2 Feb 2026), §6.3].

7. Relationship to MeKi and Implications for MemSkill Architectures

MeKi, a memory-based expert knowledge injection framework, informs potential designs for MemSkill by decoupling model capacity from runtime compute. In MeKi, per-skill/token expert vectors are stored in ROM, and injected at each layer via minimal gating mechanisms, yielding near-zero inference overhead regardless of skill-bank size (Ding et al., 3 Feb 2026).

This storage-compute decoupling suggests that a MemSkill system can generalize beyond token-indexed skills to banks keyed by n-grams, semantic classes, or task IDs. Runtime routing and gating can remain lightweight, with all training-time projections re-parameterized into static lookup tables and skill banks. Such a scheme would support parallel, selective “skill injection” at multiple granularity levels, with robust capacity scaling for on-device or resource-limited deployments [(Ding et al., 3 Feb 2026), §6].
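The storage-compute decoupling can be illustrated with a per-token table lookup plus a scalar gate. This is a NumPy sketch under assumed shapes; MeKi's actual gating mechanism and keying scheme may differ:

```python
import numpy as np


def inject_expert(hidden: np.ndarray, token_ids: np.ndarray,
                  expert_table: np.ndarray,
                  gate_w: np.ndarray) -> np.ndarray:
    """Inject precomputed expert vectors into layer activations.

    hidden: (seq, d) layer activations.
    token_ids: (seq,) keys into the static expert table; these could
        equally be n-gram, semantic-class, or task IDs.
    expert_table: (vocab, d) precomputed expert vectors (static storage).
    gate_w: (d,) gating vector producing one scalar gate per position.

    Runtime cost per token is one lookup and one dot product,
    independent of the size of the expert table.
    """
    experts = expert_table[token_ids]                  # (seq, d) lookup
    gate = 1.0 / (1.0 + np.exp(-(hidden @ gate_w)))    # (seq,) sigmoid gates
    return hidden + gate[:, None] * experts
```

Because the table is indexed rather than mixed, growing the skill bank enlarges only static storage, not the per-token compute, which is the property the deployment argument above relies on.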

A plausible implication is that large-scale, compositional MemSkill systems can scale the diversity and sophistication of skill routines without incurring prohibitive inference-time compute, provided appropriate static storage and compact gating interfaces are adopted. This unifies data-centric, evolvable skill engineering with efficient, hardware-constrained deployment.
