
MemSkill Architecture: Adaptive Memory for LLMs

Updated 8 February 2026
  • MemSkill is a self-evolving memory-management architecture that models memory manipulation as a set of dynamic, natural-language skills.
  • It integrates a controller, executor, and designer in a closed-loop system to adaptively select, apply, and evolve memory update strategies.
  • Empirical evaluations demonstrate that MemSkill, trained with reinforcement learning, outperforms static baselines on benchmarks such as LoCoMo, LongMemEval, and ALFWorld.

MemSkill is a self-evolving memory-management architecture for LLM agents, designed to eliminate the rigidity and inefficiency of static, hand-crafted memory operations. Instead of fixed procedures, MemSkill models memory manipulation as a set of structured, evolvable "memory skills," each represented by natural-language templates and selected dynamically according to context. At its core, the architecture is governed by a closed loop in which a controller learns to select relevant skills, an executor uses these skills to alter the agent’s memory, and a designer evolves the skill set by addressing recurring failures. This system jointly optimizes both the memory update policy and the evolving repertoire of skills, enabling adaptive and generalizable memory management across a variety of agent settings and interaction regimes (Zhang et al., 2 Feb 2026).

1. Design Objectives and Foundational Paradigms

MemSkill addresses three principal design goals:

  • Minimization of human priors: Memory management behaviors emerge from learning on agent data, not from hand-designed rules or heuristics.
  • Flexible extraction granularity: Skills are applicable to arbitrary text spans, rather than operating solely at the per-turn level.
  • Compositional, context-sensitive memory construction: In each generation, a small, context-dependent subset of skills is selected and composed by the controller for application by the executor.

MemSkill maintains two persistent structures:

| Store type | Granularity | Role |
| --- | --- | --- |
| Memory bank | Per-interaction/trace | Extracts, consolidates, and prunes facts for the current trajectory |
| Skill bank | Shared across traces | Stores and evolves reusable natural-language skill templates |

These structures are dynamically updated through alternating phases of skill-use learning (controller and executor) and skill evolution (designer).

2. Core System Components

MemSkill consists of three principal modules: controller, executor, and designer, together orchestrating a continual process of skill learning and evolution.

2.1 Skill Bank

  • Initialization: Contains four primitive skills: INSERT, UPDATE, DELETE, SKIP.
  • Skill Representation: Each skill $s \in S$ encapsulates:
    • A short selection-oriented description.
    • A detailed, structured template describing its purpose, conditions for invocation, operational instructions, constraints, and action type.
  • Evolution: Over time, new skills are added or existing ones refined according to designer feedback emerging from empirical agent failures.
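The two-part skill representation can be pictured as a small record plus a rendering step; the field names and helper below are illustrative stand-ins, not the paper's exact schema:

```python
# Hypothetical sketch of one skill-bank entry. Field names ("purpose",
# "when_to_use", ...) are illustrative; the paper only specifies that each
# skill has a short selection description and a detailed structured template.
UPDATE_SKILL = {
    "name": "UPDATE",
    "description": "Revise an existing memory item when new information supersedes it.",
    "template": {
        "purpose": "Keep stored facts consistent with the latest evidence in the span.",
        "when_to_use": "The span contradicts or refines an item already in memory.",
        "instructions": "Identify the conflicting memory index, then rewrite that item.",
        "constraints": "Modify at most one item per action; preserve unrelated facts.",
        "action_type": "UPDATE",
    },
}

def render_skill_prompt(skill: dict) -> str:
    """Flatten a skill template into the text block handed to the executor LLM."""
    t = skill["template"]
    lines = [f"Skill: {skill['name']} - {skill['description']}"]
    lines += [f"  {k}: {v}" for k, v in t.items()]
    return "\n".join(lines)
```

The short `description` is what the controller embeds for selection, while the full template is what the executor sees at generation time.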

2.2 Controller (Skill-Selection Policy)

  • Context Encoding: Each processing step $t$ embeds the current input span $x_t$ and the corresponding retrieved memories $M_t$ into a shared vector space: $h_t = f_{\rm ctx}(x_t, M_t)$.
  • Skill Embedding: Skill descriptions $\mathrm{desc}(s_i)$ are embedded as $u_i = f_{\rm skill}(\mathrm{desc}(s_i))$, maintaining compatibility with a dynamically changing skill bank.
  • Skill Scoring and Selection: Compute selection scores and sample an ordered Top-$K$ skill set $A_t$ using a Gumbel-Top-$K$ strategy, with probabilities:

$$p_\theta(i \mid h_t) = \frac{\exp(z_{t,i})}{\sum_j \exp(z_{t,j})}, \quad \text{where } z_{t,i} = h_t \cdot u_i.$$
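The Gumbel-Top-K selection step fits in a few lines; the toy embeddings below are illustrative stand-ins for the learned encoders $f_{\rm ctx}$ and $f_{\rm skill}$:

```python
import numpy as np

def gumbel_top_k(h, U, k, rng):
    """Sample an ordered Top-K skill set without replacement.

    h: (d,) context encoding; U: (n_skills, d) skill-description embeddings.
    Adding i.i.d. Gumbel(0, 1) noise to the logits and taking the top-k
    indices is equivalent to sequentially sampling k distinct skills from
    the softmax distribution p(i | h) proportional to exp(h . u_i).
    """
    z = U @ h                        # selection logits z_{t,i} = h_t . u_i
    g = rng.gumbel(size=z.shape)     # Gumbel(0, 1) noise
    return np.argsort(-(z + g))[:k]  # ordered Top-K skill indices
```

Because skills enter the score only through their description embeddings, the same controller scores a skill bank of any size, which is what lets the designer add skills later without retraining the selection head from scratch.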

2.3 Executor (Skill-Conditioned Memory Construction)

  • Input: Receives the span $x_t$, current memory set $M_t$, and selected skills $\{s_{a_{t,1}}, \ldots, s_{a_{t,K}}\}$ via a fixed prompt to an LLM.
  • Action Generation: Outputs a structured sequence of skill-parameterized memory update actions in a single LLM generation step. Actions include:
    • INSERT: Creating new memory items.
    • UPDATE: Revising specific memory items by index.
    • DELETE: Removing specified memory items.
  • Effect: Updates the trace-specific memory bank according to parsed actions.
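Applying the parsed actions to the trace-specific memory bank can be sketched as below; the action schema (`op`, `index`, `text`) is illustrative, not the paper's exact format:

```python
def apply_actions(memory: list[str], actions: list[dict]) -> list[str]:
    """Apply a parsed sequence of executor actions to a memory bank.

    Each action is a dict like {"op": "INSERT" | "UPDATE" | "DELETE",
    "index": int, "text": str}. Actions are applied in order, so an
    index always refers to the memory state at the time it is applied.
    """
    memory = list(memory)  # work on a copy of the trace-specific bank
    for a in actions:
        if a["op"] == "INSERT":
            memory.append(a["text"])          # create a new memory item
        elif a["op"] == "UPDATE":
            memory[a["index"]] = a["text"]    # revise an item by index
        elif a["op"] == "DELETE":
            memory.pop(a["index"])            # remove an item by index
        # SKIP (or an unrecognized op) leaves memory unchanged
    return memory
```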

2.4 Designer (Skill Evolution Mechanism)

  • Activation: Triggered periodically (every $E$ training steps).
  • Stage 1—Hard-Case Aggregation:
    • Maintains a buffer of recent query failures with associated metadata: query, retrieved memories, answers, ground truth, scalar reward $r(q)$, and failure count $c(q)$.
    • Assigns a difficulty score: $d(q) = (1 - r(q)) \cdot c(q)$.
    • Clusters queries and samples representative problematic cases using, for instance, k-means in embedding space.
  • Stage 2—LLM-Guided Skill Update:
    • Prompts an LLM with the aggregated hard cases and the current skill bank to produce:
      • Refined templates for existing skills.
      • Proposals for new skills addressing as-yet-uncovered memory behaviors.
    • Applies at most $M$ changes per evolution cycle.
    • Reverts to earlier snapshots if validation rewards post-update do not improve.
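Stage 1 can be sketched as follows; the clustering itself (e.g. k-means on query embeddings) is assumed to happen elsewhere, and the dict keys are illustrative:

```python
def select_hard_cases(failures, clusters, per_cluster=1):
    """Pick representative hard cases for the designer's LLM prompt.

    failures: list of dicts with reward "r" (r(q) in [0, 1]) and failure
    count "c" (c(q)); clusters: parallel list of cluster ids, e.g. from
    k-means in embedding space. Difficulty is d(q) = (1 - r(q)) * c(q),
    and the hardest case(s) per cluster are returned.
    """
    by_cluster = {}
    for f, cid in zip(failures, clusters):
        f = dict(f, difficulty=(1.0 - f["r"]) * f["c"])
        by_cluster.setdefault(cid, []).append(f)
    picks = []
    for cid, group in sorted(by_cluster.items()):
        group.sort(key=lambda f: f["difficulty"], reverse=True)
        picks.extend(group[:per_cluster])
    return picks
```

Weighting by $c(q)$ means queries that keep failing across evolution cycles dominate the designer's attention, rather than one-off low-reward outliers.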

New skill adoption is encouraged by temporarily biasing controller logits, ensuring at least a fraction $T$ of total probability mass is assigned to new skills for the first $\tau$ steps after introduction.

3. Algorithmic Procedure and Data Flow

Memory processing in MemSkill is structured around sequential, span-level analysis of agent interaction traces. The procedural flow comprises the following steps:

  1. Segmentation: Each input trace is split into spans $x_1, \ldots, x_T$.
  2. Iterative Processing (for $t = 1$ to $T$):
    • Retrieve the top $R$ relevant items $M_t$ from the current trace-specific memory bank.
    • Controller computes contextual encodings and samples an ordered Top-$K$ skill set for the current span.
    • Executor LLM generates and parses memory update actions under the selected skills.
    • Memory bank is updated accordingly.
  3. Trace Finalization:
    • Evaluate memory-dependent queries to obtain a terminal reward $R$.
    • Use PPO to update the controller’s policy based on this reward.
    • Log failures to the designer’s buffer for eventual skill evolution cycles.
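The span-level data flow of steps 1-2 can be summarized in a short loop; `memory_retrieve`, `controller`, and `executor` are stand-ins for the learned modules, and only INSERT is handled here for brevity:

```python
def process_trace(spans, memory_retrieve, controller, executor):
    """One pass over a segmented trace: retrieve -> select skills -> update.

    spans: the segmented input x_1..x_T. Each callable mimics one module:
    memory_retrieve returns the top-R relevant items, controller returns an
    ordered Top-K skill set, executor returns parsed memory-update actions.
    """
    memory = []  # trace-specific memory bank, empty at the start of a trace
    for span in spans:
        retrieved = memory_retrieve(span, memory)   # top-R relevant items M_t
        skills = controller(span, retrieved)        # ordered Top-K skill set
        actions = executor(span, retrieved, skills) # one LLM generation step
        for a in actions:                           # apply parsed actions
            if a["op"] == "INSERT":
                memory.append(a["text"])
    return memory
```

A toy run with trivial stand-ins (`executor` inserts the raw span) just copies the spans into memory, which makes the data flow easy to check end to end.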

4. Training Objectives and Optimization

4.1 Reinforcement Learning for Skill-Selection

  • Reward Structure: Only the terminal step receives the episode reward ($r_T = R$); all intermediate per-span rewards are zero.
  • Return Calculation:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

  • Policy Loss (PPO clipped surrogate):

$$L^{\rm policy}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$$

with likelihood ratios $\rho_t$ computed from the current and previous controller parameters.

  • Value Loss:

$$L^{\rm value}(\phi) = \mathbb{E}_t\left[(V_\phi(h_t) - G_t)^2\right]$$

  • Entropy Regularization:

$$H(\theta) = \mathbb{E}_t\left[-\sum_i p_\theta(i \mid h_t)\log p_\theta(i \mid h_t)\right]$$

  • Overall Objective: Maximize

$$L^{\rm policy} - c_v L^{\rm value} + c_H H(\theta)$$
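The three terms combine into a short numpy sketch; the advantages $\hat{A}_t$ are assumed to be precomputed, and the helper names are illustrative:

```python
import numpy as np

def returns(rewards, gamma=0.99):
    """Discounted return G_t; here only the terminal step carries reward R."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def ppo_objective(logp_new, logp_old, adv, values, G, probs,
                  eps=0.2, c_v=0.5, c_h=0.01):
    """Clipped surrogate - c_v * value loss + c_h * entropy (to be maximized)."""
    rho = np.exp(logp_new - logp_old)                     # likelihood ratios
    policy = np.minimum(rho * adv,
                        np.clip(rho, 1 - eps, 1 + eps) * adv).mean()
    value = ((values - G) ** 2).mean()                    # L^value
    entropy = -(probs * np.log(probs)).sum(axis=1).mean() # H(theta)
    return policy - c_v * value + c_h * entropy
```

Because rewards are terminal-only, $G_t = \gamma^{T-t} R$ for every span, so the discount factor alone spreads credit backward across the trace.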

4.2 Skill Exploration Incentivization

To ensure exploration and integration of newly introduced skills following designer-driven evolution, controller logits are biased for $\tau$ steps so that newly added skills collectively receive at least $T_t = T_0 (1 - t/\tau)$ probability mass at each step. This encourages rapid evaluation of new skills before annealing exploration pressure.
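One way to realize this bias, sketched under the assumption that a uniform shift is added to every new-skill logit (the paper does not specify the exact mechanism):

```python
import numpy as np

def bias_new_skill_logits(z, is_new, t, tau, T0=0.3):
    """Shift new-skill logits so their combined softmax mass is at least
    T_t = T0 * (1 - t / tau) during the first tau steps after introduction.
    T0 = 0.3 is an arbitrary placeholder value."""
    T_t = max(0.0, T0 * (1.0 - t / tau))
    p = np.exp(z - z.max())
    p /= p.sum()
    mass = p[is_new].sum()             # current mass on newly added skills
    if T_t <= 0.0 or mass >= T_t:
        return z                       # floor already satisfied
    # Adding b to every new-skill logit scales their unnormalized mass by
    # e^b; solve for b so the renormalized new-skill mass equals T_t.
    b = np.log(T_t / (1 - T_t)) - np.log(mass / (1 - mass))
    z = z.copy()
    z[is_new] += b
    return z
```

As $t$ approaches $\tau$ the floor $T_t$ anneals to zero, so the bias vanishes once the controller has had enough gradient signal to rank the new skills on its own.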

5. Experimental Results and Representative Ablations

MemSkill was benchmarked on LoCoMo and LongMemEval (long-context dialogues), HotpotQA (question answering with domain and format shift), and ALFWorld (simulated embodied tasks). Evaluation metrics include F1, LLM-judge scores, and task-specific success metrics.

Baseline comparisons included models such as No-Memory, Chain-of-Notes, ReadAgent, MemoryBank, A-MEM, Mem0, LangMem, and MemoryOS.

Key Results

  • On all evaluation suites and using two 70B–80B LLMs, MemSkill surpasses static-memory and RL-only baselines.
  • Demonstrated strong zero-shot transfer: skills learned on LoCoMo generalize to LongMemEval and HotpotQA without fine-tuning.
  • For embodied tasks, skill-conditioned memories increase success rates on both seen and unseen ALFWorld splits.

Ablation Results (LoCoMo, LLM-judge metric):

| Experiment Variant | LLaMA | Qwen |
| --- | --- | --- |
| Full MemSkill | 50.96 | 52.07 |
| – Controller replaced by random | 45.86 | 41.24 |
| – Designer replaced by static skills | 44.11 | 34.71 |
| – Designer: refinement only (no new skills) | 44.90 | 46.97 |

  • Learning to select relevant skills provides an approximate 5-point gain.
  • Designer-driven evolution yields an additional 6–17-point improvement, especially under model distribution shift.
  • This suggests that both adaptive skill selection and ongoing skill discovery are crucial for generalizable, robust memory systems in LLM agents.

Representative Evolved Skills

  • LoCoMo: “Capture Temporal Context,” “Capture Activity Details,” “Handle Entity Relationships,” “Refine Temporal Details with Context.”
  • ALFWorld: “Capture Action Constraints,” “Track Object Location,” “Track Object Movements.”

6. Interpretation and Implications

MemSkill empirically demonstrates that (i) memory operations can be abstracted as learnable skills rather than fixed routines, (ii) RL-based learning of skill selection yields measurable improvements in handling diverse and lengthy traces, and (iii) the skill bank itself benefits from continual, data-driven evolution based on agent error. The integration of controller, executor, and designer modules into a closed loop produces a memory manager for LLM agents that adapts and generalizes without fixed heuristics or reliance on static, human-encoded priors. This architecture offers evidence supporting emergent, compositional memory management as a practical paradigm for advanced agent systems (Zhang et al., 2 Feb 2026).
