Non-Parametric PPO for LLM Procedural Memory
- Non-Parametric PPO is a method that evolves reusable, discrete Skills in LLM agents without conventional neural network updates.
- It employs LLM-based semantic gradients to generate, refine, and validate candidate Skills via a PPO-style trust-region mechanism.
- The approach improves empirical performance and efficiency in sequential decision-making by maintaining a compact, high-utility procedural memory.
Non-Parametric Proximal Policy Optimization (Non-Parametric PPO) is a methodology for evolving reusable procedural memory in LLM agents performing sequential decision-making. Unlike classical parametric reinforcement learning algorithms, Non-Parametric PPO operates entirely in the space of discrete, symbolic Skills—defined procedures consisting of initiation, execution, and termination conditions—eschewing parameter updates or gradient learning within neural network weights. It serves as the core mechanism within the ProcMEM framework, which formalizes agent decision processes as Skill-augmented Markov Decision Processes (Skill-MDPs) and enables autonomous, efficient accumulation and refinement of procedural knowledge without model retraining (Mi et al., 2 Feb 2026).
1. Skill-MDP Formalism and Non-Parametric PPO Operator
The agent is modeled under a Skill-augmented MDP (Skill-MDP), which extends a standard MDP over a language-based state space and a primitive action space with a pool of executable Skills. Decision making is governed by a hierarchical policy over Skills and primitive actions, and the objective is to maximize the long-term expected return.
The only evolving element is the Skill pool. The evolution operator applies Skill edits based on a batch of interaction trajectories. Non-Parametric PPO instantiates this operator by analogy to conventional PPO, but operates on compositional Skill candidates rather than parameter vectors. New Skills are generated via semantic refinements and then undergo a PPO-inspired clipped-signal verification to ensure high-quality, bounded updates.
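To make the objects in this section concrete, the following is a minimal sketch of how a Skill, a Skill pool, and an evolution operator might be represented. The class name `Skill`, the type aliases, and the helpers `edit_skill` and `identity_evolution` are illustrative assumptions, not names from ProcMEM; the only element taken from the text is that a Skill has initiation, execution, and termination parts and that the operator maps a pool and a batch of trajectories to a new pool.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Skill:
    """A discrete, symbolic Skill stored as natural-language text (assumed layout)."""
    name: str
    initiation: str   # when the Skill should be invoked
    execution: str    # what procedure to follow once invoked
    termination: str  # when the Skill should stop


# A trajectory is simplified here to a list of (state, action, reward) steps.
Trajectory = List[Tuple[str, str, float]]
SkillPool = Dict[str, Skill]

# An evolution operator maps (pool, batch of trajectories) to a new pool.
EvolutionOperator = Callable[[SkillPool, List[Trajectory]], SkillPool]


def edit_skill(skill: Skill, **fields: str) -> Skill:
    """Return an edited copy; the original Skill is left untouched."""
    return replace(skill, **fields)


def identity_evolution(pool: SkillPool, batch: List[Trajectory]) -> SkillPool:
    """Placeholder operator that returns an unchanged copy of the pool."""
    return dict(pool)
```

Keeping Skills immutable means an edit always produces a new candidate object, so the current Skill stays available for comparison during verification.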
2. Procedural Steps and Algorithmic Structure
The Non-Parametric PPO procedure involves the following main stages for each iteration:
- Experience Collection: Execute the current Skill-augmented policy to collect a batch of interaction trajectories.
- Semantic Gradient Extraction & Candidate Generation: For each Skill invoked,
  - Compute semantic gradients for each trajectory via LLM hindsight diagnosis.
  - Aggregate gradients across the batch.
  - Form a candidate Skill by applying the aggregated gradient to the current Skill definition.
- PPO-Style Trust Region Verification (PPO Gate): For each candidate,
  - Calculate importance ratios from counterfactual log-probabilities under the frozen LLM.
  - Estimate the surrogate advantage.
  - Compute the PPO-style clipped objective.
  - Accept the candidate only if its clipped objective is strictly positive.
- Score-Based Pool Maintenance: Update scores for all Skills and prune those with non-positive scores or those deemed redundant, keeping the pool at or below its capacity.
This process is summarized in Algorithm 1 of (Mi et al., 2 Feb 2026). It contrasts with parametric PPO by manipulating symbolic Skill definitions rather than neural weights. Candidate Skills are verified using counterfactual log-probabilities under a frozen LLM, and pool maintenance uses online advantage-based scores.
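The four stages can be sketched as a single iteration function. This is only an illustrative skeleton, not Algorithm 1 itself: the function `evolve_once` and its callables `collect`, `propose`, `gate_score`, and `maintain` are hypothetical stand-ins, Skills are reduced to plain strings, and the loop visits every Skill in the pool for brevity rather than only the invoked ones. Adding an accepted candidate under a new name, rather than overwriting its parent, is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

Pool = Dict[str, str]   # Skill name -> Skill text (simplified stand-in)
Batch = List[str]       # stand-in for a batch of interaction trajectories


def evolve_once(
    pool: Pool,
    collect: Callable[[Pool], Batch],
    propose: Callable[[str, str, Batch], Tuple[str, str]],
    gate_score: Callable[[str, str, Batch], float],
    maintain: Callable[[Pool, Batch], Pool],
) -> Pool:
    """Run one evolution iteration; every callable is a hypothetical stand-in."""
    batch = collect(pool)              # 1. experience collection
    updated = dict(pool)               # copy so the input pool is not mutated
    for name, skill in pool.items():
        # 2. semantic-gradient edit produces a named candidate
        cand_name, candidate = propose(name, skill, batch)
        # 3. PPO Gate: admit only strictly positive clipped scores
        if gate_score(skill, candidate, batch) > 0.0:
            updated[cand_name] = candidate
    return maintain(updated, batch)    # 4. score-based pool maintenance
```

Passing the stages in as callables keeps the control flow visible while leaving the LLM-dependent parts unspecified.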
3. Component Mechanisms
3.1 Semantic Gradients
A semantic gradient is a tuple of natural-language refinements corresponding to the initiation, execution, and termination subconditions of the Skill. These are derived from LLM-based retrospectives on Skill invocations, generating “refinement suggestions” that operate as language-space analogues to gradient steps. Batch-level aggregation of these gradients leads to the construction of candidate Skills whose symbolic structure is then empirically validated.
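As an illustration of the tuple structure described above, the sketch below represents a semantic gradient as three text fields and aggregates a batch of them field by field. The names `SemanticGradient`, `aggregate`, and `join_unique` are assumptions made for this example; in particular, `join_unique` is a trivial de-duplicating stand-in for whatever LLM-based summarization the actual method uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SemanticGradient:
    """Natural-language refinement suggestions, one per Skill component."""
    initiation: str
    execution: str
    termination: str


def join_unique(suggestions: List[str]) -> str:
    """Stand-in for an LLM summarizer: de-duplicate, keep first-seen order."""
    return "; ".join(dict.fromkeys(suggestions))


def aggregate(
    grads: List[SemanticGradient],
    summarize: Callable[[List[str]], str] = join_unique,
) -> SemanticGradient:
    """Merge per-trajectory suggestions into one batch-level gradient."""
    return SemanticGradient(
        initiation=summarize([g.initiation for g in grads if g.initiation]),
        execution=summarize([g.execution for g in grads if g.execution]),
        termination=summarize([g.termination for g in grads if g.termination]),
    )
```

Empty strings are treated as "no suggestion for this component", so a trajectory can contribute to only one or two of the three fields.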
3.2 PPO Gate (Trust-Region Verification)
For validation, the PPO Gate applies a trust-region criterion by evaluating each candidate under a PPO-style clipped surrogate. The importance ratio is computed from counterfactual log-probabilities under the frozen LLM, and only candidates with a strictly positive clipped score are admitted to the Skill pool. This mechanism mirrors PPO’s monotonic improvement constraint, but applies it to discrete, symbolic update proposals rather than parameter increments, filtering out unstable or illusory Skill edits.
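For reference, the standard PPO clipped surrogate averages $\min(\rho_t A_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t)$ over steps, where $\rho_t$ is the importance ratio and $A_t$ the advantage. The sketch below applies that standard form to per-step log-probabilities. It assumes the gate follows this textbook formula, which the source does not spell out here. The function names, the symbols $\rho_t$, $A_t$, and $\epsilon$, and the default `eps=0.2` (a common PPO setting) are illustrative choices, not values from the paper.

```python
import math
from typing import Sequence


def clipped_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,  # assumed clip range, not taken from the source
) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over steps."""
    if not (len(logp_new) == len(logp_old) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not advantages:
        return 0.0
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # likelihood ratio, candidate vs. current
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def passes_gate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,
) -> bool:
    """Admit a candidate only if its clipped score is strictly positive."""
    return clipped_surrogate(logp_new, logp_old, advantages, eps) > 0.0
```

Because the minimum is taken with the clipped term, a candidate gets no extra credit for raising the likelihood of good actions beyond the clip range, while increases on bad actions are penalized in full.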
3.3 Score-Based Pool Maintenance
Skill pool maintenance is governed by an online scoring mechanism. After each trajectory batch, each Skill's average gain is recomputed as the mean advantage over its invocation steps. Cumulative scores determine whether a Skill is retained: Skills with a non-positive score are pruned immediately, and if the pool still exceeds capacity, further pruning removes the lowest-scoring or semantically redundant Skills. The result is a highly compressed yet high-utility procedural memory.
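A minimal sketch of this maintenance step follows, under two stated assumptions: the cumulative score is updated by adding each batch's mean advantage (the source does not give the exact update rule), and redundancy-based pruning is omitted because it would need a semantic similarity measure. The function names `update_scores` and `prune` are hypothetical.

```python
from typing import Dict, List


def update_scores(
    scores: Dict[str, float],
    step_advantages: Dict[str, List[float]],
) -> Dict[str, float]:
    """Add each Skill's mean advantage over its invocation steps to its score.

    The additive update is an assumption made for illustration.
    """
    updated = dict(scores)
    for name, advs in step_advantages.items():
        if advs:  # Skills that were not invoked keep their score
            updated[name] = updated.get(name, 0.0) + sum(advs) / len(advs)
    return updated


def prune(scores: Dict[str, float], capacity: int) -> Dict[str, float]:
    """Drop non-positive scores, then keep at most `capacity` top-scoring Skills."""
    survivors = {name: s for name, s in scores.items() if s > 0.0}
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:capacity])
```

For example, a Skill whose cumulative score turns negative after a bad batch is removed regardless of capacity, while a positively scored Skill is removed only when the pool is over capacity and it ranks below the cut.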
4. Computational Profile and Convergence Properties
Each evolution iteration involves experience collection (a batch of trajectories), candidate generation and validation across the pool, and a surrogate-objective recalculation for each candidate. Per-iteration cost therefore grows with the batch size, the mean trajectory length, and the number of Skills in the pool, and the settings used in practice are moderate. There is no formal convergence proof; however, accepting only candidates with a positive surrogate score favors monotonic improvement and pruning enforces a bounded memory size, resulting in observed stabilization of Skill quality over time.
5. Empirical Outcomes and Comparative Analysis
Non-Parametric PPO demonstrates strong empirical performance on multiple benchmarks:
- Reuse Rates: In-domain reuse is approximately $0.925$ (baseline $0.35$–$0.70$), cross-task $0.825$–$0.90$, cross-agent $0.85$–$0.875$.
- Performance: ALFWorld success rates are $0.90$ (train) and $0.909$ (OOD) compared to $0.48$–$0.75$ for baselines. On Mastermind-v0, the average returns are $0.606$ (Base), $0.463$ (Hard), $0.333$ (Extreme).
- Efficiency: Total procedural memory is only $816$ tokens, averaging $102$ tokens per Skill, far less than episodic storage baselines. Per-step prompt overhead and the retrieval ratio are also reported in (Mi et al., 2 Feb 2026).
Ablations reveal the necessity of each component: removing Non-Parametric PPO drops reuse to $0.563$ and return to $0.482$; removing semantic gradients lowers the PPO Gate pass rate and collapses pool quality; omitting the PPO Gate floods the pool and collapses reuse; and eliminating score-based pruning leads to negative pool quality and vanishing long-term performance.
Visualizations of evolutionary Skill lineages and invocation distributions corroborate the stability and transparency of accumulated procedural knowledge. The combination of LLM-driven semantic refinement, trust-region gatekeeping, and advantage-based pruning yields a compact, robust, and highly reusable procedural memory system for LLM agents, outperforming prior episodic strategies in both efficiency and return without any neural weight updates (Mi et al., 2 Feb 2026).