PRAXIS: Procedural Recall for LLM Agents
- The paper introduces a nonparametric procedural memory that stores structured tuples, enabling rapid recall of successful past actions by indexing both environmental and internal states.
- Empirical evaluations demonstrate improved one-shot and Best-of-5 task accuracy, reduced steps to completion, and enhanced reliability in complex web-based tasks.
- The approach leverages efficient retrieval algorithms and ANN methods to ensure scalable memory management and robust adaptation in dynamic, stateful environments.
Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS) is a nonparametric procedural memory augmentation for language-agent architectures. It enables agents to perform rapid, post-deployment procedural learning through experience storage and retrieval indexed by both environmental and internal states. This approach supports real-time recall of previously successful action sequences, significantly enhancing data efficiency, reliability, and generalization when agents are deployed in complex, stateful, or fast-changing environments, such as real-world web interfaces (Bi et al., 27 Nov 2025).
1. System Architecture and Workflow
PRAXIS operates as an adjunct to LLM agents. During online interactions, the agent records each step as a structured tuple: pre-action environment embedding, a vector summarizing the internal (task) state, the executed action, and the post-action environment embedding. The environment state is represented both visually (e.g., screenshot, DOM tree) and textually (e.g., serialized HTML), while the internal state vector encodes the agent's current directive or intention.
When selecting actions, the agent computes embeddings for the current environment and internal state, retrieves the most similar past experiences from memory, and integrates these “exemplars” directly into the LLM prompt for policy generation. This mechanism supports episodic recall analogous to trial-and-error learning in biological systems, biasing actions toward those previously effective in similar contexts.
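To make the workflow concrete, a minimal Python sketch of the recording side is shown below; `Experience`, `ProceduralMemory`, and the field names are hypothetical, and the encoders that produce the embeddings are assumed to live in the agent loop rather than here.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Experience:
    """One memory entry in the style described above: pre/post state, intent, action."""
    env_before: Any   # pre-action environment representation (e.g., embedding of screenshot/DOM/HTML)
    intent: Any       # internal-state embedding of the agent's current directive
    action: str       # the executed action, e.g. "click('#submit')"
    env_after: Any    # post-action environment representation


@dataclass
class ProceduralMemory:
    """Append-only experience store; retrieval is defined separately."""
    entries: List[Experience] = field(default_factory=list)

    def record(self, env_before: Any, intent: Any, action: str, env_after: Any) -> None:
        # Constant-time append per step; the agent calls this after every executed action.
        self.entries.append(Experience(env_before, intent, action, env_after))
```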
2. Formalization and Retrieval Algorithms
Let $e_t$ denote the environment embedding at time $t$. For each action $a_t$ taken in $e_t$ leading to $e_{t+1}$, a memory entry is:

$$m_t = (e_t, z_t, a_t, e_{t+1})$$

where $e_t$ and $e_{t+1}$ are the pre- and post-action environment embeddings, $z_t$ is the internal state embedding, and $a_t$ is the action.
Retrieval at decision time involves two-stage matching. For each candidate memory entry $m_i = (e_i, z_i, a_i, e_i')$, compute similarities to the current state:
- Environment similarity: $s_i^{\text{env}} = \text{sim}_{\text{env}}(e_t, e_i)$
- Internal similarity: $s_i^{\text{int}} = \text{sim}_{\text{int}}(z_t, z_i)$
Domain-specific functions, such as IoU between DOM bounding boxes multiplied by a DOM-length overlap penalty, are used for $\text{sim}_{\text{env}}$; cosine similarity is used for $\text{sim}_{\text{int}}$ on the internal state.
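A minimal sketch of such similarity functions is given below, under the assumptions that DOM elements are matched positionally and that bounding boxes are axis-aligned `(x1, y1, x2, y2)` tuples; the exact matching and penalty used by PRAXIS may differ.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # axis-aligned (x1, y1, x2, y2)


@dataclass
class EnvState:
    """Hypothetical environment representation used only for matching."""
    boxes: List[Box]   # bounding boxes of salient DOM elements
    dom_len: int       # length of the serialized DOM/HTML


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sim_env(a: EnvState, b: EnvState) -> float:
    """Mean IoU over positionally matched DOM boxes, scaled by a DOM-length overlap penalty."""
    matched = [iou(x, y) for x, y in zip(a.boxes, b.boxes)]
    box_score = sum(matched) / len(matched) if matched else 0.0
    length_penalty = min(a.dom_len, b.dom_len) / max(a.dom_len, b.dom_len, 1)
    return box_score * length_penalty


def sim_int(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between internal-state embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0
```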
The retrieval algorithm proceeds as follows:
- Select the top-$k$ entries by environment similarity.
- Sort these by internal similarity, and filter by an environment-similarity threshold $\tau$.
The agent’s prompt is augmented with formatted summaries of these top experiences so the LLM can condition its action selection on explicit historical precedents.
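The two-stage selection and prompt formatting can be sketched as follows; `retrieve` and `format_exemplars` are hypothetical helper names, the similarity functions are passed in (for example the `sim_env`/`sim_int` sketches above), and the defaults $k=20$ and $\tau=0.5$ merely echo the guidelines in Section 5 rather than fixed values from the paper.

```python
def retrieve(memory, env_query, intent_query, sim_env, sim_int, k=20, tau=0.5):
    """Two-stage exemplar retrieval: top-k by environment similarity,
    thresholded at tau, then ordered by internal (intent) similarity."""
    scored = [
        (sim_env(env_query, e.env_before), sim_int(intent_query, e.intent), e)
        for e in memory.entries
    ]
    top_env = sorted(scored, key=lambda t: t[0], reverse=True)[:k]   # stage 1: environment match
    kept = [t for t in top_env if t[0] >= tau]                       # drop weak environment matches
    kept.sort(key=lambda t: t[1], reverse=True)                      # stage 2: intent match
    return [e for _, _, e in kept]


def format_exemplars(exemplars) -> str:
    """Render retrieved experiences as plain text to append to the LLM prompt."""
    return "\n".join(
        f"Exemplar {i}: in a similar state, the action {e.action!r} previously succeeded."
        for i, e in enumerate(exemplars, 1)
    )
```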
3. Implementation Details and Complexity
The storage of each experience tuple involves constant-time encoder operations and a memory append. For $N$ total memories, exhaustive linear retrieval costs $O(Nc)$, where $c$ is the cost of a single embedding comparison, but scalable approximate nearest neighbor (ANN) methods (e.g., HNSW, Faiss) can reduce this to sublinear time for large $N$.
- Storage per step: $O(1)$
- Memory usage: $O(Nd)$, with $d$ the embedding dimensionality
- Retrieval (exact): $O(Nc)$; retrieval (ANN): sublinear in $N$ with appropriate libraries
Memory growth is linear with agent lifetime. Pruning based on recency or low similarity prevents unbounded expansion.
Recommended configuration: a modest retrieval breadth $k$, embedding dimensionality $d$ in $256$–$1024$, and a bounded memory size $N$ maintained by pruning.
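One way to realize the ANN path is an HNSW index over L2-normalized environment embeddings, for example via Faiss, so that L2 ranking coincides with cosine ranking. The sketch below is an illustration under those assumptions (the class name and parallel-list bookkeeping are invented), not the reference implementation; pruning can be handled by periodically rebuilding the index from the retained entries.

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed


class AnnMemoryIndex:
    """Approximate nearest-neighbour lookup over environment embeddings."""

    def __init__(self, dim: int, m: int = 32):
        self.index = faiss.IndexHNSWFlat(dim, m)   # HNSW graph with m links per node
        self.entries = []                          # parallel list of experience objects

    def add(self, env_embedding: np.ndarray, experience) -> None:
        vec = env_embedding.astype("float32").reshape(1, -1)
        faiss.normalize_L2(vec)                    # unit-norm so L2 order matches cosine order
        self.index.add(vec)
        self.entries.append(experience)

    def search(self, env_query: np.ndarray, k: int = 20):
        q = env_query.astype("float32").reshape(1, -1)
        faiss.normalize_L2(q)
        _, ids = self.index.search(q, k)           # sublinear candidate lookup
        return [self.entries[i] for i in ids[0] if i != -1]
```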
4. Experimental Evaluation
PRAXIS was evaluated using the Altrina agent on the REAL benchmark, covering 112 web interaction tasks across 11 real-site replicas (Bi et al., 27 Nov 2025). Multiple vision-language backbones were tested: Llama 4, Qwen3-VL, Gemini 2.5 Flash, GPT-5, Claude Sonnet 4.5.
Notable empirical results include:
- One-shot task accuracy: improved with procedural memory enabled (±1.2%)
- Best-of-5 accuracy: improved on most backbones (per-model figures in the table below)
- Average steps to completion: reduced from $25.2$ to $20.2$
- Reliability: increased relative to the memory-free baseline
Table: Best-of-5 Accuracy (%), without and with procedural memory (PM)
| Model | No PM | With PM |
|---|---|---|
| Llama 4 | 47.3 | 52.7 |
| Qwen3-VL | 44.6 | 47.3 |
| Gemini 2.5 Flash | 59.8 | 61.6 |
| GPT-5 | 56.2 | 57.1 |
| Claude Sonnet 4.5 | 60.7 | 59.8 |
Performance increased as the retrieval breadth $k$ was raised to 20, then plateaued, suggesting diminishing returns for larger $k$ due to LLM prompt-size constraints.
Qualitative experiments indicate that procedural memory also supports generalization: action traces collected on some tasks (e.g., multi-page forms) transfer to unseen but structurally similar workflows.
5. Design Considerations and Extensions
Key parameters influencing effectiveness include the embedding dimension $d$, the number of stored memories ($N$), the number of exemplars returned per query ($k$), and the matching thresholds. Empirical guidelines (an illustrative configuration sketch follows this list):
- A retrieval breadth $k$ of up to $20$ offers a good balance for LLM prompt integration
- An environment-similarity matching threshold $\tau$ of up to $0.5$ filters spurious matches
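The guidelines can be gathered into a single configuration object; the defaults below are illustrative placeholders consistent with the ranges above (the specific numbers, including the memory cap, are assumptions rather than values reported in the paper).

```python
from dataclasses import dataclass


@dataclass
class PraxisConfig:
    """Illustrative hyperparameters for a PRAXIS-style memory; tune per deployment."""
    embedding_dim: int = 512         # d, typically in the 256-1024 range
    retrieval_breadth: int = 20      # k; gains plateau around 20 exemplars
    env_sim_threshold: float = 0.5   # tau; filters spurious environment matches
    max_memories: int = 10_000       # illustrative bound on N, enforced by pruning
```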
Similarity functions can be further adapted—e.g., using learned encoders for DOM invariance or more sophisticated internal state models. For agents facing prompt-size or memory-overhead issues, approximate nearest neighbor techniques and pruning strategies are advocated.
While the initial implementation is for web environments, the underpinning architecture is domain-agnostic: by redefining state encoders and similarity metrics, PRAXIS generalizes to tasks in robotics, terminal/control environments, and others.
6. Limitations and Future Directions
Identified issues:
- LLM prompt-length constraints restrict $k$ and the level of exemplar description detail, potentially limiting memory leverage; future hierarchical or chunked retrieval may mitigate this.
- The basic DOM-level similarity may fail under dramatic UI changes, pointing to the utility of learned cross-modal or structure-invariant encoders.
- Presently, procedural memory is immutable. Allowing update or “evolution” of entries as further information accrues would more closely mimic biological learning systems.
- Absence of an explicit out-of-distribution evaluation leaves full generalization properties to be further studied; current evidence is qualitative.
Potential extensions involve integrating adaptive similarity kernels, attention-based memory mechanisms, or multi-modal selectors—advancing beyond fixed nearest-exemplar contexts.
7. Conceptual Context and Broader Significance
PRAXIS provides a methodologically simple but effective engine for procedural learning in LLM-based agents, avoiding the need for costly model fine-tuning or offline retraining between deployments. By explicitly indexing experiences on both environment and internal states, and supporting real-time retrieval of actionable episodes, the method instantiates a form of lightweight, continual learning. This approach facilitates data-efficient adaptation and robustness in environments with evolving structure, and can potentially be unified with more formal strategies such as topological experience replay (Hong et al., 2022)—which similarly structures experience resources for backward value backup propagation in RL agents.
This suggests a convergent trajectory in reinforcement learning and LLM-based agent research, wherein explicit indexing and structural recall of experience can drive rapid, robust procedural adaptation across both parametric and nonparametric agent architectures.