MemFactory: Unified Inference & Training Framework for Agent Memory

Published 31 Mar 2026 in cs.CL and cs.AI | (2603.29493v2)

Abstract: Memory-augmented LLMs are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.

Summary

  • The paper presents a unified RL framework that standardizes the agent memory lifecycle into composable modules for efficient long-context optimization.
  • It uses Group Relative Policy Optimization (GRPO), which needs no separate critic model, to reduce VRAM usage during training, and reports relative gains of up to 14.8% over the corresponding base models.
  • The framework supports rapid prototyping and reproducible baselines for various memory-augmented architectures, reducing integration burdens, with all reported experiments running on a single GPU.

MemFactory: A Unified Framework for Memory-Augmented RL Agents

Motivation and Context

Memory-augmented LLMs are indispensable for the transition toward persistent, context-rich AI agents capable of long-range reasoning, continual adaptation, and personalized dialogue across protracted interactions. Fragmentation in the implementation ecosystem has hindered research pace—existing approaches to RL-based memory optimization are typically dataset- or methodology-specific, tightly coupled to ad hoc pipelines, and pose nontrivial integration burdens. While unified training frameworks exist for LLM fine-tuning, they are predominantly stateless and do not support the stateful operations intrinsic to memory-centric agents. MemFactory directly addresses this gap, offering a standardized and modular RL-centric infrastructure designed to facilitate innovation in memory augmentation for LLM-based agents.

Framework Architecture

MemFactory abstracts the entire agent memory lifecycle into composable atomic modules, orchestrated across four primary layers: Module, Agent, Environment, and Trainer.

Figure 1: The overall architecture of MemFactory, illustrating the interdependencies among the Modules, Agent, Environment, and Trainer layers.

  • Module Layer encapsulates the engineering of memory operations—structured as Extraction, Update, Retrieval, and end-to-end Agent modules. Each module adheres to standard interfaces (generate, rollout, inference) to support seamless transition between training and deployment, as well as across RL and static paradigms.
  • Agent Layer implements policy composition and execution. By leveraging plug-and-play modular integration, agents can be rapidly instantiated for classical architectures (e.g., Memory-R1, MemAgent, RMM) or novel configurations, with direct support for long-context optimization via efficient pre-trained model integration (Transformers, FlashAttention-2).
  • Environment Layer mediates data loading and multi-dimensional reward computation. It standardizes state representation and supports both explicit (memory banks) and implicit (long-context) agent-environment interfaces, enabling adaptable reward schedules including LLM-as-judge scoring and format-based correctness.
  • Trainer Layer employs Group Relative Policy Optimization (GRPO) for memory- and compute-efficient RL. By normalizing rewards within each group of outputs sampled for the same input, GRPO removes the need for a separately parameterized critic, which significantly reduces VRAM demands in long-context scenarios.

This architecture enables flexible iteration on memory policy research, supports out-of-the-box baselines for recent SOTA paradigms, and establishes a paradigm for reproducible, unified Memory-RL experimentation.
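As a rough illustration of the plug-and-play idea, the sketch below composes three interchangeable toy modules into an agent. All class and method names here (SentenceExtractor, DedupUpdater, OverlapRetriever, ToyMemoryAgent, observe, recall) are hypothetical and are not MemFactory's actual API; the real modules wrap LLM calls rather than string heuristics.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Extractor(Protocol):
    def extract(self, text: str) -> List[str]: ...


class Updater(Protocol):
    def update(self, bank: List[str], facts: List[str]) -> List[str]: ...


class Retriever(Protocol):
    def retrieve(self, bank: List[str], query: str, k: int) -> List[str]: ...


class SentenceExtractor:
    """Toy extractor: every non-empty sentence becomes a candidate fact."""

    def extract(self, text: str) -> List[str]:
        return [s.strip() for s in text.split(".") if s.strip()]


class DedupUpdater:
    """Toy updater: append only facts that are not already stored."""

    def update(self, bank: List[str], facts: List[str]) -> List[str]:
        merged = list(bank)
        for fact in facts:
            if fact not in merged:
                merged.append(fact)
        return merged


class OverlapRetriever:
    """Toy retriever: rank memories by word overlap with the query."""

    def retrieve(self, bank: List[str], query: str, k: int) -> List[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            bank,
            key=lambda m: len(query_words & set(m.lower().split())),
            reverse=True,
        )
        return ranked[:k]


@dataclass
class ToyMemoryAgent:
    """Hypothetical agent: any module satisfying the protocols can be swapped in."""

    extractor: Extractor
    updater: Updater
    retriever: Retriever
    bank: List[str] = field(default_factory=list)

    def observe(self, text: str) -> None:
        self.bank = self.updater.update(self.bank, self.extractor.extract(text))

    def recall(self, query: str, k: int = 1) -> List[str]:
        return self.retriever.retrieve(self.bank, query, k)


agent = ToyMemoryAgent(SentenceExtractor(), DedupUpdater(), OverlapRetriever())
agent.observe("Alice lives in Paris. Bob plays chess. Alice lives in Paris.")
top = agent.recall("where does Alice live")
```

Because the agent depends only on the three protocols, replacing OverlapRetriever with a different retriever changes one constructor argument and nothing else, which is the property the layered design is meant to provide.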

Technical Foundations and Algorithmic Contributions

Atomicization of Memory Operations

Standardizing the memory pipeline into extraction, update, and retrieval modules mirrors classical CRUD processes but endows the system with reinforcement-driven adaptability. For instance, the NaiveExtractor and NaiveUpdater replicate the paradigm in Memory-R1 but are interchangeable with more advanced selectors or consolidation engines. The RerankRetriever exemplifies modular enhancement: it adds reranking by a Large Reasoning Model (LRM) on top of base semantic retrieval, so the initially retrieved memories are re-evaluated before they are used.
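The discrete update decisions and the two-stage retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names apply_operation and retrieve_then_rerank, the integer-keyed bank, and both scoring functions are hypothetical stand-ins, and the rerank scorer here is a trivial heuristic in place of an LRM call.

```python
from typing import Callable, Dict, List, Optional

Scorer = Callable[[str, str], float]


def apply_operation(bank: Dict[int, str], op: str,
                    memory_id: Optional[int] = None,
                    text: Optional[str] = None) -> Dict[int, str]:
    """Return a new bank after one ADD / UPDATE / DEL / NONE decision."""
    new_bank = dict(bank)
    if op == "ADD":
        if text is None:
            raise ValueError("ADD needs text")
        new_bank[max(new_bank, default=-1) + 1] = text
    elif op == "UPDATE":
        if memory_id not in new_bank or text is None:
            raise ValueError("UPDATE needs an existing id and text")
        new_bank[memory_id] = text
    elif op == "DEL":
        new_bank.pop(memory_id, None)
    elif op != "NONE":
        raise ValueError(f"unknown operation: {op}")
    return new_bank


def retrieve_then_rerank(bank: Dict[int, str], query: str,
                         base_score: Scorer, rerank_score: Scorer,
                         n: int = 2, k: int = 1) -> List[str]:
    """Take the top-n memories by a cheap score, then reorder them with a second scorer."""
    candidates = sorted(bank.values(),
                        key=lambda m: base_score(query, m), reverse=True)[:n]
    return sorted(candidates,
                  key=lambda m: rerank_score(query, m), reverse=True)[:k]


def word_overlap(query: str, memory: str) -> float:
    """Cheap first-stage score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(memory.lower().split())))


def prefers_current(query: str, memory: str) -> float:
    """Stand-in for an LRM reranker: favor memories marked as current."""
    return 1.0 if "now" in memory.lower().split() else 0.0


# Update decisions applied one at a time.
bank: Dict[int, str] = {}
bank = apply_operation(bank, "ADD", text="User likes tea")
bank = apply_operation(bank, "ADD", text="User lives in Oslo")
bank = apply_operation(bank, "UPDATE", memory_id=0, text="User prefers coffee")
bank = apply_operation(bank, "DEL", memory_id=1)
bank = apply_operation(bank, "NONE")

# Two-stage retrieval over a separate toy bank.
memories = {0: "User used to drink tea", 1: "User drinks coffee now", 2: "User owns a bike"}
best = retrieve_then_rerank(memories, "what does the user drink",
                            word_overlap, prefers_current)
```

In this toy run the first stage ranks the outdated tea memory highest on word overlap, and the second stage promotes the current coffee memory, which is the kind of correction a reranking step is meant to provide.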

For end-to-end memory models (e.g., MemAgent), the Agent Module directly transforms input contexts and latent memory into a new recurrent state, bypassing discrete CRUD steps and mirroring recent trends in neural memory compression through scheduled overwriting rather than incremental modification.
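The recurrent alternative can be sketched as a loop that reads the context chunk by chunk and overwrites a bounded memory each step. This is an illustrative sketch only: run_recurrent_memory, keep_capitalized, and the token-list memory are assumptions made for the example, and in a MemAgent-style system the update function would be an LLM call that rewrites the memory text.

```python
from typing import Callable, List

UpdateFn = Callable[[List[str], List[str]], List[str]]


def split_into_chunks(tokens: List[str], size: int) -> List[List[str]]:
    """Cut the token stream into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def run_recurrent_memory(tokens: List[str], chunk_size: int,
                         memory_budget: int, update_fn: UpdateFn) -> List[str]:
    """Process a long context chunk by chunk with a fixed-size memory.

    Each step replaces the whole memory with update_fn(memory, chunk),
    truncated to the most recent `memory_budget` items.
    """
    if chunk_size <= 0 or memory_budget <= 0:
        raise ValueError("chunk_size and memory_budget must be positive")
    memory: List[str] = []
    for piece in split_into_chunks(tokens, chunk_size):
        memory = update_fn(memory, piece)[-memory_budget:]
    return memory


def keep_capitalized(memory: List[str], piece: List[str]) -> List[str]:
    """Stand-in for the learned policy: remember capitalized tokens."""
    return memory + [tok for tok in piece if tok[:1].isupper()]


text = "the cat Alice met Bob near the old Paris station with Carol"
final_memory = run_recurrent_memory(text.split(), chunk_size=4,
                                    memory_budget=3, update_fn=keep_capitalized)
```

The memory never grows past the budget, so the cost per step depends on the chunk size and budget rather than on the total context length.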

RL Policy Optimization: GRPO

Conventional PPO-based RL for LLMs entails high computational overhead due to the critic. GRPO, as natively integrated in MemFactory, optimizes for both efficiency and stability by computing advantage for each policy instance as follows:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$$

where $r_i$ is the reward for the $i$-th response among $G$ candidates. Because no critic is trained, VRAM use drops and RL training becomes feasible on resource-constrained hardware. The scheme also naturally supports outcome-driven or LLM-as-judge reward mechanisms, a critical requirement for long-context or retrieval-augmented tasks.
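The group-relative advantage above can be computed in a few lines. The sketch below is a minimal illustration, not MemFactory's trainer code: the small eps term in the denominator and the use of the population standard deviation are assumptions added for numerical safety, and combined_reward with its weights is a hypothetical way to fold a format check and a judge score into one scalar.

```python
import math
from typing import List


def combined_reward(format_ok: bool, judge_score: float,
                    w_format: float = 0.2, w_judge: float = 0.8) -> float:
    """Hypothetical scalar reward: weighted format check plus a judge score in [0, 1]."""
    return w_format * float(format_ok) + w_judge * judge_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own group.

    eps is an assumption: it avoids division by zero when every
    response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt with G = 4 sampled responses (illustrative values only).
rewards = [
    combined_reward(True, 1.0),
    combined_reward(True, 0.0),
    combined_reward(False, 0.0),
    combined_reward(True, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, and a group with identical rewards contributes zero advantage everywhere, so it provides no learning signal.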

Empirical Evaluation

MemFactory’s empirical study uses the MemAgent architecture as the reference agent, with Qwen3-1.7B and Qwen3-4B-Instruct as base LLMs. Evaluation covers in-domain (eval_50, eval_100) and OOD (eval_fwe_16384) tasks from the MemAgent benchmark suite, and all pipelines run on a single A800 80GB GPU. To improve training efficiency, the training contexts are shortened to one-third of their original length, a change intended to preserve the long-context character of the task.

The results support two main claims:

  • Consistent Performance Gain: Relative to the corresponding base models, MemFactory-trained agents improve by 14.8% on average with Qwen3-1.7B and by 7.3% with Qwen3-4B-Instruct. The 4B model keeps its gains on the out-of-distribution set, whereas the 1.7B model shows a drop there, so the evidence for policy generalization is stronger at the larger scale.
  • Practical Training Efficiency: The entire process is reproducible and tractable on a single A800 80GB GPU, lowering the entry barrier for RL-based memory research.
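For readers who want to reproduce this kind of comparison, the arithmetic behind a relative gain over repeated runs can be sketched as below. The definition (trained minus base, divided by base) is the standard one and is assumed here rather than taken from the paper, and the scores are made-up placeholders, not reported results.

```python
from statistics import mean
from typing import Sequence


def avg_at_k(run_scores: Sequence[float]) -> float:
    """Average a metric over k independent evaluation runs."""
    if not run_scores:
        raise ValueError("need at least one run")
    return mean(run_scores)


def relative_gain(base: float, trained: float) -> float:
    """Relative improvement of the trained score over the base score."""
    if base <= 0:
        raise ValueError("base score must be positive")
    return (trained - base) / base


# Hypothetical scores from four runs each; NOT results from the paper.
base_runs = [0.40, 0.42, 0.38, 0.40]
trained_runs = [0.44, 0.46, 0.42, 0.44]
gain = relative_gain(avg_at_k(base_runs), avg_at_k(trained_runs))
```

Reporting the spread across runs alongside the mean would make such a comparison easier to interpret.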

Implications and Prospects for Memory-RL Research

MemFactory’s unified abstraction is positioned to significantly accelerate empirical research and combinatorial baselining in memory-augmented RL for LLMs. It creates infrastructure for:

  • Rapid Prototyping: Modular assembly and direct comparison of memory lifecycles, retrieval, and update policies.
  • Benchmarking and SOTA Integration: Out-of-the-box implementations for Memory-R1, MemAgent, and RMM enable systematic SOTA benchmarking.
  • RL-driven Agent Innovation: Researchers can explore nontrivial reward schedules and train recurrent or compositional agents with minimal engineering overhead.

The architecture is extensible: future directions could integrate hybrid offline/online RL, multi-agent memory synchronization, or dynamic data curriculum. Additionally, the modular interface could facilitate direct adoption of more sophisticated reasoning and reranking modules, or multi-modal memory engines. The native use of a critic-free, memory-efficient RL algorithm (GRPO) suggests practical viability for scaling beyond the 1.7B and 4B models tested here and toward multi-task or autonomous agent platforms.

Conclusion

MemFactory fills a critical infrastructure void in the LLM ecosystem, presenting a modular, extensible, and RL-native framework for memory-augmented agents (2603.29493). It standardizes interfaces and decouples engineering, enabling rapid adoption and innovation in policy-driven memory management. Empirical evidence demonstrates its effectiveness in both in-domain and OOD tasks, justifying its adoption as a baseline for RL-based memory research. Its practical and theoretical flexibility positions it as a foundational infrastructure for the next wave of persistent, adaptive, long-horizon language agents.

Explain it Like I'm 14

MemFactory: A simple explanation for teens

What is this paper about?

This paper introduces MemFactory, a “build-it-like-Lego” toolkit for teaching AI assistants how to remember and use information over time. Instead of being just chatbots that forget past conversations, these AIs keep notes, update them, and pull them back when needed—like a student with a well-organized notebook. MemFactory gives researchers a single, easy place to plug in memory parts, train them with feedback, and test how well they work.

What questions are the researchers trying to answer?

In plain terms, the paper asks:

  • How can we make it much easier to build AI agents that remember important things over long periods?
  • Can we train these agents to manage their memory (what to save, what to delete, what to use) using rewards—like practicing a game and getting points for good moves?
  • Can one common toolkit support different memory styles from recent research, so people don’t have to rebuild everything from scratch each time?

How did they build and test it?

The team designed MemFactory as a unified framework with four main parts. Think of it like organizing a workshop with clear stations:

  • Module Layer: Small “tools” that do core memory jobs:
    • Extractor: pulls useful facts from a conversation (like highlighting key points in a textbook).
    • Updater: decides to add, delete, or change a memory (like editing your notes).
    • Retriever: finds the most relevant memory when answering (like flipping to the right page).
    • Agent Module: an all-in-one option for memory styles that update and use memory in one go.
  • Agent Layer: Puts the tools together into a full AI “worker” that runs the memory strategy.
  • Environment Layer: Feeds in tasks and judges the agent’s actions with rewards (scores), both for long-term memory banks and very long conversations.
  • Trainer Layer: Teaches the agent using a reinforcement learning method called GRPO.

What is reinforcement learning (RL)? It’s like training a player in a game: the AI tries actions, gets rewards if the outcome is good, and learns to make better choices next time. Here, the “game” is memory management—when to extract, update, or retrieve.

What is GRPO (Group Relative Policy Optimization)? Imagine you try several answers to the same question and then compare each answer to the group average. You learn from what did better or worse than the group without needing a separate “critic” model whose only job is to predict how good each answer will be. This saves computing power and still gives clear signals about which choices were good.

They also made MemFactory work out-of-the-box with several popular memory approaches from other papers:

  • Memory-R1: treats memory like a database you can add to, update, or delete from.
  • MemAgent: keeps a compact “state” that summarizes the past, so it doesn’t need to read everything again.
  • RMM: focuses on better picking and ranking which memories to use.

For testing, they used MemAgent-style agents and trained them with two open-source LLMs (Qwen3-1.7B and Qwen3-4B-Instruct). They evaluated on:

  • “In-domain” tests (similar to training tasks)
  • “Out-of-distribution” tests (different from training tasks, like surprise questions)

Everything ran on a single high-end GPU, showing that the setup is practical and reproducible.

What did they find, and why does it matter?

The key result: training memory behavior with MemFactory improved performance.

  • With the smaller model (Qwen3-1.7B), the average score improved by about 14.8%.
  • With the larger model (Qwen3-4B-Instruct), the average score improved by about 7.3% and was more consistent—even on surprise (OOD) tests.

Why this matters:

  • It shows that a single, unified toolkit can make complex memory training easier and still deliver better results.
  • It proves that reinforcement learning (with GRPO) can effectively teach an AI when and how to store or use memory.
  • It lowers the barrier for researchers to experiment, mix, and match memory strategies without heavy engineering.

What’s the bigger impact?

MemFactory could speed up progress in building smarter, long-term AI assistants—ones that remember useful facts, keep notes tidy, and personalize over time. By making the “memory brain” easy to assemble and train, it helps researchers focus on new ideas instead of rebuilding basic parts. Over time, this could lead to AI that’s more helpful in tutoring, customer support, planning, and personal assistants that truly remember what matters to you.

Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and open questions that remain unresolved in the paper, framed to guide future research:

  • Limited empirical scope: only the MemAgent-style recurrent memory policy is evaluated; no experiments validate MemFactory on the other “out-of-the-box” agents (Memory-R1, RMM) despite claiming support.
  • Algorithmic comparison missing: GRPO is the only RL algorithm implemented/evaluated; there is no head-to-head comparison with PPO, REINFORCE, DPO/RLHF hybrids, or critic-free alternatives under identical settings.
  • Reward design opacity: the paper mentions “multi-dimensional rewards” (Format Rewards and LLM-as-a-Judge) but does not specify exact reward definitions, weighting, normalization, or how module-level credit is assigned within a trajectory.
  • Credit assignment across modules: it remains unclear how GRPO advantages are attributed to specific atomic operations (extract/update/retrieve) in multi-step pipelines, or how delayed task rewards are backpropagated to individual memory decisions.
  • Sensitivity to GRPO hyperparameters: no ablations on group size G, sampling temperature, advantage normalization, or KL penalties to assess training stability and sample efficiency for long-horizon memory tasks.
  • Exploration strategy: the framework does not describe exploration mechanisms (e.g., entropy bonuses, top-k/top-p sampling choices) or their effects on policy learning in memory-intensive settings.
  • Evaluation breadth: only MemAgent datasets are used; generalization to other memory tasks (e.g., knowledge-intensive QA, multi-session personalization, tool-augmented agents) is untested.
  • OOD robustness gap: the 1.7B model shows a performance drop on the OOD set; there is no analysis of why, nor strategies to mitigate overfitting or improve transfer (e.g., curriculum learning, regularization, data augmentation).
  • Lack of statistical rigor: results are averaged over four trials but report no confidence intervals, significance tests, or variance across random seeds; robustness claims are therefore uncertain.
  • Training data alterations: contexts are shortened to one-third the original length for efficiency, which may change task difficulty; the impact of this change on validity and comparability to prior work is not evaluated.
  • Baseline completeness: there is no comparison to original MemAgent training pipelines, heuristic memory baselines, or non-RL fine-tuning (e.g., SFT/DPO) within MemFactory to isolate the contribution of GRPO.
  • Module-level ablations: no experiments show how swapping extractors/updaters/retrievers affects performance, limiting evidence for the claimed “Lego-like” benefits and guiding principles for module selection.
  • Reward model reliability: reliance on LLM-as-a-Judge is not audited for bias, drift, or prompt sensitivity; no calibration against human judgments or generative reward models is provided.
  • Long-horizon credit leakage: the framework does not address potential temporal credit assignment issues in very long contexts (e.g., whether early extraction errors are correctable and how).
  • Memory quality diagnostics: there is no analysis of memory bank health over time (redundancy, contradiction rates, decay, versioning), nor tools/metrics to quantify memory consolidation quality.
  • Robustness to noise and adversaries: the framework does not study memory poisoning, conflicting information, low-quality retrievals, or adversarial prompts, and offers no defenses (e.g., trust scoring, provenance tracking).
  • Scalability limits: latency, throughput, and memory footprint for large banks/contexts are not reported; the impact of FlashAttention-2 and vLLM on end-to-end agent latency is unmeasured.
  • Cost of reranking with LRMs: the RerankRetriever uses LRMs but the compute and cost trade-offs versus simpler rerankers (bi-encoders, cross-encoders) are not quantified.
  • Vector database and storage back-ends: the framework abstracts retrieval but does not evaluate or integrate diverse storage/indexing options (e.g., HNSW, FAISS, disk-backed stores) or their impact on training/inference.
  • Multi-modal and tool-augmented memories: the design’s applicability to non-text modalities (vision/audio) or tool outputs (APIs, code) remains unexplored.
  • Non-stationary environments: how MemFactory handles evolving user profiles, shifting reward functions, or changing knowledge bases (continual learning) is unspecified.
  • Safety, privacy, and compliance: there is no discussion of PII handling, retention policies, redaction, or compliance safeguards in long-term memory agents.
  • Interpretability and controllability: the framework lacks mechanisms to introspect, edit, or constrain memory operations (e.g., human-in-the-loop approvals, rule constraints, or explanations for add/update/delete).
  • Reproducibility details: training time, wall-clock costs, exact seeds, and hardware variability are not reported; claims of “reproducible on a single A800 80GB” need broader hardware validation.
  • Benchmark standardization: no unified, public benchmark suite or evaluation protocol for memory agents is proposed to enable cross-paper comparisons using MemFactory.
  • Extensibility beyond discrete operations: the module interfaces target discrete CRUD/retrieval steps; support for differentiable or continuous memory representations (e.g., key-value differentiable memory, recurrent state spaces) is unclear.
  • Hierarchical or multi-agent policies: the framework does not address hierarchical RL for multi-level memory control or shared memory policies across collaborating agents.
  • Catastrophic forgetting and stability: there is no study of whether fine-tuning memory policies via GRPO induces forgetting of base model capabilities or destabilizes generation quality.
  • Deployment readiness: guidance on productionization (caching, streaming updates, online learning, monitoring drift) and operational best practices is absent.

Practical Applications

Below is an overview of practical, real-world applications derived from the MemFactory framework, its methods, and its empirical findings. The items are grouped into immediate and longer-term opportunities, with sector linkages, potential tools/workflows, and feasibility notes.

Immediate Applications

  • Standardized research and prototyping for memory-augmented agents
    • Sectors: academia, AI labs, AI tooling vendors
    • Tools/workflows: out-of-the-box agents (MemoryR1Agent, MemoryAgent, MemoryRMMAgent); Module/Agent/Environment/Trainer layers; GRPO-based trainer; SwanLab dashboards; Transformers + FlashAttention-2 + vLLM
    • Assumptions/dependencies: access to open LLM checkpoints (e.g., Qwen3), single-GPU or similar compute; availability of long-context datasets; basic RL expertise; reliable reward functions (format checks, LLM-as-a-judge)
  • Rapid reproduction and benchmarking of Memory-RL baselines
    • Sectors: academia, evaluation platforms, enterprise AI research
    • Tools/workflows: MemoryBankEnv and LongcontextEnv for standardized dataloading/rewards; plug-and-play extract/update/retrieve modules; avg@4 evaluation; OOD test sets
    • Assumptions/dependencies: community adoption of common metrics; careful prompt/reward design to avoid reward hacking
  • CX chatbots with consistent, deduplicated long-term memory
    • Sectors: software (customer support), e-commerce, telecom
    • Tools/workflows: NaiveExtractor + NaiveUpdater to maintain concise memory banks; RerankRetriever to improve retrieval precision; integration with CRM and ticketing systems
    • Assumptions/dependencies: privacy compliance (GDPR/CCPA), user consent; secure storage (encryption, access control); domain-specific reward signals (e.g., resolution/CSAT proxies)
  • Enterprise knowledge copilots that remember projects and decisions
    • Sectors: enterprise productivity, knowledge management, project management
    • Tools/workflows: MemAgent-style RecurrentMemoryModule for fixed-length, recurrent state across sessions; role-based access control; A/B testing of updater policies via Lego-like module swaps
    • Assumptions/dependencies: integration with enterprise systems (wikis, code repos, docs); information security and SSO; policy for retention and deletion (DEL operation)
  • Retrieval-augmented generation (RAG) pipelines with RL-tuned reranking and memory hygiene
    • Sectors: developer tools, internal search, documentation assistants
    • Tools/workflows: RerankRetriever with LRM reranking as a post-retrieval stage; updater-driven deduplication and conflict resolution; vector DB or knowledge graph backends
    • Assumptions/dependencies: availability and cost of LRM or strong reranker; latency budgets; retrieval evaluation signals (clicks, citations, exact-match)
  • Data-labeling-lite optimization for memory policies
    • Sectors: startups, SMEs, internal AI teams
    • Tools/workflows: GRPO with rule-based and LLM-as-a-judge rewards to fine-tune memory extraction/update/retrieval without dense human labels
    • Assumptions/dependencies: judge reliability and calibration; safeguards against exploitation of judge prompts; periodic human audits
  • Personal productivity agents with cross-session memory
    • Sectors: consumer productivity, email/calendar assistants
    • Tools/workflows: inference-only agents assembled from MemFactory modules; lightweight memory bank for contacts, tasks, preferences; updater-driven “forgetting” policies
    • Assumptions/dependencies: user consent and local-first storage when possible; on-device or edge inference constraints; simple rewards (task completion, reminders success)
  • Education pilots: tutors that track learner progress and misconceptions
    • Sectors: education technology
    • Tools/workflows: MemoryBankEnv to store learner profiles; retrieval/reranking for lesson content; GRPO rewards from quiz outcomes and rubric scores
    • Assumptions/dependencies: aligned curricula; COPPA/FERPA compliance; bias monitoring across student subgroups
  • Evaluation harness for product teams
    • Sectors: product analytics, MLOps
    • Tools/workflows: map business KPIs (resolution rate, time-to-answer) into reward functions; SwanLab tracking; A/B tests by swapping modules and hyperparameters
    • Assumptions/dependencies: careful KPI-to-reward translation; offline replay vs. online RL choice; guardrails for customer-facing experiments
  • Cost-efficient single-GPU RL fine-tuning for memory behaviors
    • Sectors: broad (SMEs, independent researchers)
    • Tools/workflows: GRPO (no critic) to reduce memory footprint; batch padding/masking routines provided by MemFactory’s Trainer
    • Assumptions/dependencies: stable reward distributions; hyperparameter tuning; reproducibility across GPUs/models

Long-Term Applications

  • Consumer assistants with multi-month, privacy-preserving memory at scale
    • Sectors: consumer software, smart home
    • Tools/workflows: advanced Updater policies (ADD/UPDATE/DEL/NONE) that enforce retention schedules and forgetting curves; MemOS-style orchestration
    • Assumptions/dependencies: longitudinal datasets; scalable storage; robust privacy controls (user dashboards, revocation APIs); evaluation beyond short horizons
  • Clinical assistants that maintain longitudinal patient context
    • Sectors: healthcare
    • Tools/workflows: MemoryBankEnv integrated with EHR; retrieval over structured medical knowledge graphs; RL-tuned memory policies anchored to clinical outcomes
    • Assumptions/dependencies: HIPAA/GDPR compliance; clinical validation and safety; bias and fairness audits; model risk management
  • Compliant financial advisors with auditable client memory
    • Sectors: finance/wealth management, insurance
    • Tools/workflows: structured CRUD with audit trails; DEL operations for “right to be forgotten”; RL rewards tied to suitability checks and compliance rules
    • Assumptions/dependencies: regulatory approval (SEC/FINRA), documentation for audits; robust red-teaming and adversarial testing
  • Long-horizon memory for autonomous robots and embodied agents
    • Sectors: robotics, logistics, manufacturing
    • Tools/workflows: extend Agent/Module layers to multimodal memory (sensor, vision); RL in simulated/real environments; retrieval from maps/task graphs
    • Assumptions/dependencies: multimodal model support; real-time constraints; safety/reliability in physical settings; sample-efficient reward design
  • Lab and R&D assistants that track experiments, hypotheses, and findings
    • Sectors: pharmaceuticals, materials, academic research
    • Tools/workflows: structured memory schemas aligned with ELNs; Updater rules to resolve conflicting results; OOD generalization checks
    • Assumptions/dependencies: integration with lab instruments/LIMS; factual grounding and provenance; human-in-the-loop review
  • Memory governance standards and policy toolkits
    • Sectors: policy/regulation, compliance, standards bodies
    • Tools/workflows: MemFactory as a testbed for retention/deletion/right-to-be-forgotten policies; policy-driven reward shaping; transparency reporting
    • Assumptions/dependencies: consensus on definitions and metrics of “memory quality” and “harm”; cross-jurisdictional alignment
  • Shared organizational memory for multi-agent collaboration
    • Sectors: enterprise collaboration, project management, software engineering
    • Tools/workflows: concurrent MemoryBankEnv with access control, versioning, and conflict resolution; RL for retrieval under multi-user constraints
    • Assumptions/dependencies: database design for concurrency; identity and permissions; clarity on ownership and traceability
  • Reward modeling for memory quality at scale
    • Sectors: AI research, platform providers
    • Tools/workflows: replace LLM-as-a-judge with learned Generative Reward Models tailored to memory (faithfulness, utility, recency); GRPO/VERL-like training stacks
    • Assumptions/dependencies: curated datasets for reward model training; compute resources; methods to prevent reward misspecification
  • Edge/on-device personal memory agents
    • Sectors: mobile, IoT, automotive
    • Tools/workflows: quantized small LLMs; low-footprint GRPO or periodic federated updates; encrypted local memory stores
    • Assumptions/dependencies: device compute and energy limits; privacy-preserving telemetry; efficient update schedules
  • Cross-modal, lifelong memory for assistive and creative tools
    • Sectors: accessibility, education, media/creative software
    • Tools/workflows: extend extractors/retrievers to images/audio/video; unified memory schemas across modalities; multimodal rewards
    • Assumptions/dependencies: strong multimodal base models; datasets spanning modalities and long horizons; evaluation protocols
  • Safety-first memory policies (sensitive data avoidance and selective forgetting)
    • Sectors: security, compliance, public sector
    • Tools/workflows: safety classifiers integrated into Updater to prevent storing sensitive PII; RL penalties for retaining prohibited content; automated deletion workflows
    • Assumptions/dependencies: reliable detection of sensitive content; explainability/auditability; alignment with legal definitions
  • Community benchmarks and leaderboards for memory agents
    • Sectors: academia, industry consortia, evaluation platforms
    • Tools/workflows: standardized MemFactory tasks/environments (in-domain and OOD); long-horizon metrics and robustness tests
    • Assumptions/dependencies: broad adoption; maintenance of datasets and metrics; neutral hosting and governance

These applications leverage MemFactory’s core innovations—modular “Lego-like” memory operations, unified environments and rewards, and compute-efficient GRPO—to lower engineering barriers and accelerate real-world deployment of memory-augmented agents across sectors. Feasibility hinges on responsible reward design, privacy/compliance, dataset availability, and integration with production-grade storage and MLOps.

Glossary

  • Advantage (RL): The estimated benefit of an action over a baseline in policy-gradient methods, used to weight updates. "Specifically, the advantage $\hat{A}_i$ for the $i$-th candidate with reward $r_i$ is calculated as:"
  • Agent Layer: The component that composes modules and executes the policy to produce interactions. "The Agent Layer builds upon the Module Layer and serves as the central policy executor"
  • Agent Module: A module type that encapsulates end-to-end memory policies without separate extract/update phases. "we introduce an additional Agent Module class within this layer."
  • Answer Agent: A model component trained to transform or distill memory into answers or summaries. "and an Answer Agent for memory distillation."
  • AutoClasses (Transformers): Factory utilities from Hugging Face Transformers to automatically load model/tokenizer classes. "we use AutoClasses from the Transformers library"
  • Baseline (policy gradient): A reference value subtracted from rewards to reduce variance in gradient estimates. "estimates the baseline through intra-group reward normalization."
  • CRUD operations: Discrete data-management actions—Create, Read, Update, Delete—used here for memory editing. "discrete CRUD operations (e.g., ADD, UPDATE, DELETE)"
  • DAPO: A reinforcement-learning optimization method (mentioned as Multi-Conversation DAPO) used to learn memory overwrite policies. "It employs Multi-Conversation DAPO to learn an optimal “overwrite” policy"
  • Direct Preference Optimization (DPO): An alignment algorithm that optimizes models directly from pairwise preferences. "alignment algorithms (e.g., SFT, DPO)"
  • Environment Layer: The component that prepares data into states and computes reward signals for the agent. "The Environment Layer serves as the interface between the agent and the task."
  • Exact-match reward: A rule-based reward that grants credit when outputs exactly match targets. "particularly when optimizing rule-based or outcome-driven rewards (e.g., exact-match or LLM-as-a-judge scores)"
  • FlashAttention-2: An optimized attention algorithm that accelerates and reduces memory use for long-sequence transformers. "we integrate FlashAttention-2"
  • Format Rewards: Reward signals that assess whether model outputs follow a prescribed structure or format. "including Format Rewards and LLM-as-a-Judge evaluations."
  • Group Relative Policy Optimization (GRPO): A PPO-style algorithm that removes the value model by normalizing rewards within sampled groups. "MemFactory natively integrates Group Relative Policy Optimization (GRPO) to fine-tune the agent's internal memory management strategies"
  • Intra-group reward normalization: The technique of normalizing rewards across a sampled group to compute advantages without a critic. "estimates the baseline through intra-group reward normalization."
  • Large Reasoning Models (LRMs): Models specialized for complex reasoning, used here to rerank retrieved memories. "the RerankRetriever leverages Large Reasoning Models (LRMs) to re-evaluate and rerank the initially retrieved memories"
  • LLM-as-a-Judge: An evaluation paradigm where an LLM scores model outputs for quality or correctness. "including Format Rewards and LLM-as-a-Judge evaluations."
  • LLM attribution signals: Signals indicating whether retrieved content influenced the model’s output, used as supervision. "It leverages unsupervised LLM attribution signals (i.e., whether a retrieved memory was actually cited)"
  • Memory bank: A persistent, updatable store of extracted facts and experiences for long-term use. "ensuring that the memory bank remains concise and highly accurate during prolonged user interactions."
  • Memory Extractor: A component that parses raw context to produce structured memory entries. "The primary function of the memory extractor is to parse the raw contexts into structured memory pieces."
  • Memory Retriever: A component that fetches relevant memories from the memory bank to condition responses. "The memory retriever is tasked with fetching the most relevant memory from the memory bank"
  • Memory Updater: A component that compares candidate memories with existing ones and decides how to modify the store. "the updater assigns one of four operations to manage the memory states"
  • MemAgent: A recurrent-memory paradigm that treats memory as a latent state and learns overwrite policies via RL. "MemAgent \cite{yu2025memagent}"
  • Memory-R1: A framework that trains a memory manager to execute discrete memory operations with RL. "Memory-R1 \cite{yan2026memoryr1}"
  • Memory-RL: The use of reinforcement learning to optimize memory extraction, updating, and retrieval in agents. "unique paradigm of Memory-RL."
  • Module Layer: The layer that defines standardized, plug-and-play atomic memory operations. "The Module Layer operates as the fundamental core of the framework"
  • Out-of-distribution (OOD): Data that differs from the training distribution, used to test generalization. "one out-of-distribution (OOD) dataset (eval_fwe_16384)."
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques that fine-tune small subsets of parameters or adapters rather than the full model. "Parameter-Efficient Fine-Tuning (PEFT)"
  • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm using clipped objectives and typically a value critic. "While Proximal Policy Optimization (PPO) is the prevailing algorithm"
  • REINFORCE algorithm: A Monte Carlo policy-gradient method that updates policies using sampled returns. "via the REINFORCE algorithm."
  • Retrieval-Augmented Generation (RAG): A method that retrieves external context to ground generative model outputs. "static Retrieval-Augmented Generation (RAG) paradigms"
  • Retrospective Reflection: A retrieval refinement approach that updates retrieval policies by reflecting on whether memories were cited. "via 'Retrospective Reflection.'"
  • Reranker: A lightweight model that reorders retrieved candidates based on relevance signals. "update a lightweight reranker online via the REINFORCE algorithm."
  • Reranking: A post-retrieval process that reorders candidates to improve precision. "Reranking is a widely adopted post-retrieval technique"
  • RerankRetriever: A retriever variant that applies a reranking step to initial retrieval results. "we also provide the RerankRetriever."
  • Recurrent state: A fixed-length latent memory variable carried across text segments to summarize and compress long histories. "treats a fixed-length latent memory variable as a recurrent state across text segments."
  • RecurrentMemoryModule: An implementation of end-to-end recurrent memory policies following the MemAgent design. "we implemented the RecurrentMemoryModule following its design."
  • Rollout trajectories: Sequences of states, actions, and rewards generated by the agent for training. "and generate rollout trajectories during interaction."
  • Sample efficiency: The effectiveness of learning from limited sampled trajectories or rewards. "while maintaining high sample efficiency"
  • SFT (Supervised Fine-Tuning): Fine-tuning on labeled input-output pairs to align model behavior. "alignment algorithms (e.g., SFT, DPO)"
  • SwanLab: A training monitoring and visualization toolkit integrated into the framework. "natively integrates SwanLab"
  • Trainer Layer: The optimization engine that applies GRPO to fine-tune memory policies. "The Trainer Layer serves as the core optimization engine"
  • vLLM: A high-throughput LLM inference engine optimized for efficient serving. "such as vLLM"
  • Value network (critic): The model that estimates the expected return for states or state-actions in actor–critic methods. "it requires an auxiliary value network (critic) comparable in size to the policy model."
  • VERL: A specialized library for scalable RL pipelines supporting high-throughput training. "specialized libraries like VERL \cite{Sheng_2025} offer high-throughput infrastructures specifically optimized for large-scale Reinforcement Learning (RL) pipelines."
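
The Advantage, Baseline, and Intra-group reward normalization entries above can be illustrated with a short sketch. It assumes the commonly used GRPO formulation, in which each candidate's reward is centered on the group mean and scaled by the group standard deviation, $\hat{A}_i = (r_i - \mathrm{mean}(r)) / \mathrm{std}(r)$. The function name `group_relative_advantages`, the stabilizer `eps`, and the choice of population standard deviation are assumptions made for this example and are not taken from the MemFactory code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled candidates.

    The group mean acts as the baseline, so no value network (critic) is needed.
    `eps` is an illustrative stabilizer (an assumption) that avoids division by
    zero when every candidate in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidates sampled for the same prompt, scored with an exact-match reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # approximately [1.0, -1.0, -1.0, 1.0]
```

Candidates that score above the group mean receive positive advantages and are reinforced, while those below receive negative ones. When all rewards in a group are identical, every advantage is zero and that group contributes no gradient signal.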

Open Problems

We found no open problems mentioned in this paper.
