ShardMemo: Cost-Aware Tiered Memory for LLMs
- ShardMemo is a tiered memory system with rigid budgets for working memory, persistent evidence, and procedural skills, ensuring predictable resource use.
- It employs masked MoE routing with cost-aware gating and adaptive probe selection, significantly enhancing retrieval quality and efficiency.
- Empirical results show ShardMemo outperforms previous systems on benchmarks like LoCoMo and HotpotQA, reducing latency and vector scans.
ShardMemo is a tiered, cost-aware external memory system for agentic LLM platforms, designed to deliver scalable, predictable, and budgeted memory retrieval in scenarios involving large, persistent evidence stores and procedural skill libraries. Its architecture addresses core bottlenecks encountered with centralized memory indexes and static partition strategies, particularly as memory volume and concurrent multi-agent execution increase. ShardMemo introduces strict scope constraints, cost-controlled routing, and sharded approximate nearest neighbor (ANN) architectures organized across three functional tiers, with explicit mechanisms for both agent-session state and reusable skills. Empirically, ShardMemo demonstrates significant retrieval quality and efficiency gains over previous agentic LLM memory systems such as GAM, notably on benchmarks like LoCoMo, HotpotQA, and ToolBench (Zhao et al., 29 Jan 2026).
1. Architectural Overview and Formal Specification
ShardMemo is constructed as a three-tiered memory service. Each request is processed through tiers with independent budgets and scope filters:
- Tier A: Maintains bounded agent- or session-specific "working memory" under a strict token cap $B_A$, typically for short-lived notes or intermediate plans. For a request $q_t$, working memory is selected via $M^A_t = \{\, m \in \mathcal{M}^A : \sigma_t(m) = 1 \,\},\quad \mathrm{tokens}(M^A_t) \leq B_A,$ where $\sigma_t$ is a boolean scope predicate.
- Tier B: Handles persistent evidence in ANN-indexed shards, probing at most $P$ eligible shards in parallel for each query and merging their results into a unified Top-$K$ evidence set. Scope predicates mask ineligible shards before any scoring occurs, ensuring that ANN search and scoring are strictly limited to eligible data defined by per-shard metadata.
- Tier C: Stores versioned, schema-validated procedural "skills" (e.g., tool-call templates plus validation tests), organized as a library $\mathcal{S}$. Skills are retrieved under a strict step budget $R$ via similarity search, with failures or inapplicability defaulting to fallback evidence retrieval through Tier B.
Each request embeds scope predicates $\sigma_t$ and budgets $(B_A, P, K, R)$, tightly controlling both access and resource consumption (Zhao et al., 29 Jan 2026).
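The per-request budget envelope and Tier A's scope-then-truncate selection can be sketched as follows. This is a minimal illustration; all field names, default values, and the `tokens` bookkeeping are assumptions for the sketch, not details from the paper:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical request envelope: every request carries its own scope
# predicate sigma_t and explicit per-tier budgets (B_A, P, K, R), so
# resource use is bounded before any retrieval happens.
@dataclass
class MemoryRequest:
    query: str
    scope: Callable[[dict], bool]   # boolean eligibility predicate sigma_t
    token_cap: int = 512            # Tier A working-memory cap B_A
    probe_cap: int = 3              # Tier B max shards to probe, P
    top_k: int = 8                  # Tier B result budget K
    skill_budget: int = 2           # Tier C retrieval budget R

def tier_a_select(req: MemoryRequest, notes: list[dict]) -> list[dict]:
    """Scope-filter working-memory notes, then truncate to the token cap."""
    selected, used = [], 0
    for note in notes:
        if not req.scope(note):     # scope check happens before anything else
            continue
        if used + note["tokens"] > req.token_cap:
            break
        selected.append(note)
        used += note["tokens"]
    return selected

notes = [
    {"text": "plan step 1", "tokens": 300, "agent": "a1"},
    {"text": "other agent", "tokens": 100, "agent": "a2"},
    {"text": "plan step 2", "tokens": 250, "agent": "a1"},
]
req = MemoryRequest(query="next step?", scope=lambda n: n["agent"] == "a1")
wm = tier_a_select(req, notes)      # only in-scope notes, within B_A
```

Here the second note is dropped by the scope predicate and the third by the token cap, so only the first note survives.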
2. Masked Mixture-of-Experts (MoE) Routing and Scope Enforcement
ShardMemo enforces a "scope-before-routing" guarantee by masking out ineligible Tier B shards from both routing and ANN search. For eligible shards $j \in \mathcal{E}_t$, the system computes masked MoE gating scores $g_j = \mathrm{score}(\bm{z}_t, \bm{e}_j) - \lambda\,\widehat{c}_j,$ where $\bm{z}_t$ concatenates the embedded query $q_t$ with additional structured features, $\bm{e}_j$ is a learned summary for shard $j$, and $\widehat{c}_j$ estimates per-shard cost (such as I/O or scan count). The tradeoff parameter $\lambda$ adjusts cost aversion.
Normalized handling probabilities for each shard are given by $p_j = \frac{\exp(g_j)}{\sum_{j' \in \mathcal{E}_t} \exp(g_{j'})},\quad j \in \mathcal{E}_t,$ with ineligible shards ($j \notin \mathcal{E}_t$) strictly excluded from consideration.
Routing employs fixed Top-$P$ or adaptive Top-$p$ selection:
- Fixed Top-$P$: Probes the $P$ highest-scoring shards by $p_j$.
- Adaptive Top-$p$: Sorts shards by $p_j$ and selects the smallest probe count whose cumulative probability reaches a dynamic threshold $\tau_t$ (clipped between $\tau_{\min}$ and $\tau_{\max}$), subject to the global probe cap $P$.
This adaptivity allows confident queries to probe fewer shards, while uncertainty triggers broader shard activation, always bounded by $P$ (Zhao et al., 29 Jan 2026).
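Under these definitions, masked gating and adaptive Top-$p$ probe selection reduce to a few lines. The sketch below assumes precomputed per-shard relevance scores and cost estimates; the functional form of the score and the choice of $\lambda$ are illustrative:

```python
import math

def masked_gating(scores: dict[int, float], costs: dict[int, float],
                  eligible: set[int], lam: float = 0.1) -> dict[int, float]:
    """Cost-adjusted gating logits g_j = score_j - lam * cost_j,
    softmax-normalized over ELIGIBLE shards only: ineligible shards
    never receive a probability (scope-before-routing)."""
    logits = {j: scores[j] - lam * costs[j] for j in eligible}
    m = max(logits.values())                      # stabilize the softmax
    exps = {j: math.exp(g - m) for j, g in logits.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}

def adaptive_top_p(probs: dict[int, float], tau: float,
                   probe_cap: int) -> list[int]:
    """Probe the smallest prefix of shards (ranked by probability) whose
    cumulative mass reaches tau, never exceeding the global probe cap."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, mass = [], 0.0
    for j in ranked:
        chosen.append(j)
        mass += probs[j]
        if mass >= tau or len(chosen) == probe_cap:
            break
    return chosen
```

A confident distribution (one shard holding most of the mass) yields a single probe, while a flatter distribution activates more shards up to the cap.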
3. Parallel Shard-Local ANN Retrieval and Global Merging
For the selected probe set $\mathcal{P}_t$, ShardMemo executes each shard-local ANN index in parallel, unions the resulting candidate sets, re-applies scope filters, and returns the global Top-$K$ results under the request budget: $R^B_t = \operatorname{TopK}_{x \in \bigcup_{j \in \mathcal{P}_t} \mathrm{ANN}_j(\bm{z}_t)}\; \mathrm{score}(\bm{z}_t, x),\quad |R^B_t| \leq K.$ This distributed retrieval prevents bottlenecks associated with centralized indexes and enables efficient, cost-controlled evidence selection for downstream agentic tasks.
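A minimal sketch of this fan-out/merge step, assuming each shard index is a callable returning `(score, item)` pairs. The thread-pool parallelism and post-union re-filtering mirror the description above, but the API itself is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def retrieve_merge(query_vec, probe_set, shard_indexes, scope, k):
    """Run each shard-local ANN search in parallel, union the candidate
    sets, re-apply the scope filter, and keep the global top-k by score.
    shard_indexes[j] is any callable: vec -> [(score, item), ...]."""
    with ThreadPoolExecutor(max_workers=len(probe_set)) as pool:
        results = pool.map(lambda j: shard_indexes[j](query_vec), probe_set)
    candidates = [(s, it) for shard_hits in results for (s, it) in shard_hits
                  if scope(it)]   # defense in depth: re-filter post-union
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])

# Toy shards with hand-written hit lists (no real ANN index involved).
shard_indexes = {
    0: lambda v: [(0.9, {"id": "a", "ok": True}),
                  (0.5, {"id": "b", "ok": False})],
    1: lambda v: [(0.8, {"id": "c", "ok": True})],
}
out = retrieve_merge(None, [0, 1], shard_indexes, lambda it: it["ok"], k=2)
```

Note that the out-of-scope candidate `"b"` is discarded during the merge even though its shard was probed, preserving the scope guarantee end to end.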
When training data is available, evidence-to-shard supervision is used to optimize the router, concentrating probability mass on the "gold" shards $\mathcal{G}_t$ via a multi-positive set-likelihood loss: $\mathcal{L}_{\mathrm{route}} = -\log \sum_{j \in \mathcal{G}_t} p_j.$ This generalizes cross-entropy (which it recovers when $|\mathcal{G}_t| = 1$) and directly enhances the hit rates of the routing mechanism for both fixed and adaptive strategies (Zhao et al., 29 Jan 2026).
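The set-likelihood objective itself is a one-liner over the normalized routing probabilities. In this illustrative sketch, `probs` is the router's per-shard distribution and `gold` the supervised shard set:

```python
import math

def set_likelihood_loss(probs: dict[int, float], gold: set[int]) -> float:
    """Multi-positive routing loss: -log of the total probability mass
    the router places on the gold shard set. With a single gold shard
    this reduces to standard cross-entropy."""
    return -math.log(sum(probs[j] for j in gold))
```

Minimizing this loss pushes probability mass onto any mix of the gold shards, rather than forcing the router to pick one positive as ordinary cross-entropy would.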
4. Procedural Skill Library and Fallback
Tier C of ShardMemo maintains a library $\mathcal{S}$ of versioned, schema-checked skills, each comprising tool-call templates and associated deterministic validation tests. Retrieval occurs under a schema/tool scope filter $\mathcal{W}_t \subseteq \mathcal{S}$ and a tight skill budget $R$: $\mathcal{U}_t = \operatorname{TopR}_{s \in \mathcal{W}_t}\;\mathrm{sim}(q_t, s),\quad |\mathcal{U}_t| \leq R.$ Skill execution is attempted with slot filling; if retrieval fails or an applicable skill is not found, the request is transparently passed to Tier B for evidence retrieval using the same probe and result budgets. This design ensures robust availability of both procedural and evidentiary retrieval under dynamic, budgeted constraints (Zhao et al., 29 Jan 2026).
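The retrieve-or-fall-back control flow for Tier C can be sketched as below; the function names, the similarity interface, and the `("skills", …)`/`("evidence", …)` return convention are assumptions for the sketch:

```python
def retrieve_skill_or_evidence(query, skills, sim, scope, r, evidence_fallback):
    """Tier C retrieval: take the top-R in-scope skills by similarity;
    if the scoped set is empty (no applicable skill), transparently fall
    back to Tier B evidence retrieval under the same request budgets."""
    scoped = [s for s in skills if scope(s)]                 # schema/tool filter
    ranked = sorted(scoped, key=lambda s: sim(query, s), reverse=True)[:r]
    if ranked:
        return ("skills", ranked)
    return ("evidence", evidence_fallback(query))

skills = [{"name": "s1", "tool": "x"}, {"name": "s2", "tool": "y"}]
sim = lambda q, s: 1.0 if s["tool"] == "x" else 0.5
kind, hits = retrieve_skill_or_evidence(
    "q", skills, sim, scope=lambda s: True, r=1,
    evidence_fallback=lambda q: ["evidence hit"])
```

When the scope filter admits no skill at all, the same call returns the Tier B fallback result instead, so the caller always receives something retrievable within budget.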
5. Empirical Results and Performance Analysis
ShardMemo achieves significant improvements over prior memory systems by combining strict eligibility masking, cost-aware MoE gating, and parallel sharded ANN. On the LoCoMo conversational-memory benchmark using GPT-OSS-120B, it outperforms the strongest baseline (GAM) with gains of +5.70 F1 (single-hop), +5.11 F1 (multi-hop), +6.82 F1 (temporal), and +6.03 F1 (open-domain). Under a fixed-probe regime, ShardMemo lifts ShardHit@3 from 0.67 (cosine-to-prototype) to 0.82, reduces average per-query vector scans from 521 to 414 (–20.5%), and lowers p95 latency from 95 ms to 76 ms (–19.8 ms).
On HotpotQA with large context windows (56K–448K tokens), ShardMemo delivers F1 scores of 63.41/61.88/57.95 across input lengths, consistently outperforming GAM by +1.31, +0.96, and +0.55 F1, respectively. ToolBench benchmarks demonstrate that Tier C skill retrieval yields Precision@3 of 0.97 (+10.2% over embedding similarity) and StepRed of 1.94 (+7.2%), with mean retrieval latency reduced by ~19%. Budget sweeps consistently place ShardMemo on a better accuracy–efficiency trade-off curve than both centralized and naively partitioned retrieval baselines (Zhao et al., 29 Jan 2026).
6. Scope Guarantees, Predictable Budgets, and ShardMemo’s Significance
By implementing mandatory eligibility masking ahead of all semantic scoring or ANN access, ShardMemo enforces strong multi-tenant, schema, and tool-level safety, never exposing out-of-scope data to routing or retrieval logic. Cost-aware gating and strict probe/result caps deliver predictable per-request resource consumption, essential for concurrent, high-volume agentic LLM deployments. The separation of working state, persistent evidence, and procedural skills into dedicated, budgeted tiers enables composability and system reliability as system complexity and parallelism scale. This formalization and empirical performance indicate ShardMemo’s role in advancing scalable, agentic external memory for LLMs beyond the limitations of central indexes and static partitions (Zhao et al., 29 Jan 2026).