
ShardMemo: Cost-Aware Tiered Memory for LLMs

Updated 5 February 2026
  • ShardMemo is a tiered memory system with rigid budgets for working memory, persistent evidence, and procedural skills, ensuring predictable resource use.
  • It employs masked MoE routing with cost-aware gating and adaptive probe selection, significantly enhancing retrieval quality and efficiency.
  • Empirical results show ShardMemo outperforms previous systems on benchmarks like LoCoMo and HotpotQA, reducing latency and vector scans.

ShardMemo is a tiered, cost-aware external memory system for agentic LLM platforms, designed to deliver scalable, predictable, and budgeted memory retrieval in scenarios involving large, persistent evidence stores and procedural skill libraries. Its architecture addresses core bottlenecks encountered with centralized memory indexes and static partition strategies, particularly as memory volume and concurrent multi-agent execution increase. ShardMemo introduces strict scope constraints, cost-controlled routing, and sharded approximate nearest neighbor (ANN) architectures organized across three functional tiers, with explicit mechanisms for both agent-session state and reusable skills. Empirically, ShardMemo demonstrates significant retrieval quality and efficiency gains over previous agentic LLM memory systems such as GAM, notably on benchmarks like LoCoMo, HotpotQA, and ToolBench (Zhao et al., 29 Jan 2026).

1. Architectural Overview and Formal Specification

ShardMemo is constructed as a three-tiered memory service. Each request $q_t$ is processed through tiers with independent budgets and scope filters:

  • Tier A: Maintains bounded agent- or session-specific "working memory" under a strict token cap $M$, typically for short-lived notes or intermediate plans. For a request $q_t$, working memory is selected via

$$A_t = \mathrm{ReadA}(q_t, \psi^A_t), \quad |A_t| \leq M,$$

where $\psi^A_t$ is a boolean scope predicate.

  • Tier B: Handles persistent evidence in $S$ ANN-indexed shards, probing at most $B_{\mathrm{probe}}$ eligible shards in parallel for each query and merging their results into a unified Top-$K$ evidence set. Scope predicates $\psi^B_t$ mask ineligible shards before any scoring occurs, ensuring that ANN search and scoring are strictly limited to eligible data defined by per-shard metadata $m_j$.
  • Tier C: Stores versioned, schema-validated procedural "skills" (e.g., tool-call templates plus validation tests), organized as a library $\mathcal{W}$. Skills are retrieved under a strict step budget $R$ via similarity search, with failures or inapplicability defaulting to fallback evidence retrieval through Tier B.

Each request embeds scope predicates $(\psi^A_t, \psi^B_t, \psi^C_t)$ and budgets $(M, B_{\mathrm{probe}}, K, R)$, tightly controlling both access and resource consumption (Zhao et al., 29 Jan 2026).
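The per-request contract described above can be sketched as a small data structure. This is an illustrative reading of the spec, not the paper's API; all names and default values are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class MemoryRequest:
    """Hypothetical ShardMemo request envelope: a query plus its own
    scope predicates and hard budgets, one per tier."""
    query: str
    scope_a: Callable[[Dict], bool]   # psi^A_t: working-memory eligibility
    scope_b: Callable[[Dict], bool]   # psi^B_t: shard eligibility on metadata m_j
    scope_c: Callable[[Dict], bool]   # psi^C_t: skill/schema eligibility
    M: int = 512        # token cap for Tier A working memory
    B_probe: int = 3    # max shards probed in Tier B
    K: int = 8          # global Top-K evidence budget
    R: int = 3          # Tier C skill step budget
```

Because every budget travels with the request, a serving layer can reject or clip a request before any index is touched, which is what makes per-request resource use predictable.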

2. Masked Mixture-of-Experts (MoE) Routing and Scope Enforcement

ShardMemo enforces a "scope-before-routing" guarantee by masking out ineligible Tier B shards from both routing and ANN search. For the eligible set $\mathcal{S}_t = \{j : \psi^B_t(m_j) = 1\}$, the system computes masked MoE gating scores

$$s_{t,j} = f_\theta(\bm{r}_t, \sigma_j) - \alpha c_{t,j}, \qquad j \in \mathcal{S}_t,$$

where $\bm{r}_t$ concatenates the embedded query $\bm{z}_t = \mathrm{Embed}(q_t)$ with additional structured features $\bm{\varphi}_t$, $\sigma_j$ is a learned summary for shard $j$, and $c_{t,j}$ estimates per-shard cost (such as I/O or scan count). The tradeoff parameter $\alpha$ adjusts cost aversion.

Normalized handling probabilities for each shard are given by

$$p_{t,j} = \frac{\exp(s_{t,j})}{\sum_{k \in \mathcal{S}_t} \exp(s_{t,k})}, \qquad j \in \mathcal{S}_t,$$

with ineligible shards strictly excluded from consideration.
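A minimal sketch of this masked, cost-aware gating, assuming the relevance scores $f_\theta(\bm{r}_t, \sigma_j)$ and cost estimates $c_{t,j}$ are already available as floats (the function name and signature are illustrative):

```python
import math

def masked_gating(scores, costs, eligible, alpha=0.1):
    """Cost-adjusted masked softmax over eligible shards.

    scores[j] -- relevance f_theta(r_t, sigma_j) for shard j
    costs[j]  -- estimated per-shard cost c_{t,j}
    eligible  -- set of shard ids with psi^B_t(m_j) = 1
    Ineligible shards never receive probability mass: they are
    dropped before the softmax, not merely down-weighted.
    """
    s = {j: scores[j] - alpha * costs[j] for j in eligible}
    m = max(s.values())                          # numerical stabilization
    exp = {j: math.exp(v - m) for j, v in s.items()}
    z = sum(exp.values())
    return {j: e / z for j, e in exp.items()}
```

Masking before the softmax (rather than zeroing probabilities afterward) is what gives the scope-before-routing guarantee: an out-of-scope shard can never influence the normalization.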

Routing employs fixed Top-$B_{\mathrm{probe}}$ or adaptive Top-$P$ selection:

  • Fixed Top-$B_{\mathrm{probe}}$: Probes the $B_{\mathrm{probe}}$ highest-scoring shards by $s_{t,j}$.
  • Adaptive Top-$P$: Sorts by $p_{t,j}$ and selects the smallest probe count $\hat b_t$ whose cumulative probability covers a dynamic threshold $\tau_t$ (clipped between $P_{\min}$ and $P_{\max}$), subject to the global probe cap.

This adaptivity allows confident queries to probe fewer shards, while uncertainty triggers broader shard activation, always bounded by $B_{\mathrm{probe}}$ (Zhao et al., 29 Jan 2026).
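The adaptive Top-$P$ rule above can be sketched in a few lines, assuming $\tau_t$ has already been clipped to $[P_{\min}, P_{\max}]$ upstream (the function name is an assumption):

```python
def adaptive_top_p(probs, tau, B_probe):
    """Smallest probe set whose cumulative probability covers tau,
    hard-capped at B_probe shards.

    probs -- {shard_id: p_{t,j}} over eligible shards only
    tau   -- dynamic cumulative-probability threshold tau_t
    """
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    chosen, mass = [], 0.0
    for j, p in ranked:
        chosen.append(j)
        mass += p
        if mass >= tau or len(chosen) == B_probe:
            break
    return chosen
```

A peaked distribution (one confident shard) stops after a single probe, while a flat distribution walks down the ranking until the cap $B_{\mathrm{probe}}$ bounds the cost.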

3. Parallel Shard-Local ANN Retrieval and Global Merging

For the selected probe set $\mathcal{P}_t$, ShardMemo queries each shard-local ANN index $\mathrm{ANN}_j(\bm{z}_t)$ in parallel, unions the resulting candidate sets, re-applies scope filters, and returns the global Top-$K$ results under the request budget:

$$R^B_t = \operatorname{TopK}_{x \,\in\, \bigcup_{j \in \mathcal{P}_t} \mathrm{ANN}_j(\bm{z}_t)} \mathrm{score}(\bm{z}_t, x), \qquad |R^B_t| \leq K.$$

This distributed retrieval prevents bottlenecks associated with centralized indexes and enables efficient, cost-controlled evidence selection for downstream agentic tasks.
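The probe-union-merge step can be sketched as follows; each `ann_indexes[j]` is assumed (illustratively) to be a callable returning `(score, item)` pairs for its shard:

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def probe_and_merge(query_vec, probe_set, ann_indexes, scope_ok, K):
    """Query each probed shard's ANN index in parallel, re-apply the
    scope filter to the union of candidates, and keep the global
    Top-K by score (illustrative sketch, not the paper's API)."""
    with ThreadPoolExecutor(max_workers=len(probe_set)) as pool:
        batches = pool.map(lambda j: ann_indexes[j](query_vec), probe_set)
    # Union candidates and re-check eligibility after retrieval.
    candidates = [(s, x) for batch in batches for (s, x) in batch if scope_ok(x)]
    return heapq.nlargest(K, candidates, key=lambda sx: sx[0])
```

Re-applying `scope_ok` after the shard-local searches is a defense-in-depth choice: even if a shard index returns a stale or mis-filed item, it cannot leak past the merge.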

When training data is available, evidence-to-shard supervision $G_t \subseteq \mathcal{S}_t$ is used to optimize the router, concentrating probability mass on "gold" shards via a multi-positive set-likelihood loss:

$$L_{\mathrm{route}}(t) = -\log \sum_{j \in G_t} p_{t,j}, \qquad \min_\theta \; \mathbb{E}_t\!\left[L_{\mathrm{route}}(t)\right].$$

This generalizes cross-entropy (recovering it exactly when $|G_t| = 1$) and directly improves the hit rate of the routing mechanism for both fixed and adaptive strategies (Zhao et al., 29 Jan 2026).
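The set-likelihood loss itself is a one-liner over the router's probabilities; a sketch (function name assumed) for a single training example:

```python
import math

def route_loss(probs, gold):
    """Multi-positive set-likelihood loss: -log sum_{j in G_t} p_{t,j}.

    probs -- {shard_id: p_{t,j}} from the masked softmax
    gold  -- G_t, the shards known to contain the gold evidence
    Reduces to standard cross-entropy when |gold| == 1.
    """
    mass = sum(probs[j] for j in gold)
    return -math.log(mass)
```

Summing over all gold shards before the log means the router is rewarded for placing mass on *any* shard that holds the evidence, rather than being forced to pick one arbitrarily when the evidence is replicated.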

4. Procedural Skill Library and Fallback

Tier C of ShardMemo maintains a library $\mathcal{W}$ of versioned, schema-checked skills, each comprising tool-call templates and associated deterministic validation tests. Retrieval occurs under a schema/tool scope filter $\psi^C_t$ and a tight skill budget $R$:

$$\mathcal{U}_t = \operatorname{TopR}_{s \in \mathcal{W}_t} \mathrm{sim}(q_t, s), \qquad |\mathcal{U}_t| \leq R.$$

Skill execution is attempted with slot filling; if retrieval fails or no applicable skill is found, the request is transparently passed to Tier B for evidence retrieval using the same probe and result budgets. This design ensures robust availability of both procedural and evidentiary retrieval under dynamic, budgeted constraints (Zhao et al., 29 Jan 2026).
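The Tier C control flow, with its fallback to Tier B, can be sketched as follows (all names, including `tier_b_fallback`, are illustrative assumptions):

```python
def retrieve_with_fallback(query, skills, sim, scope_c, R, tier_b_fallback):
    """Tier C sketch: Top-R skills under the scope filter psi^C_t;
    if no eligible skill exists, fall back transparently to Tier B
    evidence retrieval."""
    eligible = [s for s in skills if scope_c(s)]          # psi^C_t mask
    ranked = sorted(eligible, key=lambda s: -sim(query, s))[:R]
    if not ranked:
        return ("evidence", tier_b_fallback(query))       # Tier B fallback
    return ("skills", ranked)
```

Tagging the result with its origin (`"skills"` vs. `"evidence"`) lets the calling agent know whether it received an executable procedure or raw evidence to reason over.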

5. Empirical Results and Performance Analysis

ShardMemo achieves significant improvements over prior memory systems by combining strict eligibility masking, cost-aware MoE gating, and parallel sharded ANN. On the LoCoMo conversational-memory benchmark using GPT-OSS-120B, it outperforms the strongest baseline (GAM) with gains of +5.70 F1 (single-hop), +5.11 F1 (multi-hop), +6.82 F1 (temporal), and +6.03 F1 (open-domain). Under a fixed-probe regime ($B_{\mathrm{probe}} = 3$), ShardMemo lifts ShardHit@3 from 0.67 (cosine-to-prototype) to 0.82, reduces average per-query vector scans from 521 to 414 (−20.5%), and lowers p95 latency from 95 ms to 76 ms (−19.8 ms).

On HotpotQA with large context windows (56K–448K tokens), ShardMemo delivers F1 scores of 63.41/61.88/57.95 across input lengths, consistently outperforming GAM by +1.31, +0.96, and +0.55 F1, respectively. ToolBench benchmarks demonstrate that Tier C skill retrieval yields Precision@3 of 0.97 (+10.2% over embedding-similarity) and StepRed of 1.94 (+7.2%), with mean retrieval latency reduced by ~19% at $R = 3$. Budget sweeps consistently situate ShardMemo on a better accuracy–efficiency trade-off curve than both centralized and naively partitioned retrieval baselines (Zhao et al., 29 Jan 2026).

6. Scope Guarantees, Predictable Budgets, and ShardMemo’s Significance

By implementing mandatory eligibility masking ahead of all semantic scoring or ANN access, ShardMemo enforces strong multi-tenant, schema, and tool-level safety, never exposing out-of-scope data to routing or retrieval logic. Cost-aware gating and strict probe/result caps deliver predictable per-request resource consumption, essential for concurrent, high-volume agentic LLM deployments. The separation of working state, persistent evidence, and procedural skills into dedicated, budgeted tiers enables composability and system reliability as system complexity and parallelism scale. This formalization and empirical performance indicate ShardMemo’s role in advancing scalable, agentic external memory for LLMs beyond the limitations of central indexes and static partitions (Zhao et al., 29 Jan 2026).
