- The paper presents Cartridges, which use a tiny, trainable KV-cache to replicate full in-context learning at 38.6× lower memory cost.
- It employs a Self-Study pipeline that combines chunking with synthetic self-generated conversations to distill global context, improving reasoning-benchmark accuracy by 4–8 points.
- Experimental results on diverse corpora demonstrate that Cartridges match full ICL quality while offering 10–40× improvements in memory efficiency and throughput.
Overview
“Cartridges” introduces a practical alternative to brute-force long-context prompting. Instead of re-feeding a 100 k–500 k token corpus at every request (and paying the KV-cache cost each time), you freeze the LLM and train a tiny KV-cache offline—called a Cartridge—so that the model behaves as if the entire corpus were still in-context. A new self-supervised recipe, Self-Study, makes the cartridge both general-purpose (handles arbitrary downstream queries) and structurally aware (understands ordering and cross-references inside the corpus).
1. Why Cartridges?
| Inference mode | Memory cost | Throughput | Generality | Notes |
| --- | --- | --- | --- | --- |
| Full ICL (entire corpus in prompt) | O(\|C\|) | Low | ✓ | Quality reference; KV-cache grows with corpus length |
| Prompt / KV-cache compression | 1–4× smaller | Medium | Degraded | Quality drops at ≳ 2× compression |
| Cartridge (naïve next-token training) | O(p) | High | ✗ | Memorises the corpus, generalises poorly |
| Cartridge + Self-Study | O(p) | High | ✓ | Matches ICL quality at 38.6× less memory |
Here p ≈ 128–8 192 is the cartridge's cache length (number of trainable KV slots), not a count of vocabulary tokens.
2. Cartridge Parameterisation
- Use prefix-tuning-style KV tensors: Z = {z_k, z_v} with z_k, z_v ∈ ℝ^{L×p×d}, i.e. p trainable key and value vectors per layer.
- Initialise each z_k, z_v with the actual KV vectors of the first p corpus tokens → stabilises training and converges within ~30 min on 8× H100 for Llama-8B.
- Freeze the backbone; only Z is updated (minimal parameterisation sketch below).
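A minimal PyTorch sketch of this parameterisation, assuming illustrative names and shapes (`n_layers`, `p`, `d_kv`, and per-layer `[p, d_kv]` KV tensors); this is a sketch of the idea, not the paper's code:

```python
import torch
import torch.nn as nn

class Cartridge(nn.Module):
    """Trainable KV prefix: p key vectors and p value vectors per layer."""
    def __init__(self, n_layers: int, p: int, d_kv: int):
        super().__init__()
        self.keys = nn.ParameterList([nn.Parameter(torch.zeros(p, d_kv)) for _ in range(n_layers)])
        self.values = nn.ParameterList([nn.Parameter(torch.zeros(p, d_kv)) for _ in range(n_layers)])

    @torch.no_grad()
    def init_from_corpus_kv(self, first_p_kv):
        # first_p_kv: per-layer (k, v) tensors of shape [p, d_kv], computed once by
        # running the frozen model over the first p corpus tokens.
        for layer, (k, v) in enumerate(first_p_kv):
            self.keys[layer].copy_(k)
            self.values[layer].copy_(v)

    def as_kv_cache(self):
        # Served exactly like a p-token prefix KV-cache; only these tensors receive gradients.
        return [(k, v) for k, v in zip(self.keys, self.values)]
```

During training, only `cartridge.parameters()` goes to the optimiser; the backbone's weights are never touched.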
Why not LoRA? On long-context tasks LoRA adapters of the same memory budget reduce out-of-domain accuracy (e.g. −9 points on MMLU) and need special serving infra, whereas a KV blob can be hot-loaded into any inference engine.
3. Self-Study Training Pipeline
```
for each epoch:
    # 1. Pick a random 0.5–4 k-token chunk ĉ from corpus C
    # 2. Sample a seed prompt s (structuring | summarisation | question | use-case | creative)
    # 3. Let the frozen LLM chat with itself about ĉ:
    #        A (user)      : s
    #        B (assistant) : answer
    #    → synthetic conversation conv = A ⊕ B, paired with ĉ
    # 4. Context-distillation (KL) loss over the conversation tokens:
    #        min_Z  Σ_i KL[ P_teacher(· | ĉ ⊕ conv[:i]) || P_Z(· | conv[:i]) ]
```
Key design choices
- Chunking lets you cover 400 k-token corpora with a short-context model and forces the cartridge to learn global rather than local n-gram patterns.
- Diverse seed prompts (5 generic types) push the synthetic questions away from pure memorisation and yield +4–8 points accuracy on reasoning benchmarks.
- Context distillation (matching the teacher's logits) beats plain next-token prediction by +3–9 points depending on the dataset; see the loss sketch at the end of this section.
Compute scale: 1–4 B synthetic tokens (≈ 64 GPU-hours for Llama-8B). Quality keeps rising roughly linearly with optimisation steps (see Fig. 5 in the paper).
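To make step 4 of the pipeline concrete, here is a minimal sketch of the context-distillation objective; the tensor shapes and the `conv_mask` argument are assumptions for illustration rather than the paper's implementation:

```python
import torch.nn.functional as F

def context_distillation_loss(teacher_logits, student_logits, conv_mask, T=1.0):
    """KL(teacher || student), averaged over synthetic-conversation positions.

    teacher_logits: frozen model conditioned on (chunk ĉ + conversation prefix)
    student_logits: same frozen model conditioned on (cartridge Z + conversation prefix)
    conv_mask:      1.0 at conversation positions to distill on, 0.0 elsewhere
    All logits are assumed to have shape [batch, seq, vocab].
    """
    t_logp = F.log_softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)   # per-position KL divergence
    return (kl * conv_mask).sum() / conv_mask.sum()
```

Only the cartridge parameters Z receive gradients; teacher and student share the same frozen backbone and differ only in what sits in front of the conversation.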
4. Serving Workflow
```python
import torch

cartridge = torch.load("pepsi_10k.pt")   # ~120 KB up to ~1 GB on disk, depending on p and model size
model.load_kv_cache(cartridge)           # same API as loading a prefix KV-cache
response = model.generate(user_query)
```
No engineering changes to vLLM / SGLang / TGI. Throughput ≈ that of a p-token prompt irrespective of corpus size: on Llama-8B, 26× more requests per H100 at equal quality vs 128 k-token ICL.
5. Empirical Findings
Datasets (corpus size / task):
- LongHealth: 100 k tokens (clinical QA, multiple choice)
- QASPER: 100 k tokens (scientific-paper QA, free-form)
- MTOB: 484 k tokens (Kalamang→English translation)
Results (Llama-3B unless stated otherwise):
| Method | LongHealth Acc ↑ | QASPER ppl ↓ | MTOB chrF ↑ | KV memory |
| --- | --- | --- | --- | --- |
| ICL (full corpus) | 55.4 | 6.9 | 28.4 | 7.2 GB |
| DuoAttention (2× compression) | 46.0 | 8.1 | 23.1 | 3.6 GB |
| Cartridge + Self-Study (p = 2048) | 55.6 | 6.9 | 28.2 | 0.19 GB |
Additional observations
- Extends context length: Llama-8B (128 k context) with a cartridge trained over the full 484 k-token textbook scores +11 chrF vs. ICL on the first 130 k tokens.
- Composable: concatenating two independently trained cartridges (e.g., AMD + Pepsi 10-Ks) lets the model answer cross-document queries better than truncated-ICL and single-cartridge baselines (see the sketch after this list).
- Robust: freezing the “attention sink” (first token) prevents training collapse; random-token initialisation works but converges slower.
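A minimal sketch of that composition, assuming cartridges are stored as per-layer (key, value) tensor pairs as in the parameterisation sketch above; the file names are illustrative:

```python
import torch

def concat_cartridges(cart_a, cart_b):
    """Compose two cartridges trained for the same frozen backbone by
    concatenating their per-layer KV prefixes along the sequence axis."""
    return [
        (torch.cat([k_a, k_b], dim=0), torch.cat([v_a, v_b], dim=0))
        for (k_a, v_a), (k_b, v_b) in zip(cart_a, cart_b)
    ]

# combined = concat_cartridges(torch.load("amd_10k.pt"), torch.load("pepsi_10k.pt"))
```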
6. Implementation Checklist
| Hyper-parameter | Typical value | Comment |
| --- | --- | --- |
| Cache size p | 512–2 048 | ≈ 0.6 GB for a 3 B model, ≈ 2 GB for an 8 B model |
| Chunk length | 0.5–4 k tokens | sampled uniformly at random |
| Synthetic conversations | 30 k–60 k | ≈ 1–2 B tokens |
| Optimiser | AdamW (β1 = 0.9, β2 = 0.95) | lr 3e-4, 400 warm-up steps |
| Distillation temperature T | 1.0 | applied to teacher and student logits |
| Training time | 25–40 min on 8× H100 (8 B model) | scales linearly with #steps |
Scaling to bigger backbones: memory grows with p·d·L; training FLOPs roughly 1.5× original prefill FLOPs.
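A back-of-the-envelope sketch of that scaling; the layer counts, KV dimensions, and bf16 storage below are illustrative assumptions (check the actual model config), not figures from the paper:

```python
def cartridge_gib(p: int, n_layers: int, d_kv: int, bytes_per_el: int = 2) -> float:
    """Approximate cartridge size: 2 tensors (K and V) per layer, each of shape p x d_kv."""
    return 2 * n_layers * p * d_kv * bytes_per_el / 2**30

# Illustrative 8B-class model with 32 layers, stored in bf16:
print(cartridge_gib(p=2048, n_layers=32, d_kv=1024))   # ~0.25 GiB with a GQA-style KV dim of 1024
print(cartridge_gib(p=2048, n_layers=32, d_kv=4096))   # ~1.0 GiB if the full hidden dim is cached
```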
7. Limitations & Open Problems
- Up-front cost: need idle GPU time to train each cartridge; not yet real-time.
- Cartridge size vs. OOD degradation: very large p (> 8 k) can slightly hurt performance on unrelated tasks.
- Security / freshness: offline “snapshot” must be retrained when corpus updates.
- No formal guarantees for arbitrary reasoning skills; empirical only.
8. Practical Take-aways
- For any repeatedly accessed corpus (codebase, legal archive, patient file), pre-compute a cartridge once and serve it like a prefix → 10-40× memory win.
- If the corpus is larger than your model's context window, chunking + Self-Study lets you leapfrog the limit without model surgery.
- Keep p modest (≤ 2 048 for 7–10 B models); invest compute in synthetic-data steps rather than cache width.
- Use generic seed prompts—you don’t need task-specific engineering—and logit-level distillation for fast convergence.