- The paper presents Cartridges, which use a tiny, trainable KV-cache to replicate full in-context learning at 38.6× lower memory cost.
- It employs a Self-Study pipeline that combines chunking with synthetic self-generated conversations to distill global context, improving reasoning-benchmark accuracy by 4–8 points.
- Experimental results on diverse corpora demonstrate that Cartridges match full ICL quality while offering 10–40× improvements in memory efficiency and throughput.
Overview
“Cartridges” introduces a practical alternative to brute-force long-context prompting. Instead of re-feeding a 100 k–500 k token corpus at every request (and paying the KV-cache cost each time), you freeze the LLM and train a tiny KV-cache offline—called a Cartridge—so that the model behaves as if the entire corpus were still in-context. A new self-supervised recipe, Self-Study, makes the cartridge both general-purpose (handles arbitrary downstream queries) and structurally aware (understands ordering and cross-references inside the corpus).
1. Why Cartridges?
| Inference mode | Memory cost | Throughput | Generality | Notes |
| --- | --- | --- | --- | --- |
| Full ICL (entire corpus in prompt) | O(\|C\|) | Low | ✓ | Quality reference; KV-cache grows with corpus length |
| Prompt / KV-cache compression | 1–4× smaller | Medium | Degraded | Quality drops at ≳ 2× compression |
| Cartridge (naïve next-token training) | O(p) | High | ✗ | Memorises the corpus, generalises poorly |
| Cartridge + Self-Study | O(p) | High | ✓ | Matches ICL quality at 38.6× less memory |
Here p ≈ 128–8 192 is the cartridge's cache length (number of trainable KV slots), not a count of vocabulary tokens.
2. Cartridge Parameterisation
- Use prefix-tuning-style KV tensors: Z = {z_k, z_v} with z_k, z_v ∈ ℝ^{L×p×d}, i.e. p trainable key and value vectors per layer.
- Initialise each z_k, z_v with the actual KV vectors of the first p corpus tokens → stabilises training and converges within ~30 min on 8× H100 for Llama-8B.
- Freeze the backbone; only Z is updated (minimal parameterisation sketch below).
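A minimal PyTorch sketch of this parameterisation, assuming illustrative names and shapes (`n_layers`, `p`, `d_kv`, and per-layer `[p, d_kv]` KV tensors); this is a sketch of the idea, not the paper's code:

```python
import torch
import torch.nn as nn

class Cartridge(nn.Module):
    """Trainable KV prefix: p key vectors and p value vectors per layer."""
    def __init__(self, n_layers: int, p: int, d_kv: int):
        super().__init__()
        self.keys = nn.ParameterList([nn.Parameter(torch.zeros(p, d_kv)) for _ in range(n_layers)])
        self.values = nn.ParameterList([nn.Parameter(torch.zeros(p, d_kv)) for _ in range(n_layers)])

    @torch.no_grad()
    def init_from_corpus_kv(self, first_p_kv):
        # first_p_kv: per-layer (k, v) tensors of shape [p, d_kv], computed once by
        # running the frozen model over the first p corpus tokens.
        for layer, (k, v) in enumerate(first_p_kv):
            self.keys[layer].copy_(k)
            self.values[layer].copy_(v)

    def as_kv_cache(self):
        # Served exactly like a p-token prefix KV-cache; only these tensors receive gradients.
        return [(k, v) for k, v in zip(self.keys, self.values)]
```

During training, only `cartridge.parameters()` goes to the optimiser; the backbone's weights are never touched.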
Why not LoRA? On long-context tasks LoRA adapters of the same memory budget reduce out-of-domain accuracy (e.g. −9 points on MMLU) and need special serving infra, whereas a KV blob can be hot-loaded into any inference engine.
3. Self-Study Training Pipeline
```
for each epoch:
    # 1. Pick a random 0.5–4 k-token chunk ĉ from corpus C
    # 2. Sample a seed prompt s (structuring | summarisation | question | use-case | creative)
    # 3. Let the frozen LLM chat with itself about ĉ:
    #        A (user)      : s
    #        B (assistant) : answer
    #    → synthetic conversation conv = A ⊕ B, paired with ĉ
    # 4. Context-distillation (KL) loss over the conversation tokens:
    #        min_Z  Σ_i KL[ P_teacher(· | ĉ ⊕ conv[:i]) || P_Z(· | conv[:i]) ]
```
Key design choices
- Chunking lets you cover 400 k-token corpora with a short-context model and forces the cartridge to learn global rather than local n-gram patterns.
- Diverse seed prompts (5 generic types) push the synthetic questions away from pure memorisation and yield +4–8 points accuracy on reasoning benchmarks.
- Context distillation (matching the teacher's logits) beats plain next-token prediction by +3–9 points depending on the dataset; see the loss sketch at the end of this section.
Compute scale: 1–4 B synthetic tokens (≈ 64 GPU-hours for Llama-8B). Quality keeps rising roughly linearly with optimisation steps (see Fig. 5 in the paper).
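To make step 4 of the pipeline concrete, here is a minimal sketch of the context-distillation objective; the tensor shapes and the `conv_mask` argument are assumptions for illustration rather than the paper's implementation:

```python
import torch.nn.functional as F

def context_distillation_loss(teacher_logits, student_logits, conv_mask, T=1.0):
    """KL(teacher || student), averaged over synthetic-conversation positions.

    teacher_logits: frozen model conditioned on (chunk ĉ + conversation prefix)
    student_logits: same frozen model conditioned on (cartridge Z + conversation prefix)
    conv_mask:      1.0 at conversation positions to distill on, 0.0 elsewhere
    All logits are assumed to have shape [batch, seq, vocab].
    """
    t_logp = F.log_softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)   # per-position KL divergence
    return (kl * conv_mask).sum() / conv_mask.sum()
```

Only the cartridge parameters Z receive gradients; teacher and student share the same frozen backbone and differ only in what sits in front of the conversation.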
4. Serving Workflow
```python
import torch

cartridge = torch.load("pepsi_10k.pt")   # ~120 KB up to ~1 GB on disk, depending on p and model size
model.load_kv_cache(cartridge)           # same API as loading a prefix KV-cache
response = model.generate(user_query)
```
No engineering changes to vLLM / SGLang / TGI. Throughput ≈ that of a p-token prompt irrespective of corpus size: on Llama-8B, 26× more requests per H100 at equal quality vs 128 k-token ICL.
5. Empirical Findings
Datasets (corpus size / task):
- LongHealth: 100 k tokens (clinical QA, multiple choice)
- QASPER: 100 k tokens (scientific-paper QA, free-form)
- MTOB: 484 k tokens (Kalamang→English translation)
Results (Llama-3B unless stated otherwise):
| Method | LongHealth Acc ↑ | QASPER ppl ↓ | MTOB chrF ↑ | KV memory |
| --- | --- | --- | --- | --- |
| ICL (full corpus) | 55.4 | 6.9 | 28.4 | 7.2 GB |
| DuoAttention (2× compression) | 46.0 | 8.1 | 23.1 | 3.6 GB |
| Cartridge + Self-Study (p = 2048) | 55.6 | 6.9 | 28.2 | 0.19 GB |
Additional observations
- Extends context length: Llama-8B (128 k context) with a cartridge trained over the full 484 k-token textbook scores +11 chrF vs. ICL on the first 130 k tokens.
- Composable: concatenating two independently trained cartridges (e.g., AMD + Pepsi 10-Ks) lets the model answer cross-document queries better than truncated-ICL and single-cartridge baselines (see the sketch after this list).
- Robust: freezing the “attention sink” (first token) prevents training collapse; random-token initialisation works but converges slower.
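A minimal sketch of that composition, assuming cartridges are stored as per-layer (key, value) tensor pairs as in the parameterisation sketch above; the file names are illustrative:

```python
import torch

def concat_cartridges(cart_a, cart_b):
    """Compose two cartridges trained for the same frozen backbone by
    concatenating their per-layer KV prefixes along the sequence axis."""
    return [
        (torch.cat([k_a, k_b], dim=0), torch.cat([v_a, v_b], dim=0))
        for (k_a, v_a), (k_b, v_b) in zip(cart_a, cart_b)
    ]

# combined = concat_cartridges(torch.load("amd_10k.pt"), torch.load("pepsi_10k.pt"))
```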
6. Implementation Checklist
| Hyper-parameter | Typical value | Comment |
| --- | --- | --- |
| Cache size p | 512–2 048 | ≈ 0.6 GB for a 3 B model, ≈ 2 GB for an 8 B model |
| Chunk length | 0.5–4 k tokens | sampled uniformly at random |
| Synthetic conversations | 30 k–60 k | ≈ 1–2 B tokens |
| Optimiser | AdamW (β1 = 0.9, β2 = 0.95) | lr 3e-4, 400 warm-up steps |
| Distillation temperature T | 1.0 | applied to teacher and student logits |
| Training time | 25–40 min on 8× H100 (8 B model) | scales linearly with #steps |
Scaling to bigger backbones: memory grows with p·d·L; training FLOPs roughly 1.5× original prefill FLOPs.
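A back-of-the-envelope sketch of that scaling; the layer counts, KV dimensions, and bf16 storage below are illustrative assumptions (check the actual model config), not figures from the paper:

```python
def cartridge_gib(p: int, n_layers: int, d_kv: int, bytes_per_el: int = 2) -> float:
    """Approximate cartridge size: 2 tensors (K and V) per layer, each of shape p x d_kv."""
    return 2 * n_layers * p * d_kv * bytes_per_el / 2**30

# Illustrative 8B-class model with 32 layers, stored in bf16:
print(cartridge_gib(p=2048, n_layers=32, d_kv=1024))   # ~0.25 GiB with a GQA-style KV dim of 1024
print(cartridge_gib(p=2048, n_layers=32, d_kv=4096))   # ~1.0 GiB if the full hidden dim is cached
```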
7. Limitations & Open Problems
- Up-front cost: need idle GPU time to train each cartridge; not yet real-time.
- Cartridge size vs. OOD degradation: very large p (> 8 k) can slightly hurt performance on unrelated tasks.
- Security / freshness: offline “snapshot” must be retrained when corpus updates.
- No formal guarantees for arbitrary reasoning skills; empirical only.
8. Practical Take-aways
- For any repeatedly accessed corpus (codebase, legal archive, patient file), pre-compute a cartridge once and serve it like a prefix → 10-40× memory win.
- If the corpus is larger than your model's context window, chunking + Self-Study lets you leapfrog the limit without model surgery.
- Keep p modest (≤ 2 048 for 7–10 B models); invest compute in synthetic-data steps rather than cache width.
- Use generic seed prompts—you don’t need task-specific engineering—and logit-level distillation for fast convergence.