Cartridges: Lightweight and general-purpose long context representations via self-study (2506.06266v3)

Published 6 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.

Summary

  • The paper presents Cartridges, which use a tiny, trainable KV-cache to replicate full in-context learning at 38.6× lower memory cost.
  • It employs a self-study pipeline that integrates chunking and synthetic conversations to distill global context and improve reasoning benchmarks by 4–8 points.
  • Experimental results on diverse corpora demonstrate that Cartridges match full ICL quality while offering 10–40× improvements in memory efficiency and throughput.

Overview

“Cartridges” introduces a practical alternative to brute-force long-context prompting. Instead of re-feeding a 100 k–500 k token corpus at every request (and paying the KV-cache cost each time), you freeze the LLM and train a tiny KV-cache offline—called a Cartridge—so that the model behaves as if the entire corpus were still in-context. A new self-supervised recipe, Self-Study, makes the cartridge both general-purpose (handles arbitrary downstream queries) and structurally aware (understands ordering and cross-references inside the corpus).


1. Why Cartridges?

| Inference mode | Memory cost | Throughput | Generality | Notes |
|---|---|---|---|---|
| Full ICL (entire corpus in prompt) | O(\|C\|) | Low | Full (baseline) | |
| Prompt / KV-cache compression | 1–4× smaller | Medium | Degraded | Quality drops beyond ≈2× compression |
| Cartridge (naïve next-token prediction) | O(p) | High | Poor | Memorises the corpus, generalises poorly |
| Cartridge + Self-Study | O(p) | High | Matches ICL | 38.6× less memory at equal quality |

p ≈ 128–8192 (the cache size in KV slots, not vocabulary tokens).


2. Cartridge Parameterisation

  • Use prefix-tuning-style KV tensors: Z = {z_k[l, 1..p], z_v[l, 1..p]} with z_k, z_v ∈ ℝ^{L×p×d}, i.e. p trainable key and value vectors per layer l.
  • Initialise z_k and z_v with the actual KV vectors of the first p corpus tokens → stabilises training and converges within ~30 min on 8× H100 for Llama-8B.
  • Freeze the backbone; only Z is updated.

Why not LoRA? On long-context tasks LoRA adapters of the same memory budget reduce out-of-domain accuracy (e.g. −9 points on MMLU) and need special serving infra, whereas a KV blob can be hot-loaded into any inference engine.
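
A minimal PyTorch sketch of this parameterisation (names are illustrative; it assumes a flattened per-layer KV dimension d_kv, whereas real caches are shaped per attention head):

import torch
import torch.nn as nn

class Cartridge(nn.Module):
    # Trainable KV prefix: p key and value vectors for each of num_layers layers.
    # The backbone model stays frozen; only z_k and z_v are optimised.
    def __init__(self, num_layers: int, p: int, d_kv: int, init_kv=None):
        super().__init__()
        if init_kv is not None:
            # init_kv = (keys, values), each of shape (num_layers, p, d_kv), taken from
            # the frozen model's KV cache over the first p corpus tokens
            # (per the paper, this initialisation stabilises training).
            k0, v0 = init_kv
            self.z_k = nn.Parameter(k0.clone())
            self.z_v = nn.Parameter(v0.clone())
        else:
            self.z_k = nn.Parameter(0.02 * torch.randn(num_layers, p, d_kv))
            self.z_v = nn.Parameter(0.02 * torch.randn(num_layers, p, d_kv))

    def as_prefix_cache(self):
        # Per-layer (key, value) pairs, usable wherever a prefix KV cache is expected.
        return [(self.z_k[l], self.z_v[l]) for l in range(self.z_k.shape[0])]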


3. Self-Study Training Pipeline

for epoch in range(num_epochs):
    # 1. Pick a random 0.5–4k-token chunk ĉ from the corpus C.
    # 2. Sample a seed prompt s from one of five generic types
    #    (structuring | summarisation | question | use-case | creative).
    # 3. Let the frozen LLM chat with itself about ĉ, seeded by s:
    #       A (user):      s
    #       B (assistant): answer
    #    -> synthetic conversation (ĉ, A, B)
    # 4. Update the Cartridge Z with the KL context-distillation loss:
    #    min_Z  Σ_i  KL[ P_teacher(· | ĉ ⊕ A[:i])  ||  P_student_Z(· | A[:i]) ]
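
The distillation step can be written directly in PyTorch: the teacher distribution comes from the frozen model conditioned on the chunk plus the conversation, the student distribution from the same frozen model conditioned on the Cartridge plus the conversation, and the loss is the KL between them at each distilled position. This is a sketch; the tensor shapes and masking convention are assumptions.

import torch
import torch.nn.functional as F

def context_distillation_loss(teacher_logits, student_logits, loss_mask, temperature=1.0):
    # teacher_logits, student_logits: (batch, seq_len, vocab) over the synthetic conversation.
    # loss_mask: 1.0 at positions whose next-token distribution is distilled, else 0.0.
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), summed over the vocabulary at each position.
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    return (kl * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)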

Key design choices

  1. Chunking lets you cover 400k-token corpora with a short-context model and forces the cartridge to learn global rather than local n-gram patterns (a sampling sketch follows this list).
  2. Diverse seed prompts (5 generic types) push the synthetic questions away from pure memorisation and yield +4–8 points accuracy on reasoning benchmarks.
  3. Context distillation (matching the teacher logits) beats plain next-token prediction by +3–9 points depending on dataset.
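
Design choices 1 and 2 reduce to a simple sampling routine; a sketch with illustrative defaults (the paper's exact seed-prompt wording is not reproduced here):

import random

SEED_PROMPT_TYPES = ["structuring", "summarisation", "question", "use-case", "creative"]

def sample_self_study_input(corpus_tokens, min_len=512, max_len=4096):
    # Draw a random 0.5-4k-token chunk and pick one of the five generic seed-prompt types.
    length = random.randint(min_len, min(max_len, len(corpus_tokens)))
    start = random.randint(0, len(corpus_tokens) - length)
    chunk = corpus_tokens[start:start + length]
    return chunk, random.choice(SEED_PROMPT_TYPES)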

Compute scale: 1–4 B synthetic tokens (roughly 64 GPU-hours for Llama-8B). Quality keeps rising linearly with optimisation steps (see Fig. 5 in the paper).


4. Serving Workflow

cartridge = torch.load("pepsi_10k.pt")          # 120-512 KB to ~1 GB
model.load_kv_cache(cartridge)                  # same API as prefix KV
response = model.generate(user_query)

No engineering changes to vLLM / SGLang / TGI are required. Throughput is roughly that of a p-token prompt irrespective of corpus size: on Llama-8B, ~26× more requests per H100 at equal quality vs. 128k-token ICL.


5. Empirical Findings

Datasets (corpus size / task):

  • LongHealth — 100k tokens (clinical QA, multiple choice)
  • QASPER — 100k tokens (paper QA, free-form)
  • MTOB — 484k tokens (Kalamang→English translation)

Results (Llama-3B unless stated):

| Method | LongHealth Acc | QASPER ppl (↓) | MTOB chrF | KV Mem |
|---|---|---|---|---|
| ICL (full) | 55.4 | 6.9 | 28.4 | 7.2 GB |
| DuoAttention (2× compression) | 46.0 | 8.1 | 23.1 | 3.6 GB |
| Cartridge + Self-Study (p=2048) | 55.6 | 6.9 | 28.2 | 0.19 GB |

Additional observations

  • Extends context length: Llama-8B (128k context) with a cartridge trained over the full 484k-token textbook scores +11 chrF vs. ICL truncated to the first 130k tokens.
  • Composable: concatenating two independently trained cartridges (e.g., AMD + Pepsi 10-Ks) lets the model answer cross-document queries better than truncated-ICL and single-cartridge baselines (a composition sketch follows this list).
  • Robust: freezing the “attention sink” (first token) prevents training collapse; random-token initialisation works but converges slower.
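
Composition amounts to concatenating the trained KV prefixes along the slot axis; a sketch assuming the Cartridge parameterisation from Section 2 and two cartridges trained for the same backbone:

import torch

def compose_cartridges(cart_a, cart_b):
    # Concatenate two independently trained cartridges along the prefix (slot) axis,
    # yielding p_a + p_b KV vectors per layer; the paper reports no retraining is needed.
    z_k = torch.cat([cart_a.z_k, cart_b.z_k], dim=1)   # (num_layers, p_a + p_b, d_kv)
    z_v = torch.cat([cart_a.z_v, cart_b.z_v], dim=1)
    return z_k, z_v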

6. Implementation Checklist

| Hyper-parameter | Typical value | Comment |
|---|---|---|
| Cache size p | 512–2048 | ≈0.6 GB for a 3B model, ≈2 GB for 8B |
| Chunk length | 0.5–4k tokens | sampled uniformly at random |
| Synthetic conversations | 30k–60k | ≈1–2 B tokens |
| Optimiser | AdamW (β1 = 0.9, β2 = 0.95) | lr 3e-4, 400 warm-up steps |
| Distillation temperature T | 1.0 | logit temperature |
| Training time | 25–40 min on 8× H100 (8B model) | linear in number of steps |
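
The optimiser row maps directly onto PyTorch (a sketch assuming the Cartridge module from Section 2; the illustrative dimensions and the linear warm-up schedule go beyond what the checklist specifies):

import torch

cartridge = Cartridge(num_layers=32, p=2048, d_kv=4096)  # Section 2 sketch; dims are illustrative
optimiser = torch.optim.AdamW(cartridge.parameters(), lr=3e-4, betas=(0.9, 0.95))
warmup = torch.optim.lr_scheduler.LambdaLR(optimiser, lambda step: min(1.0, (step + 1) / 400))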

Scaling to bigger backbones: memory grows with p·d·L; training FLOPs roughly 1.5× original prefill FLOPs.
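
As a quick sanity check on the p·d·L scaling (the dimensions below are assumed Llama-8B-like values; real sizes also depend on numeric precision and on whether the cache is stored at the per-KV-head dimension):

def cartridge_memory_bytes(num_layers, p, d_kv, bytes_per_param=2):
    # Two tensors (keys and values), each of shape (num_layers, p, d_kv).
    return 2 * num_layers * p * d_kv * bytes_per_param

print(cartridge_memory_bytes(num_layers=32, p=2048, d_kv=4096) / 1e9)  # ≈ 1.07 GB in fp16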


7. Limitations & Open Problems

  1. Up-front cost: need idle GPU time to train each cartridge; not yet real-time.
  2. Cartridge size vs OOD degradation: very large p (>8 k) can slightly hurt unrelated tasks.
  3. Security / freshness: offline “snapshot” must be retrained when corpus updates.
  4. No formal guarantees for arbitrary reasoning skills; empirical only.

8. Practical Take-aways

  • For any repeatedly accessed corpus (codebase, legal archive, patient file), pre-compute a cartridge once and serve it like a prefix → 10-40× memory win.
  • If the corpus is larger than your model’s context, chunking + self-study lets you leapfrog the limit without model surgery.
  • Keep p modest (≤ 2048 for 7-10 B models); invest compute in synthetic-data steps rather than cache width.
  • Use generic seed prompts—you don’t need task-specific engineering—and logit-level distillation for fast convergence.