Question-Asking Compression (QA)
- Question-Asking Compression (QA) is an interactive lossy compression method for LLM outputs that uses iterative binary yes/no queries to refine answers.
- It reduces the communication cost by transmitting only the bits that resolve the small model’s uncertainty, achieving compression ratios as low as 0.0006.
- QA recovers up to 72% of the capability gap between smaller and larger models across benchmarks in math, science, and code.
Question-Asking Compression (QA) is an interactive lossy compression protocol for LLM outputs in which a small LLM does not receive a full answer from a stronger model, but instead iteratively asks binary yes/no questions and revises its own answer from the returned bits. In the formulation studied in “Less is More: Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains,” the protocol is framed as an alternative to transmitting long responses: the small model needs only the specific bits of guidance that resolve its uncertainty, not the entire surface form of the stronger model’s completion. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, with compression ratios of 0.0006 to 0.004 (Rinberg et al., 9 Feb 2026).
1. Concept and scope
QA is defined as an interactive lossy compression scheme for LLM outputs rather than as a conventional prompt or context compressor. Traditional compression asks how to encode the exact text as compactly as possible; QA asks how to transmit enough information for the smaller model to produce a useful, correct final answer even if the exact wording changes. The paper motivates this shift by arguing that LLM outputs are often semantically redundant and that interaction can be more efficient than one-shot transmission because communication is spent only on the model’s actual uncertainty (Rinberg et al., 9 Feb 2026).
This makes QA distinct from the broader literature on compression for question answering. In that adjacent literature, compression usually targets retrieved documents, prompts, KV caches, visual tokens, or long-horizon memory. Query-Guided Compressor (QGC), for example, argues that context compression for QA cannot be query-agnostic at high compression ratios and uses a query-guided context encoder, query-guided pooling, a query-document reviewing layer, and a dynamic compression strategy (Cao et al., 2024). CompAct treats compression as an active sequence of updates over retrieved segments rather than a single-step filter (Yoon et al., 2024). QG-VTC compresses visual tokens in a question-conditioned way for VQA (Li et al., 1 Apr 2025). Imprint formulates long-horizon egocentric memory as an online memory compression problem rather than a text summarization problem (Das et al., 1 Jul 2026). This suggests that “QA compression” is an overloaded label: in (Rinberg et al., 9 Feb 2026) it denotes question-asking compression, whereas in much of the surrounding literature it denotes compression mechanisms designed to preserve answer-bearing evidence for downstream question answering.
2. Protocol and formalization
The QA protocol in (Rinberg et al., 9 Feb 2026) is explicitly interactive. Let $x$ be the problem or prompt, $\mathrm{SLM}$ the small LLM, $\mathrm{LLM}$ the stronger model, and $N$ the question budget. The small model first produces an initial answer
$$a_0 = \mathrm{SLM}(x).$$
For each round $i = 1, \dots, N$, it generates a binary yes/no question
$$q_i = \mathrm{SLM}(x, a_{i-1}),$$
the stronger model returns
$$b_i = \mathrm{LLM}(x, q_i) \in \{\text{Yes}, \text{No}\},$$
and the small model updates its answer by
$$a_i = \mathrm{SLM}(x, a_{i-1}, q_i, b_i).$$
The central accounting claim is that each answer carries exactly one bit because the stronger model is constrained to respond only with Yes or No. The paper further states that the SLM’s questions are deterministic given the prompt, its current state, and fixed hyperparameters, so the only information that must actually be transmitted is the binary answer sequence $b_1, \dots, b_N$. Under this accounting, the communication cost is simply
$$C_{\mathrm{QA}} = N \text{ bits}.$$
A refinement noted in the paper is that if the SLM can predict the answer distribution well, the expected cost per question can be less than 1 bit via arithmetic coding of the binary answers, with expected cost $H(b_i \mid q_i, b_{<i}) \le 1$ bit per answer, although the main experiments use the simpler one-bit-per-answer framing (Rinberg et al., 9 Feb 2026).
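The sub-one-bit refinement can be illustrated with the binary entropy function, which is the expected number of bits an ideal arithmetic coder spends on a yes/no answer whose probability is predicted accurately. The sketch below is illustrative only: the function names and the example probabilities are assumptions, not values from the paper.

```python
import math
from typing import Iterable


def binary_entropy(p_yes: float) -> float:
    """Expected bits to code one yes/no answer when P(Yes) = p_yes is predicted correctly."""
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0  # a certain answer costs no bits
    return -(p_yes * math.log2(p_yes) + (1.0 - p_yes) * math.log2(1.0 - p_yes))


def expected_total_bits(yes_probabilities: Iterable[float]) -> float:
    """Sum of per-answer entropies: the expected arithmetic-coding cost of the answer sequence."""
    return sum(binary_entropy(p) for p in yes_probabilities)


# Hypothetical predictions: ten maximally uncertain questions vs. ten fairly predictable ones.
uncertain = expected_total_bits([0.5] * 10)
predictable = expected_total_bits([0.9] * 10)
```

With maximally uncertain answers the cost equals the one-bit-per-answer accounting; any predictability in the answers lowers it.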
Conceptually, the protocol resembles “Twenty Questions.” The smaller model proposes a partial solution, identifies uncertainty in that current solution, and asks targeted yes/no questions. Because later questions depend on earlier answers, the exchange is adaptive rather than static. The paper explicitly connects this to the intuition that adaptively chosen questions can extract the right information with far fewer bits than a single non-interactive message (Rinberg et al., 9 Feb 2026).
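The control flow of the protocol can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the four callables (`initial_answer`, `ask`, `oracle`, `revise`) are hypothetical stand-ins for model calls, and the toy instance replaces both models with a number-guessing game so that the code runs without any LLM.

```python
from typing import Any, Callable, List, Tuple


def qa_protocol(
    prompt: Any,
    initial_answer: Callable[[Any], Any],
    ask: Callable[[Any, Any], Any],
    oracle: Callable[[Any, Any], bool],
    revise: Callable[[Any, Any, Any, bool], Any],
    budget: int = 10,
) -> Tuple[Any, List[bool]]:
    """Run `budget` rounds of question, one-bit answer, and revision."""
    answer = initial_answer(prompt)
    bits: List[bool] = []
    for _ in range(budget):
        question = ask(prompt, answer)                  # small model picks a yes/no question
        bit = oracle(prompt, question)                  # strong model replies with one bit
        bits.append(bit)
        answer = revise(prompt, answer, question, bit)  # small model updates its answer
    return answer, bits


# Toy instance (hypothetical): the "strong model" knows a secret integer in [0, 1023];
# the "small model" keeps an interval of candidates and halves it with each answer.
SECRET = 700

final_answer, transmitted = qa_protocol(
    prompt=None,
    initial_answer=lambda prompt: (0, 1023),
    ask=lambda prompt, ans: (ans[0] + ans[1]) // 2,     # "is the value greater than mid?"
    oracle=lambda prompt, mid: SECRET > mid,
    revise=lambda prompt, ans, mid, bit: (mid + 1, ans[1]) if bit else (ans[0], mid),
    budget=10,
)
```

In the toy game ten adaptive bits single out one of 1024 candidates, which is the Twenty Questions intuition in miniature; real questions asked by an LLM are not guaranteed to split the hypothesis space this evenly.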
3. Information-theoretic framing and evaluation
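The paper situates QA within a broader compression-compute perspective. For ordinary lossless compression it recalls the standard quantities
$$H(P) = -\sum_{y} P(y) \log_2 P(y)$$
and
$$H(P, Q) = -\sum_{y} P(y) \log_2 Q(y),$$
with mismatch penalty
$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P),$$
where $P$ is the source distribution and $Q$ is the model used for coding. For arithmetic coding of token sequences, the code length is approximated by
$$L(y_{1:T}) \approx -\sum_{t=1}^{T} \log_2 Q(y_t \mid y_{<t}).$$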
QA is presented as a lossy, interactive analog: the objective is not exact reconstruction of the stronger model’s response text, but preservation of enough semantic information for the small model’s final answer to improve (Rinberg et al., 9 Feb 2026).
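The lossless quantities referred to above (entropy, cross-entropy, the mismatch penalty, and the arithmetic-coding code length) can be computed directly for small discrete distributions. The following is an illustrative sketch; the function names and example distributions are assumptions chosen for readability.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Entropy of P in bits: the lower bound on expected lossless code length."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Expected bits per symbol when data from P is coded with model Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def mismatch_penalty(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence D(P || Q): extra bits paid for coding with the wrong model."""
    return cross_entropy(p, q) - entropy(p)


def code_length_bits(token_probs: Sequence[float]) -> float:
    """Approximate arithmetic-coding length given the model's probability for each observed token."""
    return -sum(math.log2(q) for q in token_probs)


true_dist = [0.5, 0.5]
model_dist = [0.25, 0.75]
penalty = mismatch_penalty(true_dist, model_dist)
length = code_length_bits([0.5, 0.25, 0.25])  # costs 1 + 2 + 2 bits
```

A better predictive model assigns higher probability to the observed tokens and therefore yields a shorter code, which is why stronger LLMs make stronger lossless compressors.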
Two evaluation quantities are central. The first is the recovery rate, defined in prose as the fraction of problems that become correct after QA among those that the small model initially gets wrong. The paper reports recovery separately on Medium, Hard, Very Hard, and all non-easy problems. The second is the compression ratio,
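$$\mathrm{CR} = \frac{\text{compressed size (bits)}}{\text{uncompressed size (bits)}}.$$
For QA, the compressed side is just the 10 binary answers, i.e. 10 bits. The uncompressed side is the Opus response length in bits, measured as its token count $T_{\mathrm{Opus}}$ under the cl100k_base tokenizer times a fixed per-token bit cost $\beta$, giving
$$\mathrm{CR}_{\mathrm{QA}} = \frac{10}{\beta \, T_{\mathrm{Opus}}}.$$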
This definition yields the extremely small reported ratios because the denominator is the full strong-model response while the numerator is only the sequence of binary answers (Rinberg et al., 9 Feb 2026).
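Both evaluation quantities are simple to compute once per-problem correctness and response lengths are available. The sketch below follows the prose definitions above; the function names, the hypothetical 1,000-token response, and the per-token bit cost (log2 of a roughly 100k-entry vocabulary) are illustrative assumptions, not figures taken from the paper.

```python
import math
from typing import Sequence


def recovery_rate(initially_correct: Sequence[bool], finally_correct: Sequence[bool]) -> float:
    """Fraction of initially wrong problems that are correct after QA."""
    if len(initially_correct) != len(finally_correct):
        raise ValueError("both sequences must cover the same problems")
    after = [final for initial, final in zip(initially_correct, finally_correct) if not initial]
    if not after:
        raise ValueError("no initially wrong problems to recover")
    return sum(after) / len(after)


def compression_ratio(compressed_bits: float, response_tokens: int, bits_per_token: float) -> float:
    """Compressed size over uncompressed size, both in bits."""
    return compressed_bits / (response_tokens * bits_per_token)


# Assumption for illustration only: about log2(100,000) bits per token.
ASSUMED_BITS_PER_TOKEN = math.log2(100_000)

example_recovery = recovery_rate(
    initially_correct=[True, False, False, False],
    finally_correct=[True, True, False, True],
)
example_ratio = compression_ratio(10, 1_000, ASSUMED_BITS_PER_TOKEN)
```

Because the numerator is fixed at 10 bits, the ratio shrinks in direct proportion to the length of the strong model's response, so benchmarks that elicit long solutions produce the smallest ratios.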
4. Experimental setting and empirical results
The QA experiments use Anthropic Claude-family models with Haiku as the SLM, Opus as the stronger model, and again Haiku as the final solver. The core protocol is therefore described as
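$$\text{QA}: \quad \text{Haiku (asks)} \rightarrow \text{Opus (answers)} \rightarrow \text{Haiku (solves)}.$$
The paper also studies BL-CoT, written as
$$\text{BL-CoT}: \quad \text{Haiku (asks)} \rightarrow \text{Haiku (answers)} \rightarrow \text{Haiku (solves)},$$
to isolate iterative self-refinement without transfer from a larger model, and a QA variant with Opus as the questioner,
$$\text{Opus (asks)} \rightarrow \text{Opus (answers)} \rightarrow \text{Haiku (solves)},$$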
to test whether stronger question generation improves recovery. The 8 benchmarks are GSM8K, MATH (Algebra), MATH (Geometry), MATH (Number Theory), GPQA (MC), MBPP, AIME, and HLE (Rinberg et al., 9 Feb 2026).
The headline empirical result is that with only 10 binary questions, QA recovers 23% to 72% of the capability gap between Haiku and Opus on standard benchmarks and 7% to 38% on harder benchmarks. Reported compression ratios range from 0.0006 to 0.0037, with AIME the best case at 0.0006. The paper highlights that this is more than 100× smaller than prior LLM-based compression methods because QA is not attempting to transmit the full response text at all (Rinberg et al., 9 Feb 2026).
Performance is task-dependent. The detailed discussion states that QA improves recovery substantially on GSM8K, is especially strong on the MATH Number Theory medium subset, gives some gains on MBPP and HLE, and helps little on AIME, which is presented as the main exception. The paper also compares QA to earlier LLM-based compression baselines and reports prior SOTA LLM-based lossless compression at roughly 0.08 on enwik9, its own domain-adapted LoRA lossless method at about 0.03, and QA at 0.0006–0.004, underscoring how radically the communication budget changes once the goal shifts from full-response transmission to bit-level interactive guidance (Rinberg et al., 9 Feb 2026).
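The size of the gap between these ratios is easy to check by hand. The short calculation below uses only the figures quoted above; the variable names are illustrative, and the comparison should be read with the caveat that the baselines are lossless text compressors while QA is lossy.

```python
# Compression ratios quoted in the text (lower is better).
PRIOR_LOSSLESS_SOTA = 0.08   # prior LLM-based lossless compression on enwik9
LORA_LOSSLESS = 0.03         # domain-adapted LoRA lossless method
QA_BEST, QA_WORST = 0.0006, 0.004


def reduction_factor(baseline_ratio: float, new_ratio: float) -> float:
    """How many times smaller the new ratio is than the baseline ratio."""
    return baseline_ratio / new_ratio


best_vs_sota = reduction_factor(PRIOR_LOSSLESS_SOTA, QA_BEST)    # roughly 133x
best_vs_lora = reduction_factor(LORA_LOSSLESS, QA_BEST)          # roughly 50x
worst_vs_sota = reduction_factor(PRIOR_LOSSLESS_SOTA, QA_WORST)  # roughly 20x
```

On these figures the "more than 100×" gap corresponds to QA's best case measured against the prior lossless state of the art; against the LoRA method or at QA's largest ratio the factor is smaller, though still at least an order of magnitude.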
5. Relation to other question-conditioned compression paradigms
Question-Asking Compression is best understood as one member of a larger family of question-conditioned compression methods, but it operates at a different layer of the pipeline than most existing work.
| Paradigm | Representative method | Core mechanism |
|---|---|---|
| Interactive response compression | QA (Rinberg et al., 9 Feb 2026) | SLM asks binary questions; LLM returns yes/no bits |
| Query-aware context compression | QGC (Cao et al., 2024) | query-guided encoding, pooling, reviewing, dynamic compression |
| Active retrieved-document compression | CompAct (Yoon et al., 2024) | iterative segment updates with [COMPLETE]/[INCOMPLETE] |
| Question-guided visual compression | QG-VTC (Li et al., 1 Apr 2025) | question-conditioned visual token selection and recycling |
| Retrieval-oriented memory compression | Imprint (Das et al., 1 Jul 2026) | interaction-centric online memory compression |
QGC provides a useful contrast. It is a white-box compressor with a frozen LLM that jointly encodes the query and each retrieved document, pools $n$-grams by relevance to the mean query representation, refines them with a reviewing layer, and allocates compression dynamically by document relevance. On LongChat-13B it reaches 69.19 Acc / 15.2x CR on NaturalQuestions, 57.72 EM / 7.9x CR on TriviaQA, and 52.12 F1 / 8.8x CR on HotpotQA, whereas LongLLMLingua reports 67.01 Acc / 4.1x CR, 51.51 EM / 3.7x CR, and 45.43 F1 / 3.8x CR respectively (Cao et al., 2024). The mechanism here is still one-shot delivery of compressed evidence to a reader, not interactive transfer of incremental corrective bits.
CompAct is closer in spirit to QA because it is also iterative, but the object being compressed is different. It actively updates a compressed context $C_t$ from the question $q$, the current segment $S_t$, and the previous compressed context $C_{t-1}$, while also generating an evaluation signal $E_t$ containing [COMPLETE] or [INCOMPLETE]. In the main setting it uses the top 30 retrieved documents, 5 documents per segment, and at most 6 iterations, and on HotpotQA it reports 47.6x compression with 35.5 EM / 46.9 F1, compared with 34.3x and 29.7 EM / 39.9 F1 for RECOMP and 3.4x and 25.6 EM / 35.3 F1 for LongLLMLingua (Yoon et al., 2024). Its novelty lies in active evidence accumulation over retrieved documents, whereas QA compresses the stronger model’s guidance rather than retrieved text.
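The iterative update described for CompAct can be sketched as a loop over fixed-size document segments that stops once the evaluation signal reports completion. This is a schematic sketch, not CompAct's actual code: `compress_step` is a hypothetical stand-in for the compressor model, and `toy_step` is a keyword filter used only so that the example runs.

```python
from typing import Callable, List, Tuple

StepFn = Callable[[str, List[str], str], Tuple[str, str]]


def iterative_compress(
    question: str,
    documents: List[str],
    compress_step: StepFn,
    docs_per_segment: int = 5,
    max_iterations: int = 6,
) -> str:
    """Fold retrieved documents into a compressed context, segment by segment."""
    compressed = ""
    for t in range(max_iterations):
        segment = documents[t * docs_per_segment:(t + 1) * docs_per_segment]
        if not segment:
            break  # ran out of retrieved documents
        compressed, signal = compress_step(question, segment, compressed)
        if signal == "[COMPLETE]":
            break  # enough evidence has been gathered; stop early
    return compressed


calls = []


def toy_step(question: str, segment: List[str], previous: str) -> Tuple[str, str]:
    """Hypothetical compressor: keep documents mentioning 'France', stop once 'capital' is seen."""
    calls.append(len(segment))
    kept = [doc for doc in segment if "France" in doc]
    updated = " ".join(part for part in [previous] + kept if part)
    signal = "[COMPLETE]" if "capital" in updated else "[INCOMPLETE]"
    return updated, signal


docs = [f"filler document {i}" for i in range(30)]
docs[7] = "Paris is the capital of France."
result = iterative_compress("What is the capital of France?", docs, toy_step)
```

With 30 documents and 5 per segment, six iterations cover the whole retrieved set, and early termination means easy questions consume fewer segments than hard ones.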
Other systems extend the same question-conditioned intuition to different modalities and memory forms. QG-VTC keeps the question as an explicit retrieval signal over visual tokens and reports that performance remains above 94.3% of the original on average at 72 tokens, which is 1/8 of the original visual tokens, while average inference computational load drops to about 30% of baseline (Li et al., 1 Apr 2025). Imprint, by contrast, reframes long-horizon egocentric question answering as online memory compression and reports improvement from 31.0% to 35.8% QA accuracy, grounded accuracy from 10.8% to 64.8%, memory footprint reduction from 267 MB for all interaction records to 109 MB, and retrieval latency from 20.1 s/query to 1.7 s/query (Das et al., 1 Jul 2026). A plausible implication is that question-conditioning is emerging as a general design principle, but QA in (Rinberg et al., 9 Feb 2026) remains distinctive because the compressed object is not context, memory, or KV state; it is the inter-model communication itself.
6. Limitations, misconceptions, and significance
Several limitations are explicit. A major bottleneck is the SLM’s ability to ask useful questions: if the small model cannot identify its own blind spots, even perfect yes/no answers will not help much. The variant in which Opus asks the questions often improves performance, especially on harder tasks, which the paper uses as evidence that question quality is a key limiting factor. The method is also evaluated on tasks where correctness is relatively unambiguous; the paper states that it is not yet validated on open-ended tasks like creative writing or advice, where yes/no evaluation is much less well-defined. Finally, the protocol is bounded by the stronger model’s own competence: on very hard tasks, wrong or misleading yes/no answers can actively hurt the smaller model (Rinberg et al., 9 Feb 2026).
The paper is also explicit about negative results. Variants with early stopping by a judge did not beat the simple batch protocol, often because the judge stopped too early on easy tasks or failed to prevent regression on hard tasks. Increasing the budget from 10 to 100 questions produced gains that were described as small and inconsistent, not worth the 10× increase in bits. Regression on problems the small model already solved is reported as modest on standard benchmarks and roughly comparable to self-re-evaluation noise, while on frontier benchmarks such as AIME and HLE the regression appears severe but is attributed in large part to Haiku’s own inconsistency rather than to the QA protocol itself (Rinberg et al., 9 Feb 2026).
A common misconception is to treat QA as merely a constrained chain-of-thought or a degenerate form of prompt compression. The paper’s framing is narrower and more radical. A one-shot response must encode everything up front, regardless of what the receiver already knows; QA instead performs what the paper characterizes as adaptive communication, in which later bits depend on what the smaller model has already inferred and what it remains uncertain about. This suggests that interactive protocols can transfer useful knowledge far more efficiently than transmitting full responses. Within the broader compression literature, that claim aligns with a general movement toward preserving only query-relevant evidence, but QA pushes the idea to its most compressed form: the final transmitted object can be only 10 bits (Rinberg et al., 9 Feb 2026).