Compressed Token Distillation
- CTD is an umbrella term for techniques that compress and transfer token-level supervision, enabling efficiency gains in complex models.
- It involves diverse methods such as post-hoc reasoning trace compression, cross-tokenizer distillation via byte-level interfaces, and pre-attention representation compression.
- Empirical results demonstrate that CTD can reduce token counts by up to 18× while trading off minimal accuracy, offering a practical efficiency–fidelity balance.
Compressed Token Distillation (CTD) is a non-unified term in recent machine learning literature that denotes several related but distinct procedures for transferring capability under a reduced token budget. In one usage, CTD refers to distilling students on post-hoc compressed chain-of-thought traces so that training consumes fewer tokens and inference produces shorter rationales (Griot et al., 4 Jun 2026). In another, CTD denotes cross-tokenizer distillation, where teacher and student use different vocabularies and the transfer is mediated through a shared byte-level interface rather than a shared token space (Singh et al., 8 Apr 2026). Other papers use the term, or map naturally onto it, for pre-attention sequence compression in embedding models, spatio-temporally compressed supervision for video encoders, and related compressed-token or latent-token schemes (Zhang et al., 18 Nov 2025, Kim et al., 17 May 2026). This suggests that CTD is best understood as an umbrella for token-efficiency-oriented distillation rather than as a single standardized algorithm.
1. Terminological scope and conceptual boundaries
Recent papers attach the label CTD to different compression targets, supervision objects, and deployment goals. The common denominator is that a student or downstream model is trained to preserve useful behavior after some reduction, remapping, or compression of token-level supervision. What varies is whether the compressed object is a reasoning trace, a tokenizer interface, a pre-attention sequence, or a spatio-temporally pooled latent representation.
| Usage of CTD | Compressed or aligned object | Primary goal |
|---|---|---|
| "Compress-Distill" (Griot et al., 4 Jun 2026) | Teacher-produced chain-of-thought traces | Reduce training tokens and inference verbosity |
| "Cross-Tokenizer LLM Distillation through a Byte-Level Interface" (Singh et al., 8 Apr 2026) | Teacher and student token spaces via bytes | Enable distillation across mismatched tokenizers |
| "Jasper-Token-Compression-600M" (Zhang et al., 18 Nov 2025) | Input sequences before attention | Preserve embedding quality while reducing latency |
| "LiteFrame" (Kim et al., 17 May 2026) | Teacher vision tokens after spatio-temporal compression | Bypass redundant visual-token computation |
The ambiguity is explicit in the byte-level distillation paper, which states that CTD there means cross-tokenizer distillation, not compression of tokens (Singh et al., 8 Apr 2026). By contrast, the reasoning-trace and video papers use CTD in the more literal sense of compressing teacher supervision or teacher representations before student training (Griot et al., 4 Jun 2026, Kim et al., 17 May 2026). A plausible implication is that any encyclopedia treatment of CTD must distinguish the acronym’s local paper-specific meaning from its broader role as shorthand for compression-aware distillation.
2. Post-hoc compression of reasoning traces
In "Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation" (Griot et al., 4 Jun 2026), CTD means training smaller student models on teacher-produced chain-of-thought traces that have been post-hoc compressed to preserve the essential reasoning and the final answer while being much shorter. The pipeline has three stages. First, a teacher generates chain-of-thought within > …</think> plus a final answer, and only verified-correct traces are retained. The retained sets comprise 283,335 correct traces from Qwen3.5-397B-A17B and 281,911 correct traces from gpt-oss-120B. Second, an instruction-tuned compressor rewrites each correct triple into a shorter trace using a single, generic prompt at temperature $0.3$; the compressors are Llama-3.3-70B-Instruct and Ministral-3-14B-Instruct-2512. Third, students are trained on raw traces, compressed traces, or answer-only targets with next-token prediction on assistant tokens only, using the chat template User: q_i → Asst: <think> t̃_i a_i, where .
The students are Qwen3.5-0.8B-Base, Llama-3.1-8B, Qwen3.5-9B-Base, and gpt-oss-20B. Training uses either LoRA with rank $64$, , dropout $0.05$, learning rate , and one epoch, or full fine-tuning with learning rate and one epoch; FSDP v2 is used for 8B, 9B, and 20B. Tokenization and formatting use 16,384-token sequences, sample packing, BF16, FlashAttention 2, CutCrossEntropy, a consistent chat template, and greedy decoding at inference with an 8,192-token cap. The main grid contains 48 runs, with seven additional Qwen-teacher truncation ablations (Griot et al., 4 Jun 2026).
The reported compression is substantial. Character-level mean compression ratios under Qwen3.5-397B are 0 for Llama-70B and 1 for Ministral-14B; under gpt-oss-120B they are 2 and 3. Across the full study, compressed traces are reduced to 8.6–21.0% of the original character length, training tokens fall to 12–30% of raw, training speeds up by 4–5, and inference outputs become 6–7 shorter. At the same time, raw traces retain the highest downstream accuracy at every scale and for both teachers. Representative rows are Qwen teacher, Qwen-9B Full: raw 8 vs L70 9 vs M14 0, and gpt-oss teacher, gpt-oss-20B Full: raw 1 vs L70 2 vs M14 3 (Griot et al., 4 Jun 2026).
The paper therefore characterizes CTD as an accuracy–efficiency trade-off rather than a free improvement. Students retain up to 96% of raw-trace accuracy while achieving up to 18× higher per-token efficiency, and compressed traces dominate on cost and latency. A length-matched truncation ablation further shows that the benefit is not explained by “just fewer tokens”: model-compressed traces usually beat or match naive truncation at equal length, especially for smaller students, while also producing shorter inference outputs. At the 0.8B scale under LoRA, compressed traces narrow the raw-versus-compressed gap but do not exceed raw (Griot et al., 4 Jun 2026).
3. Self-distilled Long2Short reasoning compression
"TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs" (Zhang et al., 17 Nov 2025) operationalizes the central CTD idea in a self-distillation-style Long2Short setting rather than a teacher–student pipeline. The method uses only self-generated data and combines adaptive reasoning depth selection, distribution-aligned intra-step refinement, and a composite preference-learning objective. Its stated objective is to reduce chain-of-thought token usage while preserving reasoning fidelity and task accuracy.
Adaptive depth selection begins from multiple self-sampled responses per problem. If 4 is the number of sampled responses and 5 the number of correct ones, the method defines
6
Correct traces are sorted by token length, and the shortest correct traces up to index 7 are selected as preferred positives. The intuition given in the paper is that easier problems, with higher 8, should favor shorter chains, whereas harder problems should preserve more depth. The paper reports that 9 provides the best balance (Zhang et al., 17 Nov 2025).
The second component rewrites each reasoning step into a shorter form while constraining the model’s continuation distribution. For a step $0.3$0, the paper samples $0.3$1 candidate rewrites at temperature $0.3$2 and selects the shortest candidate satisfying a KL-divergence constraint with threshold $0.3$3 over a window $0.3$4. The optimization is written as
$0.3$5
The final training objective mixes length-aware DPO-L and SFT with $0.3$6, learning rate $0.3$7, batch size $0.3$8, full-parameter fine-tuning, and context length $0.3$9 (Zhang et al., 17 Nov 2025).
Empirically, the paper reports that DeepSeek-R1-Distill-Qwen-7B fine-tuned with TokenSqueeze achieves a 50% average token reduction while preserving accuracy on MATH500: baseline accuracy 0 with Len-T 1 versus TokenSqueeze accuracy 2 with Len-T 3. On AIME24 for the same model, TokenSqueeze improves accuracy from 4 to 5 while reducing Len-T from 6 to 7. The paper also reports up to 15.5% higher accuracy on AIME24 at 3K tokens and 43.1% higher accuracy on MATH500 at 1K tokens versus the base model. Ablations indicate that “No Refinement” mainly reduces the number of steps, whereas full TokenSqueeze additionally shortens per-step expression; DPO-L plus SFT yields the best balance relative to DPO or SFT alone (Zhang et al., 17 Nov 2025).
Within the broader CTD landscape, TokenSqueeze differs from trace-compression distillation in that it does not rely on an external teacher. Its compression is enforced through self-generated preference data and KL-constrained rewrites rather than through teacher-authored compressed traces. The paper nevertheless frames this as preserving logical content under token reduction, which closely aligns with the broader CTD objective (Zhang et al., 17 Nov 2025).
4. Cross-tokenizer distillation through a byte-level interface
In "Cross-Tokenizer LLM Distillation through a Byte-Level Interface" (Singh et al., 8 Apr 2026), CTD means cross-tokenizer distillation: transferring knowledge from a teacher LLM to a student LLM when the two use different tokenizers. If the teacher uses vocabulary 8 and tokenizer 9, and the student uses $64$0 and tokenizer $64$1 with $64$2 and $64$3, then the standard shared-vocabulary KL objective is not well-defined. The proposed baseline, Byte-Level Distillation (BLD), uses the byte level as a common interface. Teacher token probabilities are converted into byte-level probabilities, and a lightweight byte-level decoder head is attached to the student.
The method formalizes a byte alphabet $64$4 and derives teacher next-byte probabilities by summing over teacher tokenization paths compatible with a byte prefix. Exact computation is expensive, so the paper adopts the approximation of Vieira et al. (2025) via beam search with beam width $64$5 and pruning threshold $64$6. The reported setting $64$7, $64$8 achieves Jensen–Shannon divergence $64$9 to a high-precision reference with 0, 1, requires 2 s/sample for 100–150 byte sequences on 4×RTX 3090, and takes 3 days to precompute byte probabilities for the Tulu-3 dataset using parallelization (Singh et al., 8 Apr 2026).
On the student side, BLD adds a byte-level decoder head 4 in parallel to the token-level head. In experiments, 5 is fixed to 6, so only the first 10 bytes of a token receive supervision. The aggregate loss combines token-level supervised next-token prediction, byte-level cross-entropy on ground-truth bytes, and byte-level KL to match the teacher’s next-byte distribution. For tokenizer transfer, embeddings and LM head are reinitialized with Fast Vocabulary Transfer, LoRA rank is 7, and a representative loss setting uses 8 and 9 (Singh et al., 8 Apr 2026).
The reported results are mixed but competitive. In BPE→BPE transfer from Llama3.2-3B-Instruct to a Qwen2 tokenizer, BLD attains PiQA $0.05$0, ARC-C $0.05$1, BoolQ $0.05$2, MMLU $0.05$3, AGI-EN $0.05$4, AGI-ZH $0.05$5, and IFEval $0.05$6; the paper notes that it is competitive but struggles on instruction-following. In BPE→byte transfer, all methods degrade substantially, and no method dominates. In cross-model CTD from OpenMath2-Llama3.1-8B to Gemma2-2B, BLD reaches GSM8K $0.05$7 and MATH $0.05$8, outperforming SFT on GSM8K but not on MATH. The paper’s conclusion is explicit: consistent improvements across all tasks and benchmarks remain elusive, and CTD remains an open problem (Singh et al., 8 Apr 2026).
5. Pre-attention and spatio-temporal compressed representations
A separate line of work applies CTD to internal representations before the dominant compute stage, rather than to reasoning traces or tokenizer interfaces. In "Jasper-Token-Compression-600M Technical Report" (Zhang et al., 18 Nov 2025), the model is a bilingual embedding system that inserts a compression module between token embeddings and Transformer attention blocks. The module consists of a randomly initialized Qwen3MLP (SwiGLU) FFN followed by a training-free AdaptiveAvgPool1d. If input length is $0.05$9, threshold is 0, and compression ratio is 1, the target length rule is
2
Stage 2 uses fixed compression with 3 and 4, while Stage 3 samples 5 dynamically across several ranges. Distillation uses cosine alignment to teacher embeddings, then adds a batchwise similarity-preservation MSE; Stage 4 adds InfoNCE-style contrastive learning and soft KL distillation over similarity scores (Zhang et al., 18 Nov 2025).
The reported outcome is a 600M embedding model with English Mean(Task) 6 and Chinese Mean(Task) 7, compared with baseline Qwen3-Embedding-0.6B scores of 8 and 9. Latency at batch size 32 falls from 0 ms to 1 ms at length 2 for 3, and from 4 ms to 5 ms at length 6. The paper states that at 7–8, performance stays essentially flat versus the best setting while halving latency at long inputs (Zhang et al., 18 Nov 2025).
In "LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs" (Kim et al., 17 May 2026), CTD trains a compact student vision encoder to predict spatio-temporally compressed teacher representations directly. The teacher is InternViT-300M from InternVL3-8B, which emits dense tokens; the compression operator is Weighted Average Pooling (WAP), applied to teacher features to produce information-dense compressed targets. The student is a ViT-Base encoder with 12 layers, width 9, depth-wise 1D temporal convolutions, and progressive spatio-temporal downsampling. The core loss is
00
Training uses AdamW, cosine schedule, linear warmup, global batch size 01 on 8× H100 GPUs, learning rate 02, warmup 03 epochs, and total CTD pretraining of 04 epochs. A later LLM Adaptation stage fine-tunes both student encoder and LLM with LoRA rank 05, 06, and dropout 07 (Kim et al., 17 May 2026).
The paper reports a new latency–accuracy Pareto frontier. At the high-frame regime, the teacher baseline processes 32 frames at 256 tokens/frame with total latency 08 ms and average accuracy 09, whereas LiteFrame processes 256 frames at 16 tokens/frame with total latency 10 ms and average accuracy 11. At 64 frames, the figure caption reports LLM prefilling 12 faster and ViT encoding 13 faster than InternVL3-8B. An ablation compares CTD with a reconstructive variant, RTD, and finds that CTD without LMA already exceeds RTD plus LMA. Here CTD is not about language tokenization at all; it is about teaching a student to emit the teacher’s compressed latent tokens directly, thereby bypassing dense teacher computation (Kim et al., 17 May 2026).
6. Comparative context, diagnostics, and unresolved issues
Several adjacent papers do not define their method as CTD in the narrow sense but are important for understanding the broader compressed-token landscape. "Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in LLMs" (Zhao et al., 2024) studies compressed or gist tokens that summarize long contexts and shows that assigning them uniformly spread position identifiers inside the original input span improves memorization under RoPE. A closed-form rule consistent with the paper’s description is
14
With this position layout and a compression loss, the method reaches 15× compression with 96.9 BLEU4 and 31× compression with 84.4 BLEU4, whereas an ICAE-style reproduction without the position-ID design and without compression loss falls to 52.7 BLEU4 at 15× (Zhao et al., 2024).
"LLM as Token Compressor and Decompressor" (Li et al., 26 Mar 2026) develops a self-expressive autoencoding framework in which a pretrained LLM translates surface text into a variable-length sequence of discrete Z-tokens and reconstructs the original text from them. The compressor is autoregressive in the latent alphabet, the decompressor is constrained to the base vocabulary, and the total loss is 15. The method reports up to 18 times token reduction and near-exact reconstruction at moderate compression ratios, including BLEU-4 16 at 4× on Wikipedia. The paper explicitly contrasts this with CTD-style methods, noting that its primary objective is self-reconstruction rather than teacher-driven downstream supervision (Li et al., 26 Mar 2026).
A further diagnostic perspective appears in "Compressed code: the hidden effects of quantization and distillation on programming tokens" (Siniaev et al., 5 Jan 2026), which maps CTD to token-level preservation of code-relevant distributions under compression. The paper defines cold-start measures such as Programming Keywords Probability (PKP) and Special Tokens Probability (STP) from 17. It reports that DeepSeek-R1-Distill-Qwen-1.5B has PKP 18 and STP 19, whereas Qwen2.5 base models across sizes show much more balanced PKP, approximately 20–21, with STP approximately 22–23. It also reports that moderate quantization can improve the PKP–STP balance relative to more aggressive quantization, for example Q4_K_S versus Q2_K on Qwen2.5-Coder-7B. This line of work does not propose a new CTD algorithm, but it shows that compressed or distilled students can substantially redistribute probability mass over token categories (Siniaev et al., 5 Jan 2026).
Across these papers, several limitations recur. Reasoning-trace compression reduces training tokens and latency but does not overtake raw-trace accuracy, and medicine shows the largest raw advantage in the reported study (Griot et al., 4 Jun 2026). Byte-level cross-tokenizer distillation remains inconsistent across tasks, especially for instruction-following and BPE→byte transfer (Singh et al., 8 Apr 2026). Jasper still trails its 8B teacher on retrieval, LiteFrame depends on teacher quality and the suitability of WAP targets, TokenSqueeze remains offline-only and sensitive to the KL threshold 24, and cold-start token diagnostics do not replace contextual evaluation (Zhang et al., 18 Nov 2025, Kim et al., 17 May 2026, Zhang et al., 17 Nov 2025, Siniaev et al., 5 Jan 2026). The aggregate picture is therefore stable: CTD methods can yield large savings in token count, latency, or context length, but the dominant empirical pattern is a controlled trade-off between efficiency and fidelity rather than a universal accuracy gain.