Continuous Space Compression & Self-Distillation
- The paper introduces continuous space compression that projects high-dimensional structured data into compact, differentiable latent spaces while preserving essential features.
- Self-distillation leverages a model's internal teacher–student dynamics to iteratively refine compressed representations, enhancing both efficiency and interpretability.
- Empirical results demonstrate that these techniques improve performance in tasks such as language reasoning, semantic segmentation, and Transformer model compression with notable speedups and minimal accuracy loss.
Continuous space compression and self-distillation denote a set of techniques in modern deep learning whereby information, representational or reasoning processes, and predictive distributions are systematically mapped from a high-dimensional or structured form (such as natural language, full-resolution feature maps, or an over-complete functional basis) into a lower-dimensional continuous latent space. These mappings are optimized via self-distillation, in which a model or model component—often sharing weights or architecture with the student—acts as its own teacher, guiding the student to replicate the teacher’s behavior (outputs, internal representations, or latent dynamics) while operating in the compressed space. This paradigm underlies recent advances in LLM reasoning, semantic segmentation, and kernel methods, as well as highly efficient Transformer-based NLP systems.
1. Core Concepts: Continuous Compression and Self-Distillation
Continuous space compression involves projecting original structured or discrete representations into a compact, continuous latent space. Rather than relying on quantization, pruning, or discrete symbol mapping, these methods allocate continuous-valued “slots” (e.g., vectors, hidden states, spectral components) to represent essential information, maintaining differentiability and supporting gradient-based learning throughout.
Self-distillation refers to the process in which a single model (or two tightly coupled branches/networks of the same architecture) alternates between “teacher” and “student” roles, typically by first generating outputs using one pathway, then training a second pathway to match these outputs—often with an additional constraint on aligning internal activations or statistical behavior. Unlike traditional teacher–student paradigms, self-distillation recycles a network’s own knowledge within the same or a closely related architecture, enabling progressive refinement or compression without architectural redesign.
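The following is a minimal sketch of this generic pattern, assuming a hypothetical `model` that exposes `teacher_forward` and `student_forward` pathways and a batch dictionary with ground-truth `labels`; it is not tied to any of the specific systems discussed below.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(model, batch, optimizer, alpha=1.0):
    """One generic self-distillation step: the model's own "teacher" pathway
    produces soft targets that its "student" pathway is trained to match.
    `teacher_forward` / `student_forward` are hypothetical hooks standing in
    for the two tightly coupled pathways described above."""
    with torch.no_grad():                       # teacher pass acts as a fixed target
        teacher_logits = model.teacher_forward(batch)
    student_logits = model.student_forward(batch)

    # Supervised loss on ground truth plus distillation toward the teacher's soft outputs.
    task_loss = F.cross_entropy(student_logits, batch["labels"])
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss = task_loss + alpha * distill_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```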
2. Methodological Variants: Application Domains and Architectures
Table: Representative Frameworks
| Area | Method / System | Compression Target |
|---|---|---|
| LLM Reasoning | CODI (Shen et al., 28 Feb 2025) | CoT→continuous latent |
| Semantic Segmentation | Bi-dir. Feature Compression (Zheng et al., 2022) | 2D features→1D (lat/lon) |
| Hilbert Space/RKHS | Self-distillation (Mobahi et al., 2020) | Function spectrum |
| NLP Pre-training | MiniLM (Wang et al., 2020) | Transformer weights/attn |
CODI compresses explicit chain-of-thought (CoT) language into a fixed-length sequence of continuous latent reasoning steps, aligning the effect of these steps with explicit written derivations (Shen et al., 28 Feb 2025). Bi-directional Feature Compression projects dense 2D feature maps from panoramas into two orthogonal 1D continuous spaces (vertical and horizontal), exposing complementary semantic information before self-distillation refines their fusion (Zheng et al., 2022). Self-distillation in Hilbert space iteratively re-fits a kernel regressor on its own predictions, progressively shrinking the solution into a low-dimensional continuous subspace (Mobahi et al., 2020). MiniLM reduces full-scale Transformer models to thinner, shallower architectures by mimicking continuous-valued self-attention behavior and value relations directly (Wang et al., 2020).
3. Detailed Mechanisms and Training Objectives
CODI: Continuous Chain-of-Thought via Self-Distillation
CODI uses a shared Transformer backbone for both teacher and student. The teacher receives the natural-language CoT as input and is trained with standard language modeling. The student replaces the natural-language reasoning steps with continuous latent tokens, produced by iteratively projecting the previous hidden state through a small MLP and LayerNorm. In each training batch, the teacher's and student's hidden activations at the token immediately preceding the answer (typically the colon in “The answer is:”) are aligned with an $\ell_1$ loss normalized by the teacher's batchwise standard deviation and summed over all layers. The overall loss combines language modeling on the teacher, language modeling on the student, and the latent-alignment term:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}}^{\mathrm{teacher}} + \mathcal{L}_{\mathrm{LM}}^{\mathrm{student}} + \gamma\,\mathcal{L}_{\mathrm{align}},$$

where the alignment weight $\gamma$ is model-dependent (Shen et al., 28 Feb 2025). Only the compressed student pathway is used at inference.
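As a concrete illustration, the following is a minimal sketch of such an objective, assuming hypothetical `teacher_out` / `student_out` dictionaries that expose per-layer hidden states (shape `[layers, batch, seq, dim]`) and language-modeling losses from the shared backbone; it is not the reference CODI implementation.

```python
import torch

def codi_style_loss(teacher_out, student_out, answer_pos, gamma=1.0):
    """Sketch of a CODI-style objective: align teacher and student hidden states
    at the token just before the answer with an L1 loss normalized by the
    teacher's batchwise standard deviation, summed over layers, and add the
    two language-modeling losses."""
    t_h = teacher_out["hidden_states"][:, :, answer_pos, :]   # [layers, batch, dim]
    s_h = student_out["hidden_states"][:, :, answer_pos, :]

    # Normalize the gap by the teacher's standard deviation over the batch,
    # then sum the L1 distance over all layers.
    std = t_h.std(dim=1, keepdim=True) + 1e-6
    align = ((t_h - s_h).abs() / std).mean(dim=(1, 2)).sum()

    return teacher_out["lm_loss"] + student_out["lm_loss"] + gamma * align
```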
Bi-directional Feature Compression and Self-Distillation
The bi-directional framework for 360° segmentation first processes the multi-scale feature map with two compression mappings, one collapsing along the height axis and one collapsing along the width axis, yielding two sets of 1D representations. Feature fusion follows upsampling of both compressed streams. The fused output serves as the “teacher,” while the individual streams act as “students.” Self-distillation is achieved by minimizing cross-entropy on all branches, KL divergence between the fused and per-stream predictions, and an alignment term on the bottleneck features:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{feat}}\,\mathcal{L}_{\mathrm{feat}},$$

with the weights $\lambda_{\mathrm{KL}}$ and $\lambda_{\mathrm{feat}}$ set empirically (Zheng et al., 2022).
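A minimal sketch of the height/width collapse and the fused-teacher objective follows; the mean-pooling and 1×1-convolution choices, the tensor shapes, and the omission of the bottleneck-feature term are simplifying assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalCompression(nn.Module):
    """Collapse a 2D feature map [B, C, H, W] into a height-collapsed and a
    width-collapsed 1D stream, re-expand both to 2D, and fuse them."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.head_v = nn.Conv1d(channels, num_classes, kernel_size=1)  # height-collapsed stream
        self.head_h = nn.Conv1d(channels, num_classes, kernel_size=1)  # width-collapsed stream
        self.fuse = nn.Conv2d(2 * num_classes, num_classes, kernel_size=1)

    def forward(self, feat):                      # feat: [B, C, H, W]
        _, _, H, W = feat.shape
        v_logits = self.head_v(feat.mean(dim=2))  # collapse height -> [B, K, W]
        h_logits = self.head_h(feat.mean(dim=3))  # collapse width  -> [B, K, H]
        # Re-expand each 1D stream back to 2D and fuse them into the "teacher" prediction.
        v2d = v_logits.unsqueeze(2).expand(-1, -1, H, -1)
        h2d = h_logits.unsqueeze(3).expand(-1, -1, -1, W)
        fused = self.fuse(torch.cat([v2d, h2d], dim=1))
        return fused, v2d, h2d

def bidir_self_distill_loss(fused, v2d, h2d, labels, lam=1.0):
    """Cross-entropy on all branches plus KL from the fused teacher to each
    1D student stream (bottleneck-feature alignment omitted for brevity)."""
    ce = sum(F.cross_entropy(x, labels) for x in (fused, v2d, h2d))
    kl = sum(
        F.kl_div(F.log_softmax(x, dim=1), F.softmax(fused.detach(), dim=1),
                 reduction="batchmean")
        for x in (v2d, h2d)
    )
    return ce + lam * kl
```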
Hilbert Space Self-Distillation
Given training data $\{(x_i, y_i)\}_{i=1}^{n}$ and RKHS norm $\|\cdot\|_{\mathcal{H}}$, self-distillation applies iterative soft-label re-fitting such that, at round $t$,

$$f_t = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i) - y_i^{(t-1)}\bigr)^2 \le \epsilon, \qquad y_i^{(t)} = f_t(x_i),\quad y_i^{(0)} = y_i.$$

Each round refines the representer coefficients to more heavily discount minor eigenmodes (small eigenvalues $\lambda_k$ of the kernel Gram matrix) (Mobahi et al., 2020). The solution rapidly collapses onto a principal subspace, providing an analytically precise form of continuous compression.
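The iterative re-fitting can be sketched with an ordinary kernel ridge regressor standing in for the constrained problem above; the ridge constant `reg` plays a role analogous to the constraint level $\epsilon$, and the spectral effect is noted in the final comment.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_self_distillation(X, y, rounds=5, reg=1e-2, gamma=1.0):
    """Each round re-fits a kernel ridge regressor on the previous round's
    predictions; `reg` is a stand-in for the constraint level epsilon."""
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    targets = y.astype(float).copy()
    coefficients = []
    for _ in range(rounds):
        alpha = np.linalg.solve(K + reg * np.eye(n), targets)  # representer coefficients
        targets = K @ alpha                                    # soft labels for the next round
        coefficients.append(alpha)
    return coefficients

# Spectral view: writing K = sum_k lambda_k u_k u_k^T, each round multiplies the
# component of the targets along u_k by lambda_k / (lambda_k + reg), so after T
# rounds minor eigenmodes are suppressed by (lambda_k / (lambda_k + reg))**T.
```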
MiniLM: Deep Self-Attention Distillation
MiniLM compresses Transformers by directly matching the continuous-valued distributions of both the self-attention (query–key) and value-relation matrices of the teacher's last layer to those of the student. The total distillation loss over the self-attention and value-relation terms is

$$\mathcal{L} = \mathcal{L}_{\mathrm{AT}} + \mathcal{L}_{\mathrm{VR}},$$

where each term is a KL divergence between the teacher's and student's per-head distributions (Wang et al., 2020). No task-specific loss is used for task-agnostic compression; the objective relies solely on the statistical closeness of the attention matrices. The architecture and size of the student network can be chosen freely, since the matching occurs in continuous-valued space without explicit projection layers.
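A minimal sketch of the attention and value-relation matching is shown below, assuming last-layer query/key/value tensors of shape `[batch, heads, seq, head_dim]` and the same number of attention heads in teacher and student; head dimensions may differ.

```python
import torch
import torch.nn.functional as F

def minilm_style_loss(q_t, k_t, v_t, q_s, k_s, v_s):
    """Match the teacher's last-layer query-key attention distributions and
    value-value relation distributions with KL divergence; both are [S, S]
    matrices per head, so teacher and student hidden sizes need not agree."""
    def kl(teacher_logits, student_logits):
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        F.softmax(teacher_logits, dim=-1),
                        reduction="batchmean")

    d_t, d_s = q_t.size(-1), q_s.size(-1)
    attn_t = q_t @ k_t.transpose(-1, -2) / d_t ** 0.5   # teacher attention logits [B, H, S, S]
    attn_s = q_s @ k_s.transpose(-1, -2) / d_s ** 0.5   # student attention logits
    vr_t = v_t @ v_t.transpose(-1, -2) / d_t ** 0.5     # teacher value-relation logits
    vr_s = v_s @ v_s.transpose(-1, -2) / d_s ** 0.5     # student value-relation logits
    return kl(attn_t, attn_s) + kl(vr_t, vr_s)
```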
4. Compression Rate, Efficiency, and Model Selection
Compression rate in continuous space compression is quantified as the ratio of the original representation length to the length in the compressed continuous space. For example, CODI achieves roughly $3.1\times$ compression on GSM8K-Aug (25.1 CoT tokens reduced to 8 latent slots) and roughly $7.8\times$ on GSM8K-Aug-NL (62.1 tokens) (Shen et al., 28 Feb 2025). MiniLM compresses 12-layer BERT models (109M parameters) to 6-layer (42M) or 6×384 (22M) students with minimal accuracy loss and substantial inference speedups (Wang et al., 2020). In 360° segmentation, bi-directional feature compression paired with self-distillation increases mean IoU by 10–12 percentage points relative to prior state of the art at all tested resolutions (Zheng et al., 2022).
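As a tiny worked example, the ratios above follow directly from the definition:

```python
def compression_rate(original_len, compressed_len):
    """Original representation length divided by compressed length."""
    return original_len / compressed_len

print(compression_rate(25.1, 8))   # GSM8K-Aug:    ~3.1x
print(compression_rate(62.1, 8))   # GSM8K-Aug-NL: ~7.8x
```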
In Hilbert space self-distillation, compression is spectral: later rounds aggressively shrink minor-eigenvalue components, producing low-dimensional subspaces spanned by principal eigenfunctions (Mobahi et al., 2020). Excessive compression induces underfitting, so validation-based early stopping of the distillation rounds is advisable.
5. Interpretability and Probing of Compressed Continuous Spaces
Recent frameworks provide empirical evidence that compressed continuous slots encode semantically interpretable content. In CODI, probing by projecting latent reasoning states back to the vocabulary or by analyzing attention context demonstrates that the compressed slots often recover intermediate arithmetic results or the “thought steps” of an explicit chain-of-thought. Quantitative probing shows that CODI’s latent tokens match human-written intermediate steps with high fidelity: 97.1% accuracy for one-step, 83.9% for two-step, and 75.0% for three-step problems (top-5 matches) (Shen et al., 28 Feb 2025).
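The vocabulary-projection probe can be sketched as follows, with `lm_head` and `tokenizer` as hypothetical stand-ins for the model's output head and tokenizer; it simply reads off the top-k tokens nearest to a compressed latent slot.

```python
import torch

def probe_latent_slot(latent_state, lm_head, tokenizer, top_k=5):
    """Project a compressed latent reasoning state through the LM output head
    and return the top-k decoded vocabulary tokens for inspection."""
    logits = lm_head(latent_state)                       # [vocab_size]
    top_ids = torch.topk(logits, top_k).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]
```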
In bi-directional feature compression for panoramas, qualitative analyses show that the fused and self-distilled features recover fine structures and shape cues distorted or lost in non-compressed or singly compressed variants (Zheng et al., 2022).
6. Theoretical Insights and Regularization Effects
Self-distillation acts as a continuous spectral filter in high-dimensional function spaces. In RKHS, each round of self-distillation applies a multiplicative shrinkage to the basis coefficient $c_k$, exponentially privileging top eigenmodes; after $t$ rounds the shrinkage takes the approximate form

$$c_k^{(t)} \approx \left(\frac{\lambda_k}{\lambda_k + \epsilon}\right)^{t} c_k^{(0)},$$

where $\lambda_k$ are the eigenvalues of the kernel Gram matrix and $\epsilon$ is the regularization level. This process reduces overfitting by preferentially retaining "smooth" directions, but excessive iterations cause collapse to trivial (zero) solutions (Mobahi et al., 2020). Similar effects, though less analytically studied, may underlie performance gains from self-distillation in deep models, amplifying implicit regularization and smoothing predictive distributions.
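A short numerical illustration of this shrinkage, under the ridge-style form assumed above:

```python
import numpy as np

lam = np.array([10.0, 1.0, 0.1, 0.01])   # eigenvalues of the kernel Gram matrix
eps = 0.1                                # regularization level
for t in (1, 3, 10):
    # Multiplicative shrinkage applied to each basis coefficient after t rounds.
    print(t, (lam / (lam + eps)) ** t)
# Large eigenvalues survive many rounds; minor eigenmodes decay toward zero,
# which is the collapse onto a principal subspace described above.
```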
A plausible implication is that continuous space compression via self-distillation provides robust, efficient, and interpretable models by concentrating learning capacity on salient informational subspaces and aligning internal function spaces between student and teacher pathways.
7. Impact Across Architectures and Tasks
Continuous space compression paired with self-distillation strategies achieves state-of-the-art or near-parity results in diverse domains:
- In LLMs, CODI matches explicit CoT on GSM8K (43.7% vs. 44.1% for GPT-2), with the roughly 3.1× compression of the reasoning trace noted above and a corresponding inference speedup. Out-of-distribution robustness also improves (Shen et al., 28 Feb 2025).
- For 360° segmentation, bi-directional compression with self-distillation achieves 53.8% mIoU / 66.5% mAcc, exceeding prior art by roughly 10–12 pp mIoU across the tested resolutions (Zheng et al., 2022).
- MiniLM preserves over 99% of BERT’s task accuracy with half the parameters and large speedups, demonstrating that distillation in continuous attention space is sufficient for maintaining performance in NLP (Wang et al., 2020).
These results validate continuous space compression via self-distillation as a general paradigm for building compact, efficient, and interpretable deep learning systems without significant compromises in accuracy or robustness.