
IB-DTC: Info Bottleneck Token Compression

Updated 26 November 2025
  • IB-DTC is a framework that applies the Information Bottleneck principle to dynamically compress feature tokens, preserving task-critical signals while eliminating redundant data.
  • It integrates low-rank SVD, cross-attention-based selection, and stochastic downsampling to efficiently reduce token count in applications like NLP and 3D tracking.
  • Empirical results show IB-DTC enhances inference speed and reduces computational cost with minimal accuracy loss, making it valuable for real-time and resource-constrained scenarios.

Information Bottleneck-guided Dynamic Token Compression (IB-DTC) describes a principled framework for adaptively compressing feature tokens in deep learning models, optimizing for high predictive information while minimizing redundant representation. The approach is rooted in classical Information Bottleneck (IB) theory and has been operationalized in diverse domains, including 3D point cloud tracking, transformer-based NLP, and LLM context compression. Characteristic IB-DTC pipelines employ a combination of mutual information estimation, low-rank approximations, cross-attention score modeling, and stochastic selection mechanisms to generate compact token subsets (“proxy tokens”) that preserve task-critical signals.

1. Theoretical Foundations: Information Bottleneck Principle

The IB-DTC methodology formalizes token compression as an IB optimization. Given input tokens $X$ and output targets $y$, the key objective is to find a compressed representation $Z$ that retains maximal information relevant to predicting $y$, subject to strong compression constraints. This is framed by:

$$\min_{g} I(X; Z) \quad \text{subject to} \quad I(Z; y) \geq I_0$$

where $I(\cdot;\cdot)$ denotes mutual information and $g$ is a mapping from full to compressed tokens. Direct mutual information optimization is intractable in high-dimensional embedding spaces. A common surrogate is optimal low-rank approximation: for token matrices $X \in \mathbb{R}^{N \times C}$, the best rank-$K$ approximation discards redundant (low-variance) directions, aligning with the IB’s compression-fidelity tradeoff (Zhou et al., 19 Nov 2025).
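In practice, the constrained formulation above is typically relaxed into an unconstrained Lagrangian (a standard IB restatement, included here for context rather than taken from the cited papers):

$$\min_{g}\; I(X; Z) - \beta\, I(Z; y)$$

where the multiplier $\beta > 0$ sets the compression-fidelity tradeoff; the query-conditioned NLP objective below is an instance of this relaxed form.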

In NLP, the IB objective can condition on a query $Q$, phrased as:

$$L_{IB} = I(Z; X \mid Q) - \beta\, I(Z; Y \mid Q)$$

for compressed context tokens $Z$, output $Y$, and tradeoff coefficient $\beta$ (Wang et al., 20 Aug 2024).

2. Algorithmic Implementations and Core Mechanisms

Dynamic token compression under IB is instantiated via several domain-specific algorithms.

Low-Rank SVD-Based Compression

In 3D point cloud tracking (CompTrack), foreground tokens $X_{fg}$ are compressed as follows:

  • Compute $X_{fg}'$ with positional encoding.
  • Apply SVD: $X_{fg}' = U \Sigma V^T$.
  • Select the minimal $K$ such that $\sum_{i=1}^K \sigma_i^2 \geq \tau \cdot \text{total energy}$ (typically $\tau = 0.99$).
  • Retain the $K$ dominant directions as a proxy token basis $Q_{SVD}$.

This truncated SVD provides near-lossless compression; singular values in sparse data decay rapidly, so $K \ll N$ suffices for empirical fidelity (Zhou et al., 19 Nov 2025).
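A minimal NumPy sketch of this energy-thresholded truncation (illustrative only; the particular construction of proxy tokens from the top-$K$ directions is an assumption, not the CompTrack implementation):

```python
import numpy as np

def svd_proxy_basis(X_fg: np.ndarray, tau: float = 0.99) -> np.ndarray:
    """Compress an (N, C) token matrix into K proxy tokens of dimension C.

    K is the smallest rank whose singular values retain a tau fraction of the
    total spectral energy (sum of squared singular values).
    """
    U, S, Vt = np.linalg.svd(X_fg, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    K = int(np.searchsorted(energy, tau)) + 1       # minimal K with >= tau energy
    # One plausible proxy construction: the K dominant right-singular directions,
    # scaled by their singular values, giving a (K, C) proxy token basis.
    return S[:K, None] * Vt[:K, :]

# Toy example: ~500 sparse foreground tokens whose spectrum decays quickly.
rng = np.random.default_rng(0)
X_fg = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 128)) \
       + 0.01 * rng.normal(size=(500, 128))
proxy = svd_proxy_basis(X_fg, tau=0.99)
print(proxy.shape)                                  # (K, 128) with K << 500
```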

Cross-Attention-Based Selection

QUITO-X models mutual information between tokens and the output via cross-attention scores in encoder–decoder transformers:

  • Forward pass: $h_{1:T} = f_{enc}(X, Q)$.
  • Cross-attention from the decoder start token yields attention scores $a_t$.
  • Aggregated and smoothed scores $s(w)$ identify critical words.
  • The top-$\tau$ fraction of high-scoring tokens is retained to form the compressed context (Wang et al., 20 Aug 2024).
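A compact sketch of the scoring-and-selection step under these definitions (the head aggregation, smoothing window, and attention extraction are illustrative assumptions, not the QUITO-X implementation):

```python
import torch
import torch.nn.functional as F

def select_tokens_by_cross_attention(attn: torch.Tensor, tokens: list,
                                     keep_frac: float = 0.5, window: int = 3):
    """Rank context tokens by decoder-to-encoder cross-attention, keep the top fraction.

    attn: (num_heads, T) cross-attention weights from the decoder start token
          over T encoder positions (assumed to be extracted from the model).
    """
    scores = attn.mean(dim=0)                        # aggregate over heads -> (T,)
    # Moving-average smoothing so isolated spikes do not dominate the ranking.
    kernel = torch.ones(1, 1, window) / window
    smoothed = F.conv1d(scores.view(1, 1, -1), kernel,
                        padding=window // 2).view(-1)[: len(tokens)]
    k = max(1, int(keep_frac * len(tokens)))
    keep_idx = torch.topk(smoothed, k).indices.sort().values   # keep original order
    return [tokens[i] for i in keep_idx.tolist()]

# Toy usage with random attention weights over an 8-token context.
tokens = "the answer to the question is forty two".split()
attn = torch.rand(12, len(tokens))                   # 12 heads, 8 positions
print(select_tokens_by_cross_attention(attn, tokens, keep_frac=0.5))
```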

Stochastic Token Downsampling and Pruning

In Infor-Coef, token downsampling is executed dynamically via:

  • MLP “Samplers” generate per-token pruning probabilities $\pi_l$.
  • Gumbel-Softmax reparametrization samples hard binary masks $z_l^i \sim \text{Bernoulli}(\pi_l^i)$.
  • Masked tokens are blocked from attention and computation; at inference these tokens are physically dropped, yielding sublinear FLOPs scaling (Tan, 2023).
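A minimal PyTorch sketch of this mask-sampling step (the sampler architecture, temperature, and inference threshold are illustrative assumptions, not the Infor-Coef code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSampler(nn.Module):
    """Per-token keep/drop sampler using a straight-through Gumbel-Softmax mask."""

    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        # Small MLP producing logits for the two outcomes [drop, keep].
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                 nn.Linear(dim // 2, 2))
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> binary keep-mask of shape (batch, seq_len, 1)
        logits = self.mlp(x)
        if self.training:
            # Hard one-hot sample with straight-through gradients.
            mask = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]
        else:
            # At inference, threshold the keep probability; kept tokens can then be
            # physically gathered so dropped ones cost no computation.
            mask = (logits.softmax(dim=-1)[..., 1:] > 0.5).float()
        return mask

# Toy usage: mask 4 sequences of 16 tokens with 64-dim embeddings.
sampler = TokenSampler(dim=64).eval()
mask = sampler(torch.randn(4, 16, 64))
print(mask.shape, int(mask.sum()))       # (4, 16, 1) and the number of kept tokens
```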

3. Integration in Modern Learning Pipelines

IB-DTC modules are incorporated at various junctures.

  • CompTrack (3D tracking): IB-DTC follows a Spatial Foreground Predictor (SFP) that filters spatial background noise. The compressed proxy tokens $X_p$ feed directly into the prediction head for target offset, orientation, and class. Gradients bypass the SVD, flowing solely through learnable query bases and cross-attention weights (see the sketch after this list).
  • QUITO-X (LLM context): Cross-attention token scores guide context selection for in-context learning; the compressed context is passed to downstream LLMs for QA.
  • Infor-Coef (Transformer models): Static mask pruning and dynamic downsampling are jointly applied at each layer; the process is supervised by IB loss and cross-entropy, leading to lean model architectures and reduced inference cost.
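As a schematic of the first integration pattern, a compression module can sit between a foreground filter and the prediction head with the SVD step kept outside the gradient path (module names, shapes, and the fixed proxy count are hypothetical, not the CompTrack architecture):

```python
import torch
import torch.nn as nn

class IBTokenCompressor(nn.Module):
    """Compress N foreground tokens into K proxy tokens via cross-attention.

    Any SVD-derived statistic is computed under no_grad, so gradients flow only
    through the learnable queries and attention projections, as described above.
    """

    def __init__(self, dim: int, num_proxy: int = 78, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_proxy, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_fg: torch.Tensor) -> torch.Tensor:
        # x_fg: (batch, N, dim) foreground tokens from the upstream filtering stage.
        with torch.no_grad():
            # Singular values could drive an adaptive K here; kept fixed in this sketch.
            _ = torch.linalg.svdvals(x_fg)
        q = self.queries.unsqueeze(0).expand(x_fg.size(0), -1, -1)
        proxy, _ = self.attn(q, x_fg, x_fg)          # (batch, K, dim) proxy tokens
        return proxy

# Toy forward pass: 500 foreground tokens compressed to 78 proxy tokens.
proxy = IBTokenCompressor(dim=128)(torch.randn(2, 500, 128))
print(proxy.shape)                                   # torch.Size([2, 78, 128])
```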

4. Hyperparameters and Adaptive Thresholding

Key IB-DTC performance determinants include:

  • Compression ratio $\beta = K/N$: controls the granularity of retained tokens. In CompTrack, $\beta \approx 0.16$ ($N \sim 500$, $K \sim 78$).
  • Energy-retention threshold $\tau$: governs SVD truncation; typical values are $0.95$–$0.999$. Higher $\tau$ yields larger $K$ and improved precision at increased cost (Zhou et al., 19 Nov 2025).
  • Selection fraction $\tau$ in QUITO-X: the top-$\tau$ fraction (e.g., $0.5$) of tokens is preserved based on smoothed cross-attention scores (Wang et al., 20 Aug 2024).
  • Pruning threshold in Infor-Coef: thresholding $\pi_l$ at inference yields binary token masks for dynamic sequence shortening, selectable via score cutoff or rank (top-$k$) (Tan, 2023).
  • IB-loss weights ($\gamma$, $\beta$): must be tuned per task for the best speed-accuracy tradeoff.
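For concreteness, these knobs could be gathered into a single configuration object; a minimal sketch with illustrative defaults drawn from the values above (the names and the IB-loss weights are placeholders, not reported settings):

```python
from dataclasses import dataclass

@dataclass
class IBDTCConfig:
    # SVD-based compression (CompTrack-style)
    energy_threshold_tau: float = 0.99    # fraction of spectral energy to retain
    compression_ratio_beta: float = 0.16  # K / N, e.g. K ~ 78 out of N ~ 500 tokens
    # Cross-attention selection (QUITO-X-style)
    keep_fraction: float = 0.5            # top fraction of smoothed-score tokens kept
    # Stochastic downsampling (Infor-Coef-style)
    gumbel_temperature: float = 1.0       # Gumbel-Softmax temperature (assumed value)
    prune_threshold: float = 0.5          # keep-probability cutoff at inference
    # IB-loss weights; tuned per task for the speed-accuracy tradeoff
    ib_beta: float = 1.0                  # placeholder default, not a reported setting
    ib_gamma: float = 1.0                 # placeholder default, not a reported setting
```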

5. Empirical Performance and Benchmarking

Comprehensive experiments validate IB-DTC’s substantial efficiency and competitive accuracy.

  • CompTrack (3D SOT): On nuScenes: Baseline (no SFP/IB-DTC) achieves 48 FPS, Success/Precision 59.38%/71.63%; IB-DTC alone boosts to 75 FPS with no significant drop in accuracy; full pipeline (SFP+IB-DTC) realizes 90 FPS, 61.04%/73.68%. Comparable improvements on KITTI and Waymo (Zhou et al., 19 Nov 2025).
  • QUITO-X (QA/Llama3): At $\tau = 0.5$, QUITO-X yields 78–90% EM vs. 50–55% from LLMLingua2, a gain of roughly 25 percentage points; on long contexts, QUITO-X outperforms full-context baselines due to noise removal (Wang et al., 20 Aug 2024).
  • Infor-Coef (GLUE tasks): Compound static and dynamic pruning achieves up to 16–18× FLOPs speedup while incurring <8% accuracy loss relative to BERT-base; the method operates best in the 2–4× speedup regime and, at extreme sparsity, still outperforms TinyBERT on key metrics (Tan, 2023).

6. Advantages, Limitations, and Future Directions

IB-guided token compression yields efficiency gains, interpretable selection, and—in context-dependent regimes—enhanced performance from noise pruning. Notable advantages include up to 50% reduction in memory/inference cost, the ability for compressed prompts to sometimes outperform full-context runs, and superior scaling for real-time tasks (e.g., 90 FPS SOT).

Limitations include sensitivity to hyperparameter tuning, the challenge of chunking in long-text scenarios (which may disrupt dependencies), and the gap between FLOPs savings and realized latency on hardware. Infor-Coef’s Gumbel-Softmax mechanism may be sensitive to the $\beta$ and $\gamma$ loss weights. End-to-end IB compressors, joint optimization of depth/width/length, and hardware-aware implementations are prospective research avenues. There are no reported controversies over the basic validity of IB-DTC, but ongoing work centers on learning tradeoff parameters and extending the paradigm to generative and multimodal domains (Tan, 2023, Wang et al., 20 Aug 2024, Zhou et al., 19 Nov 2025).

7. Relationship to Prior Compression Methodologies

IB-DTC generalizes static pruning, self-information, and PPL-based token compression by grounding selection in predictive mutual information. Baselines based on self-information or perplexity do not align token selection with the downstream prediction target. Cross-attention-based IB surrogates outperform these legacy schemes by maximizing $I(\bar{X}; Y)$, thereby capturing query-context interactions. Joint static-dynamic IB pruning surpasses token-fraction losses (“skim loss”) and demonstrates a complementary effect in compound speedups (Tan, 2023, Wang et al., 20 Aug 2024).

IB-DTC thus represents a unifying and empirically validated framework for dynamic token compression, applicable to both spatial and sequential data regimes across modern deep learning.
