
IB-DTC: Info Bottleneck Token Compression

Updated 26 November 2025
  • IB-DTC is a framework that applies the Information Bottleneck principle to dynamically compress feature tokens, preserving task-critical signals while eliminating redundant data.
  • It integrates low-rank SVD, cross-attention-based selection, and stochastic downsampling to efficiently reduce token count in applications like NLP and 3D tracking.
  • Empirical results show IB-DTC enhances inference speed and reduces computational cost with minimal accuracy loss, making it valuable for real-time and resource-constrained scenarios.

Information Bottleneck-guided Dynamic Token Compression (IB-DTC) describes a principled framework for adaptively compressing feature tokens in deep learning models, optimizing for high predictive information while minimizing redundant representation. The approach is rooted in classical Information Bottleneck (IB) theory and has been operationalized in diverse domains, including 3D point cloud tracking, transformer-based NLP, and LLM context compression. Characteristic IB-DTC pipelines employ a combination of mutual information estimation, low-rank approximations, cross-attention score modeling, and stochastic selection mechanisms to generate compact token subsets (“proxy tokens”) that preserve task-critical signals.

1. Theoretical Foundations: Information Bottleneck Principle

The IB-DTC methodology formalizes token compression as an IB optimization. Given input tokens $X$ and output targets $y$, the key objective is to find a compressed representation $Z$ that retains maximal information relevant to predicting $y$, subject to strong compression constraints. This is framed by:

$$\min_{g} I(X; Z) \quad \text{subject to} \quad I(Z; y) \geq I_0$$

where $I(\cdot;\cdot)$ denotes mutual information and $g$ is a mapping from full to compressed tokens. Direct mutual information optimization is intractable in high-dimensional embedding spaces. A common surrogate is optimal low-rank approximation: for token matrices $X \in \mathbb{R}^{N \times C}$, the best rank-$K$ approximation discards redundant (low-variance) directions, aligning with the IB’s compression-fidelity tradeoff (Zhou et al., 19 Nov 2025).
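In practice, the constrained formulation above is typically relaxed into an unconstrained Lagrangian (a standard IB restatement, included here for context rather than taken from the cited papers):

$$\min_{g}\; I(X; Z) - \beta\, I(Z; y)$$

where the multiplier $\beta > 0$ sets the compression-fidelity tradeoff; the query-conditioned NLP objective below is an instance of this relaxed form.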

In NLP, the IB objective can condition on a query $Q$, phrased as:

$$L_{IB} = I(Z; X \mid Q) - \beta\, I(Z; Y \mid Q)$$

for compressed context tokens $Z$, output $Y$, and tradeoff coefficient $\beta$ (Wang et al., 20 Aug 2024).

2. Algorithmic Implementations and Core Mechanisms

Dynamic token compression under IB is instantiated via several domain-specific algorithms.

Low-Rank SVD-Based Compression

In 3D point cloud tracking (CompTrack), foreground tokens $X_{fg}$ are compressed as follows:

  • Compute $X_{fg}'$ with positional encoding.
  • Apply SVD: $X_{fg}' = U \Sigma V^T$.
  • Select the minimal $K$ such that $\sum_{i=1}^K \sigma_i^2 \geq \tau \cdot \text{total energy}$ (typically $\tau = 0.99$).
  • Retain the $K$ dominant directions as a proxy token basis $Q_{SVD}$.

This truncated SVD provides near-lossless compression; singular values in sparse data decay rapidly, so $K \ll N$ suffices for empirical fidelity (Zhou et al., 19 Nov 2025).
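A minimal NumPy sketch of this energy-thresholded truncation (illustrative only; the particular construction of proxy tokens from the top-$K$ directions is an assumption, not the CompTrack implementation):

```python
import numpy as np

def svd_proxy_basis(X_fg: np.ndarray, tau: float = 0.99) -> np.ndarray:
    """Compress an (N, C) token matrix into K proxy tokens of dimension C.

    K is the smallest rank whose singular values retain a tau fraction of the
    total spectral energy (sum of squared singular values).
    """
    U, S, Vt = np.linalg.svd(X_fg, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    K = int(np.searchsorted(energy, tau)) + 1       # minimal K with >= tau energy
    # One plausible proxy construction: the K dominant right-singular directions,
    # scaled by their singular values, giving a (K, C) proxy token basis.
    return S[:K, None] * Vt[:K, :]

# Toy example: ~500 sparse foreground tokens whose spectrum decays quickly.
rng = np.random.default_rng(0)
X_fg = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 128)) \
       + 0.01 * rng.normal(size=(500, 128))
proxy = svd_proxy_basis(X_fg, tau=0.99)
print(proxy.shape)                                  # (K, 128) with K << 500
```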

Cross-Attention-Based Selection

QUITO-X models mutual information between tokens and the output via cross-attention scores in encoder–decoder transformers:

  • Forward pass: $h_{1:T} = f_{enc}(X, Q)$.
  • Cross-attention from the decoder start token yields attention scores $a_t$.
  • Aggregated and smoothed scores $s(w)$ identify critical words.
  • The top-$\tau$ fraction of high-scoring tokens is retained to form the compressed context (Wang et al., 20 Aug 2024).
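A compact sketch of the scoring-and-selection step under these definitions (the head aggregation, smoothing window, and attention extraction are illustrative assumptions, not the QUITO-X implementation):

```python
import torch
import torch.nn.functional as F

def select_tokens_by_cross_attention(attn: torch.Tensor, tokens: list,
                                     keep_frac: float = 0.5, window: int = 3):
    """Rank context tokens by decoder-to-encoder cross-attention, keep the top fraction.

    attn: (num_heads, T) cross-attention weights from the decoder start token
          over T encoder positions (assumed to be extracted from the model).
    """
    scores = attn.mean(dim=0)                        # aggregate over heads -> (T,)
    # Moving-average smoothing so isolated spikes do not dominate the ranking.
    kernel = torch.ones(1, 1, window) / window
    smoothed = F.conv1d(scores.view(1, 1, -1), kernel,
                        padding=window // 2).view(-1)[: len(tokens)]
    k = max(1, int(keep_frac * len(tokens)))
    keep_idx = torch.topk(smoothed, k).indices.sort().values   # keep original order
    return [tokens[i] for i in keep_idx.tolist()]

# Toy usage with random attention weights over an 8-token context.
tokens = "the answer to the question is forty two".split()
attn = torch.rand(12, len(tokens))                   # 12 heads, 8 positions
print(select_tokens_by_cross_attention(attn, tokens, keep_frac=0.5))
```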

Stochastic Token Downsampling and Pruning

In Infor-Coef, token downsampling is executed dynamically via:

  • MLP “Samplers” generate per-token pruning probabilities $\pi_l$.
  • Gumbel-Softmax reparametrization samples hard binary masks $z_l^i \sim \text{Bernoulli}(\pi_l^i)$.
  • Masked tokens are blocked from attention and computation; at inference these tokens are physically dropped, yielding sublinear FLOPs scaling (Tan, 2023).
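A minimal PyTorch sketch of this mask-sampling step (the sampler architecture, temperature, and inference threshold are illustrative assumptions, not the Infor-Coef code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSampler(nn.Module):
    """Per-token keep/drop sampler using a straight-through Gumbel-Softmax mask."""

    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        # Small MLP producing logits for the two outcomes [drop, keep].
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                 nn.Linear(dim // 2, 2))
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> binary keep-mask of shape (batch, seq_len, 1)
        logits = self.mlp(x)
        if self.training:
            # Hard one-hot sample with straight-through gradients.
            mask = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]
        else:
            # At inference, threshold the keep probability; kept tokens can then be
            # physically gathered so dropped ones cost no computation.
            mask = (logits.softmax(dim=-1)[..., 1:] > 0.5).float()
        return mask

# Toy usage: mask 4 sequences of 16 tokens with 64-dim embeddings.
sampler = TokenSampler(dim=64).eval()
mask = sampler(torch.randn(4, 16, 64))
print(mask.shape, int(mask.sum()))       # (4, 16, 1) and the number of kept tokens
```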

3. Integration in Modern Learning Pipelines

IB-DTC modules are incorporated at various junctures.

  • CompTrack (3D tracking): IB-DTC follows a Spatial Foreground Predictor (SFP) that filters spatial background noise. The compressed proxy tokens $X_p$ feed directly into the prediction head for target offset, orientation, and class. Gradients bypass the SVD, flowing solely through learnable query bases and cross-attention weights (see the sketch after this list).
  • QUITO-X (LLM context): Cross-attention token scores guide context selection for in-context learning; the compressed context is passed to downstream LLMs for QA.
  • Infor-Coef (Transformer models): Static mask pruning and dynamic downsampling are jointly applied at each layer; the process is supervised by IB loss and cross-entropy, leading to lean model architectures and reduced inference cost.
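As a schematic of the first integration pattern, a compression module can sit between a foreground filter and the prediction head with the SVD step kept outside the gradient path (module names, shapes, and the fixed proxy count are hypothetical, not the CompTrack architecture):

```python
import torch
import torch.nn as nn

class IBTokenCompressor(nn.Module):
    """Compress N foreground tokens into K proxy tokens via cross-attention.

    Any SVD-derived statistic is computed under no_grad, so gradients flow only
    through the learnable queries and attention projections, as described above.
    """

    def __init__(self, dim: int, num_proxy: int = 78, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_proxy, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_fg: torch.Tensor) -> torch.Tensor:
        # x_fg: (batch, N, dim) foreground tokens from the upstream filtering stage.
        with torch.no_grad():
            # Singular values could drive an adaptive K here; kept fixed in this sketch.
            _ = torch.linalg.svdvals(x_fg)
        q = self.queries.unsqueeze(0).expand(x_fg.size(0), -1, -1)
        proxy, _ = self.attn(q, x_fg, x_fg)          # (batch, K, dim) proxy tokens
        return proxy

# Toy forward pass: 500 foreground tokens compressed to 78 proxy tokens.
proxy = IBTokenCompressor(dim=128)(torch.randn(2, 500, 128))
print(proxy.shape)                                   # torch.Size([2, 78, 128])
```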

4. Hyperparameters and Adaptive Thresholding

Key IB-DTC performance determinants include:

  • Compression ratio $\beta = K/N$: controls the granularity of retained tokens. In CompTrack, $\beta \approx 0.16$ ($N \sim 500$, $K \sim 78$).
  • Energy-retention threshold $\tau$: governs SVD truncation; typical values are $0.95$–$0.999$. Higher $\tau$ yields larger $K$ and improved precision at increased cost (Zhou et al., 19 Nov 2025).
  • Selection fraction $\tau$ in QUITO-X: the top-$\tau$ fraction (e.g., $0.5$) of tokens is preserved based on smoothed cross-attention scores (Wang et al., 20 Aug 2024).
  • Pruning threshold in Infor-Coef: thresholding $\pi_l$ at inference yields binary token masks for dynamic sequence shortening, selectable via score cutoff or rank (top-$k$) (Tan, 2023).
  • IB-loss weights ($\gamma$, $\beta$): must be tuned per task for the best speed-accuracy tradeoff.
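For concreteness, these knobs could be gathered into a single configuration object; a minimal sketch with illustrative defaults drawn from the values above (the names and the IB-loss weights are placeholders, not reported settings):

```python
from dataclasses import dataclass

@dataclass
class IBDTCConfig:
    # SVD-based compression (CompTrack-style)
    energy_threshold_tau: float = 0.99    # fraction of spectral energy to retain
    compression_ratio_beta: float = 0.16  # K / N, e.g. K ~ 78 out of N ~ 500 tokens
    # Cross-attention selection (QUITO-X-style)
    keep_fraction: float = 0.5            # top fraction of smoothed-score tokens kept
    # Stochastic downsampling (Infor-Coef-style)
    gumbel_temperature: float = 1.0       # Gumbel-Softmax temperature (assumed value)
    prune_threshold: float = 0.5          # keep-probability cutoff at inference
    # IB-loss weights; tuned per task for the speed-accuracy tradeoff
    ib_beta: float = 1.0                  # placeholder default, not a reported setting
    ib_gamma: float = 1.0                 # placeholder default, not a reported setting
```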

5. Empirical Performance and Benchmarking

Comprehensive experiments validate IB-DTC’s substantial efficiency and competitive accuracy.

  • CompTrack (3D SOT): On nuScenes: Baseline (no SFP/IB-DTC) achieves 48 FPS, Success/Precision 59.38%/71.63%; IB-DTC alone boosts to 75 FPS with no significant drop in accuracy; full pipeline (SFP+IB-DTC) realizes 90 FPS, 61.04%/73.68%. Comparable improvements on KITTI and Waymo (Zhou et al., 19 Nov 2025).
  • QUITO-X (QA/Llama3): At $\tau = 0.5$, QUITO-X yields 78–90% EM vs. 50–55% from LLMLingua2, a gain of roughly 25 percentage points; on long contexts, QUITO-X outperforms full-context baselines due to noise removal (Wang et al., 20 Aug 2024).
  • Infor-Coef (GLUE tasks): Compound static and dynamic pruning achieves up to 16–18× FLOPs speedup while incurring <8% accuracy loss relative to BERT-base; the method operates best in the 2–4× speedup regime and, at extreme sparsity, still outperforms TinyBERT on key metrics (Tan, 2023).

6. Advantages, Limitations, and Future Directions

IB-guided token compression yields efficiency gains, interpretable selection, and—in context-dependent regimes—enhanced performance from noise pruning. Notable advantages include up to 50% reduction in memory/inference cost, the ability for compressed prompts to sometimes outperform full-context runs, and superior scaling for real-time tasks (e.g., 90 FPS SOT).

Limitations include sensitivity to hyperparameter tuning, the challenge of chunking in long-text scenarios (which may disrupt dependencies), and the gap between FLOPs savings and realized latency on hardware. Infor-Coef’s Gumbel-Softmax mechanism may be sensitive to the $\beta$ and $\gamma$ loss weights. End-to-end IB compressors, joint optimization of depth/width/length, and hardware-aware implementations are prospective research avenues. There are no reported controversies over the basic validity of IB-DTC, but ongoing work centers on learning tradeoff parameters and extending the paradigm to generative and multimodal domains (Tan, 2023, Wang et al., 20 Aug 2024, Zhou et al., 19 Nov 2025).

7. Relationship to Prior Compression Methodologies

IB-DTC generalizes static pruning, self-information, and PPL-based token compression by grounding selection in predictive mutual information. Baselines based on self-information or perplexity do not align token selection with the downstream prediction target. Cross-attention-based IB surrogates outperform these legacy schemes by maximizing $I(\bar{X}; Y)$, thereby capturing query-context interactions. Joint static-dynamic IB pruning surpasses token-fraction losses (“skim loss”) and demonstrates a complementary effect in compound speedups (Tan, 2023, Wang et al., 20 Aug 2024).

IB-DTC thus represents a unifying and empirically validated framework for dynamic token compression, applicable to both spatial and sequential data regimes across modern deep learning.
