Windowed Token Compression in Transformers
- Windowed token compression is an architectural technique that partitions tokens into fixed-size windows and aggregates them to reduce sequence lengths in transformer models.
- It employs methods such as average pooling and content-adaptive weighting to summarize local redundancies while preserving task-relevant features.
- This approach is essential for scaling high-resolution vision, language, and multimodal models, offering notable speedups and memory savings in practical applications.
Windowed token compression is an architectural technique designed to reduce the token sequence length in transformer-based models by aggregating local groups of tokens within fixed-size, non-overlapping spatial and/or temporal windows. Originally motivated by the need to address the prohibitive computational and memory costs of processing high-resolution or long-context inputs in vision, language, and multimodal LLMs, this approach retains task-relevant information while discarding or summarizing local redundancy. Windowed token compression has emerged as a key mechanism in efficient native-resolution visual encoding (Sun et al., 26 Nov 2025), layer-wise visual token compression (Liu et al., 3 Jul 2025), adaptive multimodal compression (Tao et al., 18 Nov 2025), and dynamic feature map reduction strategies (Wang et al., 11 Dec 2024).
1. Formal Definition and Compression Schemes
Windowed token compression operates by (i) partitioning the input sequence—typically, a 2D or 3D tensor of patchwise features—into non-overlapping blocks of size $w \times w$ (or $w \times w \times t$ for spatio-temporal inputs), and (ii) replacing each block of $w^2$ (or $w^2 t$) tokens with a single, compressed token. The standard instantiation is average pooling, but modern approaches augment or replace pooling with learned, content-adaptive weighting or spatially-aware merging operations.
In "LLaVA-UHD v3," windowed token compression (WTC) is formally defined for visual tokens at layer by reshaping to a 2D feature map, partitioning into windows, and computing either:
- Average pooling:
- Content-adaptive pooling: , with assigned via an MLP on followed by softmax normalization
This process reduces the total token count by a factor of per stage and is recursively composable across multiple transformer layers (i.e., hierarchical compression) (Sun et al., 26 Nov 2025).
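A minimal PyTorch sketch of both pooling variants, assuming a square token grid and a window size of 2; the module and argument names are illustrative, not the authors' reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedTokenCompression(nn.Module):
    """Aggregate each w x w window of a 2D token grid into one token,
    via uniform averaging or content-adaptive (softmax-weighted) pooling."""
    def __init__(self, dim: int, window: int = 2, adaptive: bool = True):
        super().__init__()
        self.window = window
        self.adaptive = adaptive
        # Small MLP scoring each token; softmax over the window yields alpha_i.
        self.score = (nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
                      if adaptive else None)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape                                   # x: (B, H*W, C)
        w = self.window
        x = x.view(B, H // w, w, W // w, w, C)              # carve out w x w windows
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, w * w, C)
        if self.adaptive:
            alpha = F.softmax(self.score(x), dim=2)         # content-adaptive weights
            return (alpha * x).sum(dim=2)                   # (B, N / w^2, C)
        return x.mean(dim=2)                                # uniform average pooling
```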
Alternative window-based approaches include pixel-shuffle "space-to-channel" rearrangement combined with an adaptive MLP and residual averaging in LaCo (Liu et al., 3 Jul 2025), and spatial average pooling with dynamically selected stride based on local feature map statistics in DFMR (LLaVA-Zip) (Wang et al., 11 Dec 2024).
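The space-to-channel variant can be sketched as follows, in the spirit of LaCo; the layer widths and the exact residual form are assumptions:

```python
import torch
import torch.nn as nn

class SpaceToChannelCompression(nn.Module):
    """Fold each w x w window into the channel axis (pixel unshuffle),
    project back to model width with an MLP, and add an average-pooling
    residual so the learned merge degrades gracefully."""
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.proj = nn.Sequential(nn.Linear(dim * window * window, dim),
                                  nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape
        w = self.window
        grid = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        merged = self.proj(grid.reshape(B, -1, w * w * C))    # space-to-channel + MLP
        residual = grid.reshape(B, -1, w * w, C).mean(dim=2)  # non-parametric residual
        return merged + residual                              # (B, N / w^2, C)
```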
2. Architectures and Contexts of Application
Windowed token compression modules have been integrated into a range of transformer-based backbones for various modalities:
- Vision Transformers (ViT): Integrated at arbitrary layers as in LLaVA-UHD v3, supporting progressive reduction for high-resolution input (Sun et al., 26 Nov 2025).
- Multimodal LLMs: Incorporated post-patch embedding or within vision encoders (LaCo, DFMR) to enable scalability for multi-image/video prompts and efficient cross-modal modeling (Liu et al., 3 Jul 2025, Wang et al., 11 Dec 2024, Tao et al., 18 Nov 2025).
- Omnimodal (e.g., Audio-Visual): Time-windowed compression in OmniZip computes window-wise audio salience to dynamically modulate video token pruning on a per-window basis, and applies interleaved spatio-temporal merge/clustering operations within each window (Tao et al., 18 Nov 2025).
Architecturally, these modules exploit both spatial and temporal coherence to achieve higher compression in low-information windows, while reserving token capacity for challenging or information-dense regions.
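To make the audio-guided mechanism concrete, the following schematic maps per-window audio salience to video-token keep ratios; the salience normalization, the norm-based token scores, and the ratio bounds `r_min`/`r_max` are illustrative assumptions rather than OmniZip's exact formulation:

```python
import torch

def audio_guided_keep_ratios(audio_sal: torch.Tensor,
                             r_min: float = 0.1, r_max: float = 0.5) -> torch.Tensor:
    """Map per-time-window audio salience scores (shape: [n_windows]) to
    video-token keep ratios: acoustically salient windows retain more tokens."""
    s = (audio_sal - audio_sal.min()) / (audio_sal.max() - audio_sal.min() + 1e-6)
    return r_min + s * (r_max - r_min)

def prune_window(video_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k video tokens in one window, scored here by L2 norm
    (a simple stand-in for a learned or attention-based salience score)."""
    k = max(1, int(video_tokens.shape[0] * keep_ratio))
    idx = video_tokens.norm(dim=-1).topk(k).indices.sort().values  # keep temporal order
    return video_tokens[idx]
```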
3. Algorithmic Procedures and Integration
The canonical workflow for windowed token compression proceeds as follows (cf. (Sun et al., 26 Nov 2025, Liu et al., 3 Jul 2025); a minimal end-to-end sketch follows the list):
- Partition: At each designated transformer layer, reshape tokens into a 2D or 3D grid and divide into non-overlapping windows of specified size (e.g., $w \times w$ spatial or $w \times w \times t$ spatio-temporal blocks).
- Aggregate: Within each window, summarize the $w^2$ (or $w^2 t$) tokens using:
- Uniform average pooling
- Content-adaptive weighting (via small MLPs over local tokens and context)
- Pixel-shuffle and MLP channel reduction with non-parametric residual averaging (LaCo)
- Dynamic stride selection for average pooling (DFMR), where the stride is chosen so that the within-window standard deviation remains below a threshold $\tau$
- Propagate: The set of windowed-compressed tokens forms the input to subsequent layers; the process is recursively applied as specified (e.g., multi-stage hierarchical compression).
- Join with Textual Tokens: In multimodal pipelines, compressed visual tokens are concatenated with text tokens before transformer decoding (Sun et al., 26 Nov 2025, Wang et al., 11 Dec 2024).
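The sketch below wires these steps into a generic ViT-style encoder, reusing the `WindowedTokenCompression` module from Section 1; the depth, compression layers, and shared compressor weights are illustrative choices:

```python
import torch.nn as nn

class HierarchicalCompressionEncoder(nn.Module):
    """Generic ViT-style encoder that applies windowed compression (w = 2)
    at designated layers, quartering the token count at each stage."""
    def __init__(self, dim: int = 768, depth: int = 12,
                 compress_at=(4, 8), grid: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.compress_at = set(compress_at)
        self.wtc = WindowedTokenCompression(dim, window=2)  # from the earlier sketch
        self.grid = grid

    def forward(self, x):                  # x: (B, grid*grid, dim)
        H = W = self.grid
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.compress_at:      # multi-stage hierarchical compression
                x = self.wtc(x, H, W)
                H, W = H // 2, W // 2      # token grid shrinks by 2 per side
        return x  # compressed visual tokens, ready to concatenate with text tokens
```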
In video and omnimodal settings, time-window segmentation precedes token aggregation, and modality-specific guides (e.g., audio retention scores) determine per-window pruning rates (Tao et al., 18 Nov 2025).
4. Theoretical and Computational Properties
Windowed token compression achieves a provable $w^2$-fold token reduction per applied stage, with analytical complexity reductions in both time and memory. For instance, in a ViT with WTC modules at layers $\{l_1, \dots, l_m\}$, the token count at layer $l$ is $N_l = N_0 / w^{2c(l)}$, where $c(l)$ is the count of compressors up to layer $l$. The aggregate attention cost becomes $\sum_l O(N_l^2 d)$, dropping sharply after each compression stage (Sun et al., 26 Nov 2025).
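As a worked example under the notation above (w = 2, $N_0 = 1024$ tokens, compression before layers 4 and 8 of a 12-layer encoder; all values illustrative):

```python
# N_l = N_0 / w^(2*c(l)), with c(l) = number of compressors applied before layer l.
N0, w, depth, stages = 1024, 2, 12, {4, 8}
cost, N = 0, N0
for l in range(depth):
    if l in stages:
        N //= w ** 2               # 1024 -> 256 -> 64 tokens
    cost += N ** 2                 # self-attention is O(N_l^2) per layer
print(cost / (depth * N0 ** 2))    # ~0.36 of the uncompressed attention cost
```

The resulting roughly 2.8× reduction in aggregate attention cost for this toy configuration is of the same order as the two-stage speedup reported below.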
Empirical results demonstrate that integrating WTC or similar windowed modules early in processing yields the largest computational savings, with >2.4× reduction in time-to-first-token (TTFT) for ViT-UHD and ~2.8× speedup for two-stage WTC (Sun et al., 26 Nov 2025). However, early-stage compression risks compound information loss; ablations reveal a Pareto trade-off between speed and task accuracy, with later-stage or adaptive-window schemes mitigating quality degradation (Liu et al., 3 Jul 2025).
5. Adaptive and Multimodal Extensions
Recent systems extend windowed token compression to new modalities and dynamic configurations:
- Content-Adaptive Windowing: DFMR computes the local patch standard deviation for each window to dynamically select the pooling stride, compressing spatially uniform regions more aggressively (Wang et al., 11 Dec 2024); see the sketch after this list.
- Audio-Guided Video Pruning: OmniZip determines window-wise video prune ratios based on audio token salience, followed by interleaved temporal merge (similarity-based) and spatial clustering (density peak, kNN) within each time window (Tao et al., 18 Nov 2025). This approach yields up to 3.4× inference speedup and significant memory savings.
- Progressive Compression: LLaVA-UHD v3 hierarchically combines refined patch embedding (finer initial tokenization) and windowed token compression to flexibly trade off between granularity and efficiency, without substantial architectural changes to the core ViT (Sun et al., 26 Nov 2025).
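A sketch of DFMR-style dynamic stride selection, where flatter feature maps (low within-window standard deviation) are pooled more aggressively; the threshold $\tau$, the candidate strides, and the channel-averaged statistic are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dynamic_stride_pool(fmap: torch.Tensor, tau: float = 0.1,
                        strides=(4, 2)) -> torch.Tensor:
    """fmap: (C, H, W) feature map. Pick the largest stride s whose s x s
    windows are 'flat' (mean within-window std below tau), then average-pool."""
    for s in strides:
        mean = F.avg_pool2d(fmap, s)
        sq_mean = F.avg_pool2d(fmap ** 2, s)
        std = (sq_mean - mean ** 2).clamp(min=0).sqrt().mean()
        if std < tau:                      # uniform content: compress aggressively
            return F.avg_pool2d(fmap, s)
    return fmap                            # information-dense: keep full resolution
```

For example, `dynamic_stride_pool(torch.randn(768, 24, 24))` would keep full resolution for noisy input but pool a near-constant map with stride 4.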
6. Empirical Results and Comparative Analyses
Windowed token compression has been benchmarked across large-scale vision-language and omnimodal datasets. Key observations include:
| System | Compression Method | TTFT/Speedup | Avg. Accuracy (%) | Notable Findings |
|---|---|---|---|---|
| ViT-UHD | 3× WTC + RPE | 1.5× faster | 63.0 (↑0.9 vs. baseline) | Joint WTC and refined patch embedding sustains accuracy with aggressive reduction (Sun et al., 26 Nov 2025) |
| LLaVA-UHD v3 | 3× WTC, global encoding | 1.9× faster (vs. Qwen2-VL) | Comparable | Reduces TTFT from 290 ms to 153.8 ms, maintains benchmark accuracy (Sun et al., 26 Nov 2025) |
| LaCo | Pixel-shuffle, residual | >20% faster train/infer | +23% (video) | Internal layer compression achieves higher efficiency and accuracy than post-encoder shuffle (Liu et al., 3 Jul 2025) |
| LLaVA-Zip (DFMR) | Dynamic stride pooling | n/a (tokens: 576→64) | Up to +4 over baseline | Dynamic windowing outperforms static stride, especially at low token budgets (Wang et al., 11 Dec 2024) |
| OmniZip | Audio-guided ISTC | 3.42× speedup | Maintained | Multimodal windows, training-free, supports both AV and video-only tasks (Tao et al., 18 Nov 2025) |
This empirical landscape demonstrates the versatility and effectiveness of windowed compression across modalities and architectural variants. Content-adaptive pooling and multimodal guidance consistently outperform naive or uniform-window schemes at aggressive compression ratios.
7. Limitations and Prospective Developments
Challenges of windowed token compression include:
- Rigidity of window partitioning: Non-overlapping, uniform windows may misalign with semantic boundaries; dynamic, content-aware windows are a prospective remedy (Wang et al., 11 Dec 2024).
- Static hyperparameters: Manual setting of thresholds (e.g., the standard-deviation threshold $\tau$ in DFMR) or window size may underperform compared to learned or input-dependent adaptation.
- Progressive vs. single-stage integration: Over-compression at early layers can irretrievably lose fine structure. Hierarchical or multi-stage placement offers a continuum between efficiency and fidelity (Liu et al., 3 Jul 2025, Sun et al., 26 Nov 2025).
- Extension beyond spatial domains: Emerging work explores windowed compression in audio, temporal, KV-cache, and joint multimodal spaces, with promising initial results but open questions on scaling and robustness (Tao et al., 18 Nov 2025, Zhang et al., 17 Dec 2024).
- Compatibility with attention mechanisms: Fine-grained attention patterns may be disrupted by fixed-window compression unless position layouts are carefully managed (see EPL for enhanced layouts (Zhao et al., 22 Sep 2024)).
Prospective research directions include non-uniform, content-aware windows, tighter coupling with position encoding strategies, and integration of windowed compression into online and continual learning scenarios. The paradigm is likely to remain central as data modalities and model deployments scale in complexity and scope.