
ClustViT: Efficient Clustering in Vision Transformers

Updated 3 October 2025
  • ClustViT is a vision transformer variant that integrates trainable clustering modules for dynamic token merging and efficient semantic segmentation.
  • The architecture employs a regenerator module that unmerges and refines tokens via MLPs to restore spatial fidelity for dense prediction tasks.
  • Empirical results demonstrate up to 2.18x GFLOPs reduction and 1.64x speedup, making it highly effective for resource-constrained real-world applications.

ClustViT is a family of methodologies and architectures at the intersection of vision transformers (ViTs) and clustering-based mechanisms, designed primarily to address the quadratic complexity of self-attention in ViTs while enabling efficient, robust, and interpretable representations for both dense prediction (e.g., semantic segmentation) and unsupervised learning. These models combine the strengths of transformer-based global modeling with dynamic token reduction and regeneration strategies guided by semantic cues or pseudo-clustering, offering notable advantages for real-world and resource-constrained applications.

1. Clustering-Based Token Merging Architecture

ClustViT introduces a trainable clustering module positioned between transformer layers within the ViT backbone (Montello et al., 2 Oct 2025). This module receives patch tokens (excluding the classification token) and applies a multilayer perceptron (MLP) $\mathcal{C}$, which consists of two linear layers with ReLU activation:

$$L_C = \mathcal{C}(Z_{l-1}) = \mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(Z_{l-1})))$$

$\mathrm{Linear}_1$ maps token feature vectors from $\mathbb{R}^D$ to $\mathbb{R}^H$, and $\mathrm{Linear}_2$ maps from $\mathbb{R}^H$ to $\mathbb{R}^{k+1}$, where $k$ is the number of active clusters and the "+1" category denotes unclustered tokens. The output is a $B \times N \times (k+1)$ tensor, with $B$ the batch size and $N$ the number of input patch tokens.
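The clustering head described above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the formula, not the authors' code; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class ClusteringHead(nn.Module):
    """Two-layer MLP scoring each patch token against k clusters
    plus an 'unclustered' category at index 0 (hypothetical sketch)."""
    def __init__(self, dim: int, hidden: int, k: int):
        super().__init__()
        self.linear1 = nn.Linear(dim, hidden)   # R^D -> R^H
        self.linear2 = nn.Linear(hidden, k + 1) # R^H -> R^{k+1}

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N, D) patch tokens, classification token excluded
        return self.linear2(torch.relu(self.linear1(z)))  # (B, N, k+1)
```

A forward pass on a `(B, N, D)` token batch yields the `(B, N, k+1)` logit tensor from which cluster assignments are taken.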

Cluster assignments $C_{b,n}$ are then obtained via an argmax over the last dimension. Tokens assigned to $1,\dots,k$ are semantically merged; those assigned to $0$ remain unclustered. Merged representatives are computed by mean-aggregation of token features within each cluster:

$$E_{b,k} = \frac{1}{|\{ n' : C_{b,n'} = k \}|} \sum_{n' : C_{b,n'} = k} Z_{l-1,b,n'}$$

This process dynamically reduces the token count as the network progresses, adaptively responding to image complexity.
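The assignment and mean-aggregation steps can be sketched as follows. This is a minimal illustration of the argmax-and-merge logic, assuming per-sample handling; the function name and return layout are hypothetical, and a real implementation would batch and vectorize this.

```python
import torch

def merge_tokens(z: torch.Tensor, logits: torch.Tensor):
    """Assign tokens by argmax over cluster logits and mean-pool each cluster.
    z: (B, N, D) tokens; logits: (B, N, k+1) scores (index 0 = unclustered).
    Returns assignments and, per sample, (kept tokens, cluster representatives)."""
    assign = logits.argmax(dim=-1)  # (B, N)
    k = logits.shape[-1] - 1
    merged = []
    for b in range(z.shape[0]):
        kept = z[b][assign[b] == 0]                 # unclustered tokens pass through
        reps = [z[b][assign[b] == c].mean(dim=0)    # E_{b,c}: mean within cluster c
                for c in range(1, k + 1)
                if (assign[b] == c).any()]          # skip empty clusters
        reps = torch.stack(reps) if reps else z.new_zeros(0, z.shape[-1])
        merged.append((kept, reps))
    return assign, merged
```

Because each image yields its own cluster occupancy, the surviving token count varies per sample, which is exactly the adaptive reduction described above.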

2. Semantic Guidance via Pseudo-Clusters

To ensure that token merging is semantically meaningful, ClustViT incorporates guidance from ground-truth segmentation masks. Masks are partitioned into patches matching the ViT grid (e.g., $16\times16$), assigning each patch the dominant semantic label if all pixels agree, or a "mixed" label otherwise. The $k$ most frequent classes form active pseudo-clusters; all other regions are labeled unclustered ($0$). This schema aligns token merging with major semantic regions and avoids merging across class boundaries, which is crucial for dense prediction tasks.
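The pseudo-cluster target construction can be sketched in NumPy. This is an interpretation of the scheme described above under the stated rules (pure patches keep their class, mixed patches and infrequent classes map to 0); the function name and the tie-breaking for equally frequent classes are assumptions.

```python
import numpy as np

def pseudo_cluster_labels(mask: np.ndarray, patch: int = 16, k: int = 4) -> np.ndarray:
    """Derive per-patch pseudo-cluster targets from a segmentation mask.
    A patch keeps its class only if all pixels agree ('pure'); mixed patches
    and classes outside the k most frequent map to the unclustered label 0."""
    H, W = mask.shape
    gh, gw = H // patch, W // patch
    patches = mask[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(gh, gw, -1)
    pure = (patches == patches[..., :1]).all(axis=-1)
    labels = np.where(pure, patches[..., 0], -1)  # -1 marks mixed patches
    # keep only the k most frequent pure classes as active clusters 1..k
    classes, counts = np.unique(labels[labels >= 0], return_counts=True)
    top = classes[np.argsort(-counts)][:k]
    out = np.zeros_like(labels)
    for i, c in enumerate(top, start=1):
        out[labels == c] = i
    return out  # (gh, gw) grid with values in {0, 1, ..., k}
```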

3. Regenerator Module and Token Restoration

After clustering-induced reduction, the regenerator module restores spatial fidelity for downstream segmentation heads (Montello et al., 2 Oct 2025). The sequence of tokens at the end of the Transformer encoder includes both unclustered patch embeddings and $k$ merged representatives. The regenerator "unmerges" each representative by repeating it across original locations using stored cluster assignments, and refines these via an MLP acting on concatenated representative and residual features:

$$Z_\mathrm{refined} = \mathrm{Linear}_2(\mathrm{GELU}(\mathrm{Linear}_1(\mathrm{concat})))$$

Subsequently, unmerged and refined tokens (plus the classification token when required) reconstruct the full-resolution token grid compatible with off-the-shelf segmentation decoders.
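The unmerge-and-refine step can be sketched as follows. This is a single-sample illustration assuming the pre-merge token features are stored as residuals; the module name, the exact residual source, and the loop structure are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Regenerator(nn.Module):
    """Unmerge cluster representatives to their original positions and
    refine them with an MLP over concatenated representative + residual
    features (hypothetical sketch)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.linear1 = nn.Linear(2 * dim, hidden)
        self.linear2 = nn.Linear(hidden, dim)

    def forward(self, reps, residuals, assign):
        # reps: (k, D) merged representatives
        # residuals: (N, D) stored pre-merge token features
        # assign: (N,) cluster id per original position (0 = unclustered)
        restored = residuals.clone()  # unclustered tokens pass through
        for c in range(1, reps.shape[0] + 1):
            idx = (assign == c).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            rep = reps[c - 1].expand(idx.numel(), -1)       # repeat representative
            cat = torch.cat([rep, residuals[idx]], dim=-1)  # concat features
            restored[idx] = self.linear2(F.gelu(self.linear1(cat)))
        return restored  # (N, D) full-resolution token grid
```

The restored `(N, D)` grid (plus the classification token when required) can then be consumed by an off-the-shelf segmentation decoder.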

4. Empirical Performance and Efficiency

ClustViT demonstrates substantial computational savings and inference speed gains compared to standard ViT architectures. Depending on the chosen $k$ and module placement, the model achieves up to a $2.18\times$ reduction in GFLOPs and a $1.64\times$ increase in inference throughput on datasets such as ADE20K, SUIM, and RumexWeeds (Montello et al., 2 Oct 2025). Segmentation accuracy remains comparable to baseline ViTs due to the careful "unmerging" strategy and semantic alignment. The model excels particularly in scenarios with large homogeneous background regions, where token reduction is maximized.

| Configuration | GFLOPs Reduction | Speedup | Accuracy (Rel. to Baseline) |
|---|---|---|---|
| High $k$ (clustering) | 2.18× | 1.64× | Comparable |
| Low $k$ (clustering) | 1.5× | 1.2× | Comparable |

5. Comparative Analysis with Existing Methods

Unlike structured patch merging techniques such as ToMe (Token Merging) or Content-Aware Token Sharing (CTS), which utilize fixed or heuristic groupings, ClustViT supports unstructured merging based on semantic similarity—learned via the clustering module. Unlike CTS, which depends on a policy network trained separately, ClustViT is trained end-to-end with a composite loss:

$$\mathcal{L} = \mathcal{L}_\mathrm{segm} + \lambda \cdot \mathcal{L}_\mathrm{clust}$$

where $\lambda$ controls clustering regularization. Furthermore, ClustViT is inherently suitable for dense prediction because its regenerator module restores the full-resolution tokens required for pixel-wise outputs.
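The composite loss can be sketched directly from the formula. Note that using cross-entropy for both terms is an assumption for illustration; the paper's exact choice of $\mathcal{L}_\mathrm{clust}$ and the value of $\lambda$ may differ.

```python
import torch
import torch.nn.functional as F

def clustvit_loss(seg_logits, seg_target, clust_logits, clust_target, lam=0.1):
    """L = L_segm + lambda * L_clust (hypothetical sketch).
    seg_logits: (B, C, H, W); seg_target: (B, H, W)
    clust_logits: (B, N, k+1); clust_target: (B, N) pseudo-cluster ids."""
    l_segm = F.cross_entropy(seg_logits, seg_target)
    l_clust = F.cross_entropy(clust_logits.flatten(0, 1), clust_target.flatten())
    return l_segm + lam * l_clust
```

Because both terms are differentiable, the clustering module trains end-to-end with the segmentation head, in contrast to CTS's separately trained policy network.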

6. Practical Significance and Application Scope

ClustViT is especially relevant for applications in robotics, mobile vision, and real-time systems where inference speed, memory efficiency, and scene understanding are paramount. Its ability to adapt token count dynamically enables deployment on platforms with stringent latency constraints. The model's flexibility in configuring $k$ and its clustering semantics supports adaptation to a wide spectrum of scene complexities.

While ClustViT is tailored for semantic segmentation, the architectural principles extend naturally to instance segmentation, panoptic segmentation, and potentially to video segmentation workflows, contingent upon the integration of temporal pseudo-clustering.

7. Extensions and Research Directions

Further research may explore optimization of clustering regularization, dynamic adjustment of $k$, and improved regenerator architectures to minimize accuracy–efficiency trade-offs. Potential extensions of ClustViT include its application to instance-level dense prediction or the fusion with graph-based global aggregation as exemplified by ClusterViG (Parikh et al., 18 Jan 2025). Integration of external semantic priors (e.g., attention maps, unsupervised region discovery) is a plausible direction.

This suggests that ClustViT's clustering-guided token compression and regeneration form a scalable, semantically robust approach to transformer-based dense prediction, promising improved practicality in domains typified by computational constraints and variable scene complexity.
