ClustViT: Efficient Clustering in Vision Transformers
- ClustViT is a vision transformer variant that integrates trainable clustering modules for dynamic token merging and efficient semantic segmentation.
- The architecture employs a regenerator module that unmerges and refines tokens via MLPs to restore spatial fidelity for dense prediction tasks.
- Empirical results demonstrate up to 2.18x GFLOPs reduction and 1.64x speedup, making it highly effective for resource-constrained real-world applications.
ClustViT is a family of methodologies and architectures at the intersection of vision transformers (ViT) and clustering-based mechanisms, designed primarily to address the quadratic complexity of self-attention in ViTs while enabling efficient, robust, and interpretable representations for both dense prediction (e.g., semantic segmentation) and unsupervised learning. These models combine the strengths of transformer-based global modeling with dynamic token reduction and regeneration strategies guided by semantic cues or pseudo-clustering, offering notable advantages for real-world and resource-constrained applications.
1. Clustering-Based Token Merging Architecture
ClustViT introduces a trainable clustering module positioned between transformer layers within the ViT backbone (Montello et al., 2 Oct 2025). This module receives the patch tokens (excluding the classification token) and applies a multilayer perceptron (MLP) consisting of two linear layers with a ReLU activation:
the first layer maps token feature vectors from the embedding dimension $d$ to a hidden dimension $d_h$, and the second maps from $d_h$ to $K+1$, where $K$ is the number of active clusters and the "+1" category denotes unclustered tokens. The output is a $B \times N \times (K+1)$ tensor, with $B$ the batch size and $N$ the number of input patch tokens.
Cluster assignments are then obtained via an argmax over the last dimension. Tokens assigned to a cluster index $k \in \{1, \dots, K\}$ are semantically merged; those assigned to $0$ remain unclustered. Merged representatives are computed by mean-aggregation of token features within each cluster:

$$\hat{t}_k = \frac{1}{|C_k|} \sum_{i \in C_k} t_i, \qquad k = 1, \dots, K,$$

where $C_k$ is the set of tokens assigned to cluster $k$. This process dynamically reduces the token count as the network progresses, adaptively responding to image complexity.
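The assignment-and-merge step can be illustrated with a minimal NumPy sketch. Here `cluster_merge` is a hypothetical helper (not the paper's code): it takes precomputed clustering-MLP logits, argmaxes over the last dimension, keeps tokens assigned to $0$, and mean-aggregates each active cluster into one representative.

```python
import numpy as np

def cluster_merge(tokens: np.ndarray, logits: np.ndarray):
    """Sketch of ClustViT-style token merging (illustrative, not the paper's code).

    tokens: (N, d) patch token features (CLS token excluded).
    logits: (N, K+1) clustering-MLP outputs; index 0 means "unclustered".
    Returns the reduced token sequence and the per-token assignments.
    """
    assign = logits.argmax(axis=-1)                # (N,) values in {0, 1, ..., K}
    kept = [tokens[i] for i in range(len(tokens)) if assign[i] == 0]
    merged = []
    for k in range(1, logits.shape[-1]):
        members = tokens[assign == k]
        if len(members):                           # mean-aggregate each active cluster
            merged.append(members.mean(axis=0))
    reduced = np.array(kept + merged)              # kept tokens first, then representatives
    return reduced, assign

# Toy example: 6 tokens, K = 2 clusters.
tokens = np.arange(24, dtype=float).reshape(6, 4)
logits = np.zeros((6, 3))
logits[0, 1] = logits[1, 1] = 1.0                  # tokens 0, 1 -> cluster 1
logits[2, 2] = 1.0                                 # token 2 -> cluster 2
reduced, assign = cluster_merge(tokens, logits)    # 6 tokens shrink to 5
```

Note how the sequence length drops only where clusters actually form, matching the adaptive, content-dependent reduction described above.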
2. Semantic Guidance via Pseudo-Clusters
To ensure that token merging is semantically meaningful, ClustViT incorporates guidance from ground-truth segmentation masks. Masks are partitioned into patches matching the ViT patch grid, and each patch is assigned the dominant semantic label if all of its pixels agree, or a "mixed" label otherwise. The $K$ most frequent classes form the active pseudo-clusters; all other regions are labeled unclustered ($0$). This schema aligns token merging with major semantic regions and avoids merging across class boundaries, which is crucial for dense prediction tasks.
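A small sketch of this labeling scheme, under assumptions: the patch size, the `-1` sentinel for "mixed" patches, and the function name `pseudo_cluster_labels` are illustrative choices, not details from the paper.

```python
import numpy as np

def pseudo_cluster_labels(mask: np.ndarray, patch: int, K: int) -> np.ndarray:
    """Derive per-patch pseudo-cluster targets from a GT segmentation mask (sketch)."""
    H, W = mask.shape
    gh, gw = H // patch, W // patch
    labels = np.full((gh, gw), -1, dtype=int)      # -1 marks "mixed" patches
    for i in range(gh):
        for j in range(gw):
            tile = mask[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            vals = np.unique(tile)
            if len(vals) == 1:                     # all pixels agree -> dominant class
                labels[i, j] = vals[0]
    # The K most frequent pure classes become active pseudo-clusters 1..K.
    pure = labels[labels >= 0]
    classes, counts = np.unique(pure, return_counts=True)
    top = classes[np.argsort(-counts)][:K]
    targets = np.zeros_like(labels)                # 0 = unclustered (incl. mixed patches)
    for k, c in enumerate(top, start=1):
        targets[labels == c] = k
    return targets
```

Because mixed patches fall back to the unclustered label, the targets never encourage merging a token that straddles a class boundary.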
3. Regenerator Module and Token Restoration
After clustering-induced reduction, the regenerator module restores spatial fidelity for downstream segmentation heads (Montello et al., 2 Oct 2025). The sequence of tokens at the end of the Transformer encoder includes both unclustered patch embeddings and merged representatives. The regenerator "unmerges" each representative by repeating it across its original locations using the stored cluster assignments, and refines these via an MLP acting on concatenated representative and residual features:

$$t_i' = \mathrm{MLP}\big([\hat{t}_{k(i)} \,;\, r_i]\big),$$

where $\hat{t}_{k(i)}$ is the representative of token $i$'s cluster and $r_i$ denotes its stored residual features.
Subsequently, unmerged and refined tokens (plus the classification token when required) reconstruct the full-resolution token grid compatible with off-the-shelf segmentation decoders.
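The unmerge-and-refine logic can be sketched as follows. This is an assumed design: storing per-token residuals at merge time and treating `refine` as the regenerator MLP are illustrative choices; the paper's exact features may differ.

```python
import numpy as np

def unmerge(reduced: np.ndarray, assign: np.ndarray,
            residuals: np.ndarray, refine) -> np.ndarray:
    """Restore a full-resolution token grid from a merged sequence (sketch).

    reduced:   kept tokens followed by one representative per active cluster.
    assign:    (N,) original cluster ids (0 = unclustered) saved at merge time.
    residuals: (N, d) per-token residuals stored when merging (assumed here).
    refine:    callable MLP applied to [representative ; residual].
    """
    N, d = residuals.shape
    clusters = sorted(set(assign[assign > 0]))
    n_kept = int((assign == 0).sum())
    reps = {k: reduced[n_kept + i] for i, k in enumerate(clusters)}
    out = np.empty((N, d))
    kept_iter = iter(reduced[:n_kept])
    for i, a in enumerate(assign):
        if a == 0:
            out[i] = next(kept_iter)               # unclustered tokens pass through
        else:                                      # repeat representative, then refine
            out[i] = refine(np.concatenate([reps[a], residuals[i]]))
    return out
```

With residuals defined as the difference between each original token and its representative, an additive `refine` recovers the original tokens exactly; a learned MLP instead trades exact recovery for task-driven refinement.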
4. Empirical Performance and Efficiency
ClustViT demonstrates substantial computational savings and inference speed gains compared to standard ViT architectures. Depending on the chosen $K$ and module placement, the model achieves up to a 2.18x reduction in GFLOPs and a 1.64x increase in inference throughput on datasets such as ADE20K, SUIM, and RumexWeeds (Montello et al., 2 Oct 2025). Segmentation accuracy remains comparable to baseline ViTs due to the careful "unmerging" strategy and semantic alignment. The model excels particularly in scenarios with large homogeneous background regions, where token reduction is maximized.
| Configuration | GFLOPs Reduction | Speedup | Accuracy (Rel. to Baseline) |
|---|---|---|---|
| High (clustering) | 2.18x | 1.64x | Comparable |
| Low (clustering) | 1.5x | 1.2x | Comparable |
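These savings follow largely from the quadratic cost of self-attention in the token count: a back-of-envelope sketch with illustrative numbers (ViT-B/16's 14x14 = 196 patch tokens and $d = 768$; not the paper's FLOPs accounting).

```python
def attn_flops(n_tokens: int, d: int) -> int:
    """Rough self-attention cost: the QK^T and AV matmuls each take
    ~n^2 * d multiply-adds (QKV/output projections omitted)."""
    return 2 * n_tokens * n_tokens * d

# Halving the token count after merging cuts the attention matmul cost ~4x,
# which is how moderate token reductions compound into the reported savings.
full = attn_flops(196, 768)   # full ViT-B/16 patch grid
half = attn_flops(98, 768)    # after merging away half the tokens
```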
5. Comparative Analysis with Existing Methods
Unlike structured patch merging techniques such as ToMe (Token Merging) or Content-Aware Token Sharing (CTS), which utilize fixed or heuristic groupings, ClustViT supports unstructured merging based on semantic similarity—learned via the clustering module. Unlike CTS, which depends on a policy network trained separately, ClustViT is trained end-to-end with a composite loss:

$$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda\,\mathcal{L}_{\text{clust}},$$

where $\lambda$ controls the weight of the clustering regularization term. Furthermore, ClustViT is inherently suitable for dense prediction because its regenerator module restores the full-resolution tokens required for pixel-wise outputs.
6. Practical Significance and Application Scope
ClustViT is especially relevant for applications in robotics, mobile vision, and real-time systems where inference speed, memory efficiency, and scene understanding are paramount. Its ability to adapt the token count dynamically enables deployment on platforms with stringent latency constraints. The model's flexibility in configuring $K$ and its clustering semantics support adaptation to a wide spectrum of scene complexities.
While ClustViT is tailored for semantic segmentation, the architectural principles extend naturally to instance segmentation, panoptic segmentation, and potentially to video segmentation workflows, contingent upon the integration of temporal pseudo-clustering.
7. Extensions and Research Directions
Further research may explore optimization of the clustering regularization, dynamic adjustment of $K$, and improved regenerator architectures to minimize accuracy–efficiency trade-offs. Potential extensions of ClustViT include its application to instance-level dense prediction or fusion with graph-based global aggregation as exemplified by ClusterViG (Parikh et al., 18 Jan 2025). Integration of external semantic priors (e.g., attention maps, unsupervised region discovery) is a plausible direction.
This suggests that ClustViT's clustering-guided token compression and regeneration form a scalable, semantically robust approach to transformer-based dense prediction, promising improved practicality in domains typified by computational constraints and variable scene complexity.