Clusters in Self-Attention Dynamics

Updated 6 November 2025

Clusters in self-attention dynamics are emergent patterns where token embeddings aggregate into distinct geometric groups during network propagation.
Mathematical characterizations using gradient flows, eigenspectrum analysis, and optimal transport reveal the mechanics behind attention localization and clustering.
Architectural strategies leverage these clustering phenomena to improve model efficiency, interpretability, and performance across various modalities.

Clusters in Self-Attention Dynamics refer to the emergent pattern in which token representations or attention weights within a transformer model aggregate into distinct groups during the forward or backward pass, or over network depth. This phenomenon manifests both at the representational level (tokens collapsing or polarizing into tight geometric sets) and in the structure of the attention or parameter matrices (e.g., dominant columns or directional subspaces). Clustering under self-attention is not limited to one architectural domain; it arises in language, vision, and other modalities, is evident in both empirical visualizations and rigorous mathematical models, and is deeply intertwined with core issues of expressivity, trainability, complexity, and interpretability in transformer networks.

1. Mathematical Characterizations of Clustering

The paper of clustering in self-attention is grounded in various mathematical frameworks. In most formulations, the evolution of token embeddings $\{x_i\}_{i=1}^n$ across transformer layers is modeled as a dynamical system. Continuous- and discrete-time limits, ODEs, and convex geometric analysis provide rigorous descriptions of how tokens and attention patterns evolve.

Gradient Flow and Particle Models

In settings with layer normalization and symmetric value/query/key matrices, the evolution is described on the unit sphere $\mathbb{S}^{d-1}$ : $\dot{x}_i(t) = P_{x_i(t)}\sum_{j=1}^{n} \frac{e^{\beta \left\langle x_i(t), x_j(t)\right\rangle}}{\sum_{k=1}^n e^{\beta \left\langle x_i(t), x_k(t)\right\rangle}} x_j(t)$ where $P_{x} = I_d - x x^\top$ tangentially projects the vector field. For general self-attention, the update mechanism can be cast as a local energy derivative: $x_i^{(t+1)} = -\frac{\partial}{\partial x_i} e_i(x_i, \{x_j\}) + \gamma\, x_i^{(t)}$ with the local energy $e_i(x_i, \{x_j\}) = -\log \sum_{j \neq i} \exp(x_i^\top \mathbb{J}_{ij} x_j)$ . This connects the system to attractor networks and pseudo-likelihood-based energy landscapes (D'Amico et al., 24 Sep 2024).

Eigenspectrum and Attention Localization

A central principle in recent analysis is the role of the eigenspectrum of the query-key (QK) matrix $W$ :

Strong localization (few tokens dominate attention) arises when the eigenspectrum is sharply concentrated around a nonzero mean and low variance:

$\mathrm{tr}(W^2) \to 0, \quad |\mathrm{tr}(W)| \gg 0$

The variance, $\mathrm{tr}(W^2) - \big|\mathrm{tr}(W)\big|^2 / d$ , dictates both how sharply attention clusters and the susceptibility of the system to rank or entropy collapse (Bao et al., 3 Feb 2024).

Frank–Wolfe and Geometric Hardmax

Under "hardmax" (zero-temperature) dynamics, self-attention steps can be interpreted as Frank–Wolfe (conditional gradient) steps for quadratic objectives over the convex hull $\mathcal{K}$ of token embeddings: $x_i^{t+1} = x_i^t + \gamma^t \left(\arg\max_{y \in \mathcal{K}} \langle B x_i^t, y \rangle - x_i^t\right)$

For $B$ positive semidefinite, this leads to convergence of tokens to the nearest vertex (a Voronoi-like clustering).
For $B$ negative semidefinite, all tokens contract to a consensus (single cluster) (Alcalde et al., 13 Aug 2025, Alcalde et al., 26 Jun 2024).

2. Emergence and Dynamics of Clusters

Metastability and Long-Lived Multi-Cluster States

Self-attention models exhibit dynamic metastability: after an initial mixing, particles (tokens) rapidly aggregate near multiple clusters and remain trapped in these configurations for exponentially long times in $\beta$ (sharpness), before ultimate collapse to single-cluster triviality (Geshkovski et al., 9 Oct 2024). The timescale for escape is

$T_{\mathrm{escape}} \sim \frac{\varepsilon}{n}e^{(1-\alpha)\beta}$

where $1-\alpha$ quantifies inter-cluster angular separation.

Causal Masking and Sequential Collapse

With causal masking (autoregressive/decoder-style attention), the dynamics become strictly hierarchical (token $k$ depends only on $j \leq k$ ). The main result is universal asymptotic collapse: for $V = I_d$ and arbitrary $Q, K$ , all $x_k(t) \to x_1(0)$ as $t\to\infty$ . However, at intermediate times, tokens rapidly assemble into metastable clusters, with their nucleation governed by a Rényi parking process: clusters form at minimally separated tokens (Rényi centers), which remain quasi-stationary for exponentially long periods (Karagodin et al., 7 Nov 2024).

Rényi Parking Model (Editor's term)

Parallel to sequential interval packing in combinatorial geometry, Rényi centers are chosen sequentially such that each is separated from prior tokens by at least $\delta \sim \beta^{-1/2}$ . Number of clusters scales as $1/\delta^{d-1}$ in dimension $d$ .

Role of Initial Conditions and Leaders

The limiting clusters are typically context-aware: their locations depend on the spectrum of the value matrix and on the original input embeddings. In both continuous- and discrete-time, the emergent clusters or leaders act as attractors, often corresponding to tokens with maximal influence ("leaders"). For instance, in 1D or for special value matrices, the attention converges to Boolean, low-rank matrices selecting out a few tokens (Geshkovski et al., 2023, Alcalde et al., 26 Jun 2024).

3. Architectural and Algorithmic Manifestations

Clustering Mechanisms in Efficient Transformers

Several architectures exploit clustering to reduce self-attention complexity:

FCA (Fine- and Coarse-Granularity Hybrid Self-Attention): Aggregates uninformative tokens into coarse clusters via weighted pooling, while informative tokens retain their identities; achieves $2\times$ FLOP reduction with near-lossless accuracy (Zhao et al., 2022).
CAST: Uses learnable surrogate tokens (cluster centroids) to define cluster affinities, performs intra-cluster attention, and shares cluster summaries to maintain information flow; empirical results show $6\times$ speedup and order-of-magnitude memory savings without compromising accuracy (Engelenhoven et al., 6 Feb 2024).
Cluster-Former/ClusTR: Groups tokens by k-means or density peaks on content, performs attention within clusters, and optionally feeds clustered representations to downstream tasks or backprop time steps (Wang et al., 2020, Xie et al., 2022).
Clustered Attention: Clusters queries (and optionally keys), computes centroid attention, and refines by focusing on top- $k$ keys per cluster to approximate full attention with linear scaling (Vyas et al., 2020).

Table: Clustering in Sparse Attention Architectures

Model	Clustering Target	Main Operation	Complexity
FCA	Uninformative tokens	Pooling	$O(k^2)$
CAST	All tokens (via surrogates)	Intra/inter attention	$O(\alpha N)$
Cluster-Former	Tokens (semantic)	Clustered within clusters	$O(L \cdot m)$
Clustered Attn	Queries (and keys)	Centroid attention	$O(NC)$
ClusTR	Key/Value tokens	Density peaks clustering	$O(NM)$

Biological and Cognitive Analogues

In DINO-trained ViTs, heads cluster into three interpretable regimes: foreground/face, whole-object (figure), and background. These correspond to observed human visual attention patterns, with cluster emergence in mid-to-deep layers and clear quantitative and spatial correspondence between head focus and human gaze data (Yamamoto et al., 30 Oct 2024).

Self-Supervised and Unsupervised Clustering

GATCluster fuses parameterized Gaussian attention with self-supervised objectives (transformation invariance, separability maximization, entropy analysis, soft-attention loss) to produce direct, one-hot semantic cluster assignments; clusters are guaranteed by the Label Feature Theorem under appropriate constraints (Niu et al., 2020).

4. Clustering and Model Expressivity, Trainability, and Collapse

Connections Between Spectrum, Entropy, and Collapse Modes

The variance and mean of the QK eigenspectrum regulate cluster formation, model expressivity, and entropy:

Small variance, fixed nonzero mean: localized clusters, high entropy, high expressivity.
Large variance: degenerate/vanishing signal, low entropy, poor trainability.
Uniform spectrum: nearly uniform attention, minimal localization (Bao et al., 3 Feb 2024).

Rank collapse (degenerate, low-rank embeddings) and entropy collapse (near-deterministic attention) are both prevented by minimizing spectrum variance; optimal eigenspectrum control achieves both robust cluster formation and effective learning.

Normalization and Cluster Collapse Regulation

Normalization schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, LN-Scaling) act as speed regulators on representation dynamics (Karagodin et al., 24 Oct 2025):

Post-LN: Exponential clustering/rapid collapse (risking over-smoothing).
Pre-LN/Peri-LN: Polynomial slow-down, deeper layers retain representation richness.
Peri-LN combines early rapid and late controlled motion, achieving optimal depth-wise utilization.

5. Geometric and Theoretical Insights: Energy and Metric Structures

Gradient flow and optimal transport analyses frame self-attention dynamics as a minimizing process in measure space: $\partial_t \mu_t = -\nabla_{\mathcal{S}}\cdot\left( m_{\mu_t}(x)\, \nabla_{\mathcal{S}} \mathcal{E}'(\mu_t) \right)$ with $m_\mu$ and $\mathcal{E}(\mu)$ encoding nonlocal self-attention energies and sphere geometry (Burger et al., 6 Jan 2025). Stationary points correspond to:

Dirac measures at eigenvectors (tight clusters, mode collapse),
Uniform or spread measures (maximal mixing), depending on the spectrum of the interaction matrix.

Finite-temperature (softmax) dynamics exhibit metastability: systems remain in clustered states mirroring hardmax partitions (e.g., by Voronoi tessellation) for times exponential in $\beta$ , justifying approximations and coarse-grained modeling (Alcalde et al., 13 Aug 2025).

6. Open Problems and Extensions

Several unresolved issues remain:

Generalizing clustering analysis to arbitrary, non-symmetric, or data-dependent $Q, K, V$ matrices.
Extension to architectures with multi-head attention, nonlinear feed-forward blocks, and tied/untied layers.
Universal characterization of interface between clustering, expressivity, and depth-induced collapse in practical, deep transformer stacks.
Empirical and theoretical determination of the evolution (and utility) of meta-stable clusters in causal attention for practical generative tasks.
Further integration of these geometric and combinatorial insights with development of new, principled sparse or hybrid attention architectures.

Clusters in self-attention dynamics arise from the intrinsic geometry and algebra of the attention mechanism, modulated by architectural choices and training regimes. Their mathematical theory provides predictive tools for analyzing transformer representations, informs architectural design (especially for efficiency and interpretability), and unifies seemingly disparate empirical phenomena—context-awareness, leader selection, expressivity and collapse, and biological plausibility—under a single dynamical-systems and metric-geometry framework.