
Learnable Clustering Module

Updated 29 January 2026
  • Learnable Clustering Module is a neural component that fuses unsupervised clustering with representation learning via end-to-end differentiable operations.
  • It employs learnable centroids and soft assignments to adaptively group data in latent spaces, facilitating integration with tasks like classification and reconstruction.
  • Regularization techniques such as orthogonality and entropy penalties are used to ensure stable, distinct cluster formation across various applications.

A learnable clustering module is a neural architecture or algorithmic component in which cluster assignment, cluster centroids, and typically the dimensionality reduction or representation learning are accomplished via end-to-end differentiable operations. Such modules fuse unsupervised clustering with neural network optimization, integrating clustering directly into model training; this enables adaptive grouping in latent spaces, more flexible representation merging, and simultaneous optimization with downstream objectives such as classification or reconstruction.

1. Historical Context and Motivation

Traditional clustering approaches—including k-means, Gaussian Mixture Models (GMM), spectral clustering, and community detection—require manual design of distance metrics, cluster counts, or affinity matrices, and usually separate embedding learning from clustering optimization. The limitations of fixed clustering algorithms have motivated neural clustering modules that learn to adapt centroids, metrics, and assignments within an end-to-end pipeline, particularly for domains like brain connectomics (Yang et al., 2024), object-centric learning (Kirilenko et al., 2023), transfer clustering (Zhang et al., 2023), recommendation (Liu et al., 2024), and graph clustering (Yang et al., 2022). These modules overcome the rigidity of fixed clusters and enable integration with joint loss functions for classification, contrastive learning, and other self-supervised objectives.

2. Architectural Principles and Workflow

Learnable clustering modules are architected with the following elements:

  • Embedding Layer: Input data (features, connectivity, spectral data) are mapped via MLPs, CNNs, or transformers into vector representations of nodes, pixels, tokens, or regions of interest (ROIs).
  • Learnable Centroids: A set of $K$ centroid vectors (often called prompt tokens, slot heads, or intent neurons) is initialized and jointly updated, via gradient descent, momentum-averaged updates, or EM-like recurrences. These centroids serve as community, object, or intent representatives.
  • Clustering Assignment: Samples are assigned to clusters via differentiable softmax over distances (dot-product, cosine, attention-based, Gaussian likelihoods), with assignment matrices produced that drive representation merging.
  • Cluster Merging: Feature embeddings are merged into community prototypes, slot vectors, or aggregated cluster states, typically by weighted summation controlled by soft assignment.
  • Orthogonality and Regularization: Orthogonality losses, entropy penalties, and sparsity regularizers are standard to promote compact intra-cluster representations and separation among community centroids.
  • Joint Integration: Output cluster representations are integrated into downstream classifiers or decoders, enabling seamless, concurrent learning.

For example, in the TC-BrainTF token clustering module (Yang et al., 2024), ROI embeddings $X \in \mathbb{R}^{N \times d}$ are concatenated with $K$ learnable prompt tokens $P \in \mathbb{R}^{K \times d}$, processed by a transformer encoder, assigned to communities via normalized cosine similarities, and merged into community embeddings $C \in \mathbb{R}^{K \times d}$.
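
The assign-and-merge workflow above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation; `token_cluster` is a hypothetical helper name, and shapes follow the $N$-embedding, $K$-token convention used here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_cluster(H, P):
    """Soft-assign N embeddings H (N, d) to K learnable prompt tokens P (K, d),
    then merge them into K community embeddings C (K, d)."""
    P_bar = P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-norm centroids
    A = softmax(H @ P_bar.T, axis=1)                      # (N, K) soft assignments
    C = A.T @ H                                           # (K, d) merged communities
    return A, C

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))   # N=8 ROI embeddings, d=4
P = rng.normal(size=(3, 4))   # K=3 prompt tokens
A, C = token_cluster(H, P)    # A: (8, 3) with rows summing to 1; C: (3, 4)
```

In a full model, `P` would be a trainable parameter updated by backpropagation alongside the encoder producing `H`.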

3. Mathematical Formulation

A typical learnable clustering module is mathematically described by:

Soft Assignments:

$$A_{ij} = \operatorname{softmax}_j\left( H_X[i] \cdot \bar{p}_j \right), \qquad A \in \mathbb{R}^{N \times K}$$

with unit-normed centroids $\bar{p}_j$.

Merging Operation:

$$C = A^T H_X, \qquad C \in \mathbb{R}^{K \times d}$$

assigning each community $j$ its weighted sum over assigned embeddings.

Orthogonality Regularization:

$$L_{\text{orth}} = \lVert P P^T - I_K \rVert_F$$

forces prompt embedding directions to be orthogonal.

Joint Loss:

$$L = L_{\text{cls}} + \lambda L_{\text{orth}}$$

where $L_{\text{cls}}$ is typically cross-entropy for classification, and $\lambda$ balances the two terms.
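
The orthogonality term can be checked numerically against its two extremes: an orthonormal centroid set gives zero penalty, while collapsed (identical) centroids give a positive one. A minimal sketch, with `orthogonality_loss` as an illustrative helper name:

```python
import numpy as np

def orthogonality_loss(P):
    """Frobenius penalty || P P^T - I_K ||_F on the K centroid rows of P."""
    K = P.shape[0]
    return np.linalg.norm(P @ P.T - np.eye(K), ord="fro")

P_orth = np.eye(3, 5)              # three exactly orthonormal rows in R^5
print(orthogonality_loss(P_orth))  # → 0.0

P_collapsed = np.ones((3, 5))      # all three centroids identical
print(orthogonality_loss(P_collapsed) > 0)  # → True
```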

Other models extend this with learnable similarity weights (Lyu et al., 2024), attention-modulated assignments (Zhang et al., 2023), GMM-based soft clustering (Kirilenko et al., 2023), or nonnegative template fitting (Watanabe et al., 2020). The mathematical underpinning connects to differentiable EM, attention mechanisms, matrix factorization, and permutation-invariant loss terms.

4. Regularization, Initialization, and Hyperparameters

Learnable clustering modules rely on well-chosen regularization:

  • Orthogonality Penalties: Prevent centroid collapse and encourage representational diversity, e.g., $\lVert P P^T - I_K \rVert_F$.
  • Entropy Losses: Avoid trivial assignment of all samples to a single cluster.
  • Weight Decay: Occasionally used to regulate centroid norms, but orthogonality often suffices.
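
One common form of the entropy penalty (an assumed formulation; the cited papers vary in details) minimizes the negative entropy of the mean assignment, which is maximal when every sample collapses to one cluster and minimal when cluster usage is uniform:

```python
import numpy as np

def entropy_loss(A, eps=1e-8):
    """Negative entropy of the mean assignment p = mean_i A[i].
    One-hot p (all samples in one cluster) gives the maximum value ~0;
    uniform cluster usage gives the minimum value -log K."""
    p = A.mean(axis=0)                   # average cluster usage, shape (K,)
    return float((p * np.log(p + eps)).sum())

uniform = np.full((6, 3), 1 / 3)              # all 3 clusters used equally
collapsed = np.tile([1.0, 0.0, 0.0], (6, 1))  # every sample in cluster 0
print(entropy_loss(uniform) < entropy_loss(collapsed))  # → True
```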

Initialization of centroids can use random sampling (Xavier/Glorot), k-means over a bootstrap batch, or fixed orthogonal vectors.
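
The initialization options above can be sketched as follows. This is illustrative only: `init_centroids` and its `method` flag are assumed names, and the k-means branch runs a few Lloyd iterations over a bootstrap batch rather than a full clustering.

```python
import numpy as np

def init_centroids(X, K, method="kmeans", n_iter=10, seed=0):
    """Initialize K centroids of dimension d from a bootstrap batch X (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    if method == "orthogonal":
        # Fixed orthonormal directions via QR decomposition (requires K <= d).
        Q, _ = np.linalg.qr(rng.normal(size=(d, K)))
        return Q.T                                    # (K, d) with orthonormal rows
    # k-means over the batch: seed with random samples, then run Lloyd steps.
    P = X[rng.choice(N, size=K, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - P[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (labels == k).any():
                P[k] = X[labels == k].mean(axis=0)    # move centroid to cluster mean
    return P
```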

Key hyperparameters include:

| Parameter | Range / Example Values | Role |
| --- | --- | --- |
| $d$ (embedding dim) | 128–512 | Controls feature/representation dimension |
| $K$ (community count) | {4, 8, 11} (TC-BrainTF) | Number of clusters/prompt tokens |
| $\lambda$ (orth. weight) | 0.5–1.0 | Loss trade-off |
| learning rate | 1e-4 with annealing | Optimizer parameter |
| batch size | 32–64 | Training batch size |
| epochs | 20–50 | Training duration |

These parameters are typically tuned by grid search or cross-validation on validation data (e.g., AUROC for ASD classification (Yang et al., 2024)).

5. Integration with Downstream Tasks and End-to-End Training

Learnable clustering modules are integrated directly with downstream objectives:

  • Classification: Flattened cluster-merge vectors are linearly projected and fed into MLP-type heads for phenotype prediction (e.g., ASD, gender), with losses summed for joint optimization (Yang et al., 2024).
  • Reconstruction: Cluster assignments serve as slot prototypes for set prediction in object-centric architectures (Kirilenko et al., 2023).
  • Self-Supervision and Contrastive Learning: Cluster centroids may act as pseudo-labels or anchors for contrastive objectives, facilitating unsupervised representation learning (Yang et al., 2022).
  • Recommendation Systems: Cluster centers function as latent intents, simultaneously optimized for next-item prediction and intent-behavior alignment (Liu et al., 2024).

Backpropagation traverses all clustering steps, updating both feature extractors and centroid parameters, except when exponential moving average (EMA) updates are used to maintain stable centroids without direct gradient flow (Dewis et al., 22 Jan 2026).
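
The EMA alternative keeps centroids out of the gradient path; a minimal sketch (in a real framework the per-cluster batch means would be computed under stop-gradient):

```python
import numpy as np

def ema_update(P, batch_means, momentum=0.99):
    """Momentum-averaged centroid update: P tracks the per-cluster batch means
    without receiving gradients, keeping the centroids stable during training."""
    return momentum * P + (1.0 - momentum) * batch_means

P = np.zeros((3, 4))             # current centroids
batch_means = np.ones((3, 4))    # cluster means from the current batch
P = ema_update(P, batch_means, momentum=0.5)
print(P[0, 0])  # → 0.5
```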

6. Empirical Performance and Impact

Learnable clustering modules demonstrate superior flexibility and benchmarking performance:

  • Brain Connectome Analysis: TC-BrainTF achieves improved neuroclassification accuracy and robust community clustering (Yang et al., 2024).
  • Object-Centric Learning: The Slot Mixture Module surpasses slot attention in set-prediction accuracy, e.g., CLEVR AP$_\infty$ = 99.4% (Kirilenko et al., 2023).
  • Recommendation: ELCRec improves NDCG@5 by 8.9% and reduces training time by 22.5% over non-learnable clustering (Liu et al., 2024).
  • Graph Clustering: Learnable augmentation provides state-of-the-art node grouping results (Yang et al., 2022).
  • Large-Scale Subspace Clustering: RPCM reduces complexity to linear ($O(N)$) and achieves 96% clustering accuracy on million-scale data (Li et al., 2020).

Ablation studies across papers indicate that orthogonality regularization, clustering integration, and adaptive centroid updates are critical to performance; removing these components degrades ARI, NMI, and classification metrics across domains.

7. Extensions and Theoretical Generalization

Research demonstrates that learnable clustering modules strictly generalize soft k-means, GMM, and matrix factorization clustering models, with universal approximation under symmetric, positive-definite attentional transformations (Zhang et al., 2023). Modules support flexible cluster counts (dynamic $K$ selection (Meier et al., 2018)), nonparametric Bayesian clustering (Iwata, 2021), and meta-learning on new clustering tasks (Jiang et al., 2019), underscoring their adaptability.

Ongoing work explores multi-view clustering integration (Chu et al., 2018), explainable template bank approaches (Watanabe et al., 2020), and transferability across source-target domains (Zhang et al., 2023), in addition to scaling frameworks to industrial applications with hundreds of millions of samples (Liu et al., 2024).


For comprehensive implementations and mathematical specifics, reference (Yang et al., 2024, Kirilenko et al., 2023, Zhang et al., 2023, Liu et al., 2024, Yang et al., 2022, Li et al., 2020).
