
Loss-Based Clustering Overview

Updated 23 November 2025
  • Loss-based clustering is a paradigm that minimizes loss functions to quantify within-cluster fit and between-cluster separation.
  • It unifies classical methods like k-means with deep learning and robust extensions through a common loss framework.
  • The approach supports rigorous convergence analysis, distributed optimization, and improvements in robustness and interpretability.

Loss-based clustering refers to a broad class of clustering methodologies in which the cluster partition and/or latent representations are obtained by directly minimizing (or maximizing) a loss functional, usually formulated as a sum of within-cluster and between-cluster loss terms. Loss-based formulations unify classical algorithms (such as $k$-means and spectral methods), robust and regularized extensions, and numerous recent deep learning variants, by providing a criterion that measures both the fit of points to their assigned cluster (within-cluster loss) and the separation from other clusters (between-cluster loss). Such frameworks enable expressivity via the choice of loss, theoretical analysis of convergence and robustness, improved interpretability via explicit risk, and, particularly in modern deep clustering, seamless integration with representation learning and federated optimization.

1. General Principles of Loss-Based Clustering

The core principle of loss-based clustering is to pose clustering as an optimization problem: $\min_{C,\Theta} \mathcal{L}_{\text{total}}(C,\Theta; X)$, where $C$ encodes the cluster assignment, $\Theta$ are cluster parameters (such as centroids, hyperplanes, or latent representations), and $\mathcal{L}_{\text{total}}$ is a prespecified loss function. Losses can be instantiated at various levels:

  • Pointwise loss: Directly penalize assignment of sample $x_i$ to cluster $j$ via a distance or divergence.
  • Within- and between-cluster loss: Penalize dispersion of points in a cluster (compactness) and closeness to points/clusters outside.
  • Global or pairwise losses: Penalize summary indices (e.g., overall distortion, misclassification) or use all pairs.

For rigorous tractability and interpretability, loss-based clustering often imposes regularity (e.g., convexity, symmetry) and a decomposition into terms such as
$$\text{Total Loss} = \sum_{i=1}^N \Big[ c_w\,J^\mathrm{w}(d_{i,\text{in}}) + c_b \sum_{j\neq y_i} J^\mathrm{b}(d_{ij,\text{out}}) \Big] + \text{regularization},$$
where $J^\mathrm{w}$ measures within-cluster error, $J^\mathrm{b}$ between-cluster error, $d_{i,\text{in}}$ is the distance to the point's own cluster prototype, and $d_{ij,\text{out}}$ are the distances to other clusters (Wang et al., 2019).
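
To make this decomposition concrete, the following minimal Python sketch evaluates a loss of this form for a fixed assignment and set of centroids. The squared within-cluster term, the hinge-style between-cluster term, and the weights `c_w`, `c_b`, `margin`, and `reg` are illustrative assumptions, not the specific choices of any cited work.

```python
import numpy as np

def total_loss(X, labels, centroids, c_w=1.0, c_b=0.1, margin=1.0, reg=1e-3):
    """Evaluate a within/between decomposed clustering loss.

    X:         (N, d) data matrix
    labels:    (N,) integer cluster assignments
    centroids: (K, d) cluster prototypes
    """
    # Distances from every point to every centroid, shape (N, K).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

    N = X.shape[0]
    idx = np.arange(N)

    # Within-cluster term: squared distance to the assigned prototype.
    d_in = dists[idx, labels]
    within = c_w * np.sum(d_in ** 2)

    # Between-cluster term: hinge penalty for being too close to other prototypes.
    mask = np.ones_like(dists, dtype=bool)
    mask[idx, labels] = False
    between = c_b * np.sum(np.maximum(0.0, margin - dists[mask]) ** 2)

    # Simple regularization on the prototypes to discourage degenerate solutions.
    regularization = reg * np.sum(centroids ** 2)

    return within + between + regularization
```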

2. Foundational Loss-Based Models

Loss-based clustering emerged to unify and extend prototype-based and discriminative clustering families:

  • $k$-means: Minimizes the sum of squared errors between data and assigned centroid, i.e., $J^\mathrm{w}(d)=d^2$, $J^\mathrm{b}=0$.
  • Generalized Bregman/deviation loss models: Replace squared Euclidean distance with any Bregman divergence $D_\phi(x, \mu)$, yielding a broad class of centroid-based algorithms (Rigon et al., 2020); see the sketch at the end of this section.
  • Plane-based clustering: Instead of centroids, assign points to hyperplanes (or affine subspaces), and losses are measured as deviations from cluster-defined planes. Several well-known methods (k-plane, PPC, TWSVC, robust TWSVC, ramp-TWSVC) can be written as special cases of a master loss function (Wang et al., 2019).
  • Pairwise and Bayesian losses: Binder and VI losses are used for optimal partition estimation from Bayesian clustering posteriors (Dahl et al., 2021).

Loss-based criteria also underpin spectral and density-based algorithms when the loss is designed to encode affinity or density connectivity.
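
A minimal sketch of the Bregman-divergence family, assuming the divergence is supplied as a callable: the assignment step uses the chosen divergence, while the update step exploits the standard fact that the arithmetic mean minimizes total Bregman divergence within a cluster. The function names and the generalized-KL example below are illustrative, not tied to any specific cited implementation.

```python
import numpy as np

def squared_euclidean(x, mu):
    return np.sum((x - mu) ** 2, axis=-1)

def generalized_kl(x, mu, eps=1e-12):
    # Bregman divergence generated by phi(x) = sum x log x (nonnegative data).
    x = np.clip(x, eps, None)
    mu = np.clip(mu, eps, None)
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_kmeans(X, K, divergence=squared_euclidean, n_iter=100, seed=0):
    """Lloyd-style clustering with a pluggable Bregman divergence.

    Only the assignment step depends on the divergence; the update step
    is always the arithmetic mean of each non-empty cluster.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the chosen divergence.
        D = np.stack([divergence(X, mu) for mu in centroids], axis=1)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: mean of each non-empty cluster.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```

Swapping `divergence=generalized_kl` for the default recovers a KL-type centroid clustering for nonnegative data while leaving the alternating scheme unchanged.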

3. Loss Function Design: Within, Between, and Regularization Terms

The design of within- and between-cluster loss is central and determines the qualitative behavior of the clustering. Key requirements extracted from the loss-based plane clustering framework (Wang et al., 2019):

  • $J^\mathrm{w}(\rho)$: Even, non-decreasing in $|\rho|$, with $J^\mathrm{w}(0)=0$.
  • $J^\mathrm{b}(\rho)$: Even, non-increasing; penalizes points being close to non-own centroids/hyperplanes.
  • When $J^\mathrm{w}, J^\mathrm{b}$ satisfy these conditions, the assignment rule reduces to an argmin over $|f(x; w_j, b_j)|$ (Wang et al., 2019); a minimal assignment sketch follows this list.
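
A minimal sketch of that assignment rule for plane-based clustering, assuming the hyperplanes are given as rows of a matrix `W` with offsets `b` (hypothetical names used only for illustration):

```python
import numpy as np

def assign_to_planes(X, W, b):
    """Assign each point to the hyperplane it deviates least from.

    W: (K, d) normal vectors, b: (K,) offsets; f(x; w_j, b_j) = w_j @ x + b_j.
    Under the symmetry/monotonicity conditions on J^w and J^b above,
    the optimal assignment reduces to argmin_j |f(x; w_j, b_j)|.
    """
    # Absolute deviations from each plane, shape (N, K).
    F = np.abs(X @ W.T + b[None, :])
    return F.argmin(axis=1)
```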

Examples (the first three are sketched in code after this list):

  • Quadratic loss ($d^2$): Emphasizes tight, spherical clusters.
  • Ramp loss: Truncates the loss beyond a threshold, reducing sensitivity to outliers.
  • Cauchy loss ($\log(1+ r^2/c^2)$): Heavy-tailed, robust to gross errors, bounded influence (Li et al., 2019).
  • Divergence-based (e.g., Cauchy-Schwarz or Jensen-Shannon): Encourages mutual separation and compactness simultaneously (Kampffmeyer et al., 2019, Lim, 12 Dec 2024).
  • Distributional loss (Kuiper-based, in survival/lifetime clustering): Directly penalizes overlap of empirical cluster survival functions (Mouli et al., 2019).
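
The first three example losses can be written as one-line functions of the residual. In the sketch below, the "ramp" form is a simple truncated quadratic stand-in for the clipping idea (the published ramp-TWSVC losses differ in detail), and the threshold `t` and scale `c` are hypothetical tuning parameters.

```python
import numpy as np

def quadratic_loss(r):
    # Unbounded influence: distant points dominate the objective.
    return r ** 2

def ramp_loss(r, t=1.0):
    # Truncated quadratic: grows like r^2, then saturates at t^2,
    # bounding the influence of far-away points (outliers).
    return np.minimum(r ** 2, t ** 2)

def cauchy_loss(r, c=1.0):
    # Heavy-tailed, bounded-influence loss: log(1 + r^2 / c^2).
    return np.log1p((r / c) ** 2)
```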

Regularization on cluster parameters (norms, moments, etc.) further increases model fidelity and prevents degenerate solutions (e.g., all points collapsed into one cluster).

4. Algorithmic Schemes and Convergence Properties

Loss-based clustering algorithms typically alternate between model parameter updates and assignment updates, exploiting the structure of the loss for efficient optimization:

  • Block-alternating optimization: Fix assignment, optimize cluster parameters; fix cluster parameters, reassign based on loss.
  • Finite termination: For discrete assignments and strictly decreasing loss, the process is guaranteed to terminate at a local or weak local optimum in a finite number of steps (Wang et al., 2019).
  • Optimization methods: Classical clustering uses Lloyd-style assignment updates, IRR for robust loss, stochastic search (e.g., SALSO for Bayesian loss minimization) (Dahl et al., 2021), and deep architectures use backpropagation.
  • Distributed/federated extensions: Loss-based clustering structure enables efficient decentralized solutions where data are distributed across agents, each minimizing local losses and achieving consensus via penalty or communication (Armacki et al., 2 Feb 2024, Lin et al., 2023, Bhatia et al., 27 Jun 2025).

Table: Representative loss-based clustering objective forms

| Model family | Loss function/prototype | Assignment rule |
| --- | --- | --- |
| $k$-means | $J^\mathrm{w}(d)=d^2$ | Min Euclidean distance |
| Plane-based | $f(x;w_j, b_j)$ deviation | Min $\lvert f(x;w_j, b_j)\rvert$ |
| Robust subspace | Cauchy loss in residual | Weighted update, IRR |
| Info-theoretic | Divergence $D(p_k, p_j)$ | Min divergence-based loss |
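
As an illustration of the "weighted update, IRR" entry in the table, the sketch below computes a single robust prototype under the Cauchy loss by iteratively reweighted averaging. The weight formula follows from $\psi(r)/r$ for this loss, and the scale `c` is a hypothetical tuning parameter; a full clustering algorithm would alternate this update with reassignment.

```python
import numpy as np

def cauchy_weights(residuals, c=1.0):
    # Weights derived from the Cauchy loss log(1 + r^2/c^2):
    # w(r) proportional to 1 / (1 + r^2/c^2), so outliers get small weight.
    return 1.0 / (1.0 + (residuals / c) ** 2)

def robust_centroid(X, c=1.0, n_iter=50, tol=1e-8):
    """Iteratively reweighted estimate of one cluster prototype under the Cauchy loss."""
    mu = X.mean(axis=0)
    for _ in range(n_iter):
        r = np.linalg.norm(X - mu, axis=1)
        w = cauchy_weights(r, c)
        new_mu = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu
```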

5. Loss-Based Clustering in Deep and Nonparametric Models

Recent advances have embedded loss-based clustering into deep and nonparametric models:

  • Deep clustering: Combines autoencoders or CNNs with a clustering loss to encourage latent representations that are clustering-friendly, e.g., divergence-based, compactness/separability, or adversarial JSD-based objectives (Lim, 12 Dec 2024, Kampffmeyer et al., 2019); a minimal sketch follows this list.
  • Deep density-based losses: Incorporate DBSCAN-style density connectivity into a loss function to support nonconvex clusters and noise (Beer et al., 8 Oct 2024).
  • Open-world/open-set loss-based clustering: Class Anchor Clustering (CAC) uses a loss enforcing logit-space anchors for known classes, optimizing both intra-class compactness and inter-class separation and improving rejection of unknowns (Miller et al., 2020).
  • Infinite mixture and model selection: Dirichlet process mixtures with divergence-based loss allow automatic estimation of the number of clusters without prior knowledge (Lim, 12 Dec 2024).
  • Clustering of distributions: A loss (or divergence) defined over empirical measures or distributions (e.g., in survival analysis) is optimized to enforce separation of the entire cluster-wise distributions (Mouli et al., 2019).
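
A minimal PyTorch sketch of the joint reconstruction-plus-clustering objective described in the first bullet above: an autoencoder reconstruction loss is combined with a soft-assignment compactness term over learnable latent centroids. The layer sizes, temperature, and weight `lam` are illustrative assumptions and do not reproduce any specific published architecture.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ClusteringAutoencoder(nn.Module):
    """Sketch of a joint reconstruction + clustering-friendly latent objective."""

    def __init__(self, d_in, d_latent=10, n_clusters=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                     nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                     nn.Linear(64, d_in))
        # Learnable cluster centroids in the latent space.
        self.centroids = nn.Parameter(torch.randn(n_clusters, d_latent))

    def forward(self, x, lam=0.1, temperature=1.0):
        z = self.encoder(x)
        recon_loss = F.mse_loss(self.decoder(z), x)

        # Soft assignments from latent distances to the centroids.
        dists = torch.cdist(z, self.centroids)          # (batch, K)
        q = F.softmax(-dists / temperature, dim=1)

        # Compactness: expected squared distance to the assigned centroid.
        cluster_loss = (q * dists.pow(2)).sum(dim=1).mean()

        return recon_loss + lam * cluster_loss, q
```

Training the combined loss end-to-end by backpropagation shapes the latent space and the centroids jointly, which is the basic mechanism shared by the deep variants cited above.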

6. Loss-Based Clustering in Distributed and Federated Settings

Loss-based clustering is fundamental to many distributed and federated learning approaches:

  • Robust FL: Client updates are filtered by scoring losses on a trusted set and applying 2-means to distinguish honest from malicious or misbehaving clients. Aggregation is restricted to the low-loss cluster, ensuring robustness under arbitrary Byzantine failures, with provable suboptimality gaps (Kritharakis et al., 18 Aug 2025); see the sketch after this list.
  • Clustered/personalized FL: Client embedding vectors (averaged per-model or per-class losses) are clustered to identify clusters of similar clients, enabling accelerated convergence, robustness to non-IID, and rapid cluster recovery without ideal initialization (Bhatia et al., 27 Jun 2025, Lin et al., 2023).
  • Distributed optimization: General frameworks (e.g., DGC-$\mathcal{F}_\rho$) support any smooth convex loss, enforce local data fidelity and global cluster consensus, and converge to the centralized solutions as the penalty increases (Armacki et al., 2 Feb 2024).
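
A minimal sketch of the loss-based client filtering idea from the robust-FL bullet above: each client update is scored by its loss on a trusted set, the scalar scores are split by 2-means, and only the low-loss cluster is averaged. The helper names and the plain averaging step are assumptions for illustration, not the exact scoring or aggregation rule of the cited work.

```python
import numpy as np

def two_means_1d(values, n_iter=100):
    """2-means on scalar scores; returns a boolean mask for the low-mean cluster."""
    lo, hi = values.min(), values.max()
    for _ in range(n_iter):
        assign_lo = np.abs(values - lo) <= np.abs(values - hi)
        new_lo = values[assign_lo].mean()
        new_hi = values[~assign_lo].mean() if np.any(~assign_lo) else hi
        if np.isclose(new_lo, lo) and np.isclose(new_hi, hi):
            break
        lo, hi = new_lo, new_hi
    return assign_lo

def robust_aggregate(client_updates, trusted_losses):
    """Average only updates from clients whose trusted-set loss is in the low-loss cluster."""
    keep = two_means_1d(np.asarray(trusted_losses))
    return np.asarray(client_updates)[keep].mean(axis=0)
```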

7. Key Theoretical Results and Empirical Evidence

Loss-based clustering frameworks admit strong theoretical guarantees when losses are properly constructed:

  • Finite convergence: For discrete labelings, block-alternating optimization cycles cannot continue indefinitely (Wang et al., 2019).
  • Robustness: Bounded influence (e.g., Cauchy loss) ensures robustness to gross outliers (Li et al., 2019).
  • Separability and consistency: Bayesian loss-based criteria (Binder, VI) and distinguishability measures (misclassification risk) guarantee that the estimated partition is optimal under the specified loss (Dahl et al., 2021, Turfah et al., 24 Apr 2024); the Binder criterion is sketched after this list.
  • Consistency in distributed settings: Distributed gradient clustering converges to the set of Lloyd points for Bregman losses, with global consensus in the limit (Armacki et al., 2 Feb 2024); theoretical cluster recovery is possible in a single round for clustered federated learning with suitable loss vector embeddings (Bhatia et al., 27 Jun 2025).
  • Empirical superiority: Experimental benchmarks across synthetic and real datasets (UCI, MNIST, CIFAR, large-scale GMMs, survival data) demonstrate that carefully chosen loss-based schemes outperform classical baselines, provide robustness, and scale to large problem sizes (Wang et al., 2019, Kritharakis et al., 18 Aug 2025, Zhou et al., 2023, Mouli et al., 2019).
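
To make the Binder criterion concrete, the sketch below computes the expected (equal-cost) Binder loss of a candidate partition from posterior co-clustering probabilities estimated from MCMC samples. This is only the criterion that stochastic search procedures such as SALSO minimize, not the search itself, and the function names are illustrative.

```python
import numpy as np

def coclustering_probs(partitions):
    """Posterior co-clustering probabilities from sampled partitions.

    partitions: (S, N) array, each row a sampled label vector.
    """
    S, N = partitions.shape
    pi = np.zeros((N, N))
    for c in partitions:
        pi += (c[:, None] == c[None, :])
    return pi / S

def expected_binder_loss(candidate, pi):
    """Expected Binder loss (equal pairwise costs) of a candidate partition."""
    same = (candidate[:, None] == candidate[None, :])
    # Penalize pairs clustered together that the posterior tends to separate,
    # and pairs separated that the posterior tends to cluster together.
    pairwise = np.where(same, 1.0 - pi, pi)
    iu = np.triu_indices(len(candidate), k=1)
    return pairwise[iu].sum()
```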

Loss-based clustering provides a unifying and extensible paradigm for modern clustering, enabling explicit risk minimization, rigorous comparison across frameworks, robustness to noise and outliers, direct quantification of uncertainty, and integration with deep and distributed systems. The expressivity of the loss, and the choice of within/between terms and regularization, critically determine the statistical and computational properties of a given clustering scheme. For comprehensive mathematical details and empirical validations, see (Wang et al., 2019, Miller et al., 2020, Lim, 12 Dec 2024, Kritharakis et al., 18 Aug 2025, Li et al., 2019, Lin et al., 2023, Zhang et al., 8 Oct 2025, Dahl et al., 2021, Mouli et al., 2019, Beer et al., 8 Oct 2024, Kampffmeyer et al., 2019, Bhatia et al., 27 Jun 2025, Zhou et al., 2023, Rigon et al., 2020, Lim, 12 Dec 2024), and (Turfah et al., 24 Apr 2024).
