Gradient Clustering Methods

Updated 7 December 2025
  • Gradient clustering is a family of methods that leverage the geometry of gradients in data and parameter spaces for unsupervised partitioning.
  • It encompasses techniques such as cost minimization, modal clustering, and gradient space embedding to identify clusters and outliers efficiently.
  • These approaches are applied in centralized, distributed, and federated learning to enhance convergence, reduce variance, and improve subgroup inference.

Gradient clustering refers to a family of methodologies leveraging (i) the geometry of gradients in parameter or data space, (ii) gradient-based optimization, or (iii) the gradient flows of (pseudo-)densities as central tools for unsupervised partitioning or group identification. Gradient clustering techniques are now pervasive across classical machine learning (e.g., $k$-means, Bregman clustering), robust and distributed/federated learning, as well as nonparametric modal clustering based on density gradients. Approaches vary in their operational regime (centralized, distributed, federated, functional), their mathematical formalism (cost minimization, mean-shift/gradient flow, clustering in gradient space), and the statistical objects they process (data, parameter vectors, gradients, losses).

1. Gradient-Based Center Clustering: Frameworks and Convergence

Central to a broad class of methods is the principle of optimizing a clustering cost $J(A,x)$, representing the aggregate distance (under a convex or Bregman loss) between cluster centers $x_1,\ldots,x_K$ and data points, with $A$ capturing assignments. For general (not necessarily Bregman) costs $f(x,y)$, iterative schemes alternate between:

  • Assignment step: Reassignment of each point to its nearest center under a metric $g(x,y)$ coherent with $f$.
  • Center gradient step: Update of centers in the negative gradient direction with respect to $J$, i.e.,

x_i^{(t+1)} = x_i^{(t)} - \alpha \sum_{y\in C_i} p_y\,\nabla_x f(x_i^{(t)}, y)

Convergence to a set of appropriately defined fixed points is guaranteed under mild convexity and continuity assumptions, and the procedure is computationally simple, with per-iteration cost $O(NKd)$, eliminating the need for complete recomputation or inner minimizations at each step. Specializations to Bregman divergences yield convergence to centroidal Voronoi partitions, consistent with the classic Lloyd approach for $k$-means (Armacki et al., 2022).
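
As a concrete illustration, the following is a minimal sketch of the alternating scheme for the squared-Euclidean cost $f(x,y)=\|x-y\|^2$, in which the center gradient step becomes a damped Lloyd-style update; the function name, step size, and uniform weights $p_y$ are illustrative choices rather than the exact setup of the cited work.

```python
import numpy as np

def gradient_center_clustering(X, K, alpha=0.25, n_iter=100, seed=0):
    """Alternating assignment / center-gradient scheme for f(x, y) = ||x - y||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    centers = X[rng.choice(N, size=K, replace=False)].copy()
    weights = np.full(N, 1.0 / N)  # point weights p_y (uniform here)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center under the metric induced by f.
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)

        # Center gradient step: x_k <- x_k - alpha * sum_{y in C_k} p_y * grad_x f(x_k, y),
        # where grad_x f(x, y) = 2 (x - y) for the squared-Euclidean cost.
        for k in range(K):
            members = assign == k
            if members.any():
                grad = 2.0 * (weights[members, None] * (centers[k] - X[members])).sum(axis=0)
                centers[k] -= alpha * grad
    return centers, assign
```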

For distributed settings with private user data and a communication network topology, recent advances (Distributed Gradient Clustering, DGC) generalize this methodology with consensus mechanisms. DGC introduces a quadratic penalty on deviations between user center estimates and alternates between local clustering and communication-enforced consensus updates. The global objective is

J_\rho(x,C) = \frac{1}{\rho}J(x,C) + \frac{1}{2}x^\top L x

with $L$ the graph Laplacian. The algorithms guarantee monotonic decrease of $J_\rho$, stabilization of cluster assignments, and (as $\rho\to\infty$) convergence to a consensus clustering equivalent to the centralized optimum. Extensions support non-Euclidean costs (e.g., Huber, logistic, fair losses) and semi-supervised regularization (Armacki et al., 2 Feb 2024).
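
A hedged sketch of one DGC-style round is given below: every node keeps its own copy of the $K$ centers, takes a gradient step on its private clustering cost (scaled by $1/\rho$, as in $J_\rho$), and then moves toward its neighbours' estimates, which is the gradient of the penalty $\frac{1}{2}x^\top L x$. The squared-Euclidean cost, step sizes, and function names are placeholder assumptions, not the precise algorithm of the cited work.

```python
import numpy as np

def dgc_round(centers, local_data, adjacency, alpha=0.1, rho=10.0):
    """centers: (M, K, d) per-node center estimates; local_data: list of M (n_m, d) arrays;
    adjacency: (M, M) symmetric 0/1 matrix of the communication graph."""
    M, K, d = centers.shape
    degree = adjacency.sum(axis=1)
    new_centers = centers.copy()

    for m in range(M):
        X = local_data[m]
        # Local clustering gradient step on the node's private data (the J/rho term).
        assign = ((X[:, None, :] - centers[m][None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            members = X[assign == k]
            if len(members):
                new_centers[m, k] -= (alpha / rho) * 2.0 * (centers[m, k] - members).sum(axis=0)

        # Consensus step from the penalty (1/2) x^T L x, whose gradient at node m is
        # deg(m) * x_m - sum_j a_{mj} x_j: pull the node's centers toward its neighbours'.
        neighbour_sum = np.einsum('j,jkd->kd', adjacency[m], centers)
        new_centers[m] -= alpha * (degree[m] * centers[m] - neighbour_sum)

    return new_centers
```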

2. Gradient Space Clustering for Robust Subgroup Inference

Gradient clustering is also leveraged as an unsupervised tool for group inference by operating directly in the space of loss gradients. For each example $i$, the so-called gradient embedding $g_i = \nabla_\theta \ell(\theta_0; x_i, y_i)$ (for fixed reference parameters $\theta_0$) encodes the interaction between data and the model. Clustering these gradient vectors (typically with DBSCAN and cosine or mean-centered metrics) reveals latent subgroups and isolates outliers:

  • Majority groups align near zero in gradient space (well-classified).
  • Minorities and outliers correspond to distinctive nonzero clusters.

After clustering, group-aware training strategies (e.g., group DRO) use inferred group IDs to improve worst-group risk (Zeng et al., 2022). This procedure is outlier-robust and scales efficiently for high-dimensional models.
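
A minimal sketch of this pipeline, assuming a logistic-regression loss for the gradient embedding and scikit-learn's DBSCAN with a cosine metric; the model choice, hyperparameters, and function names are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def gradient_embeddings(theta0, X, y):
    """Per-example logistic-loss gradients at theta0: g_i = (sigmoid(x_i^T theta0) - y_i) * x_i."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta0)))
    return (probs - y)[:, None] * X  # shape (N, d)

def infer_groups(theta0, X, y, eps=0.3, min_samples=20):
    G = gradient_embeddings(theta0, X, y)
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(G)
    # Label -1 marks DBSCAN outliers; non-negative labels are candidate subgroups
    # whose IDs can then feed a group-aware objective such as group DRO.
    return labels
```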

3. Modal Clustering via Density Gradient Flows

A distinct paradigm, gradient clustering in the sense of mode-seeking, is rooted in the geometry of the underlying data distribution $p^*(x)$. Cluster assignments arise from following the gradient ascent flow:

\dot\gamma_x(t) = \nabla f(\gamma_x(t)), \quad f(x) = \log p^*(x)

or, in practice, its empirical estimate. These flows partition the space into basins of attraction of modes, corresponding exactly to the leaves of the density cluster tree (Hartigan tree). Efficient implementations include Mean-Shift ($m_h(x) \propto \nabla\hat f_h(x) / \hat f_h(x)$) and its generalizations, as well as direct estimation of the log-density gradient (LSLDG) for improved performance in high dimensions (Arias-Castro et al., 2021, Sasaki et al., 2014). The partition induced by gradient flows is provably asymptotically equivalent to level-set clustering as sampling grows dense.
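
For concreteness, a bare-bones mean-shift sketch with a Gaussian kernel is shown below: every point ascends the estimated density gradient, and points whose trajectories converge to the same mode share a cluster. The bandwidth, iteration budget, and rounding-based mode merging are illustrative simplifications.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=200, tol=1e-5):
    """Gaussian-kernel mean-shift: iterate each point toward the local kernel-weighted mean."""
    Z = X.copy()
    for _ in range(n_iter):
        # The mean-shift vector m_h(x) is proportional to grad f_hat_h(x) / f_hat_h(x);
        # for a Gaussian kernel it equals the kernel-weighted mean of the data minus x.
        diff = Z[:, None, :] - X[None, :, :]                      # (N, N, d)
        w = np.exp(-0.5 * (diff ** 2).sum(-1) / bandwidth ** 2)   # (N, N)
        shift = (w[:, :, None] * X[None, :, :]).sum(1) / w.sum(1, keepdims=True) - Z
        Z += shift
        if np.abs(shift).max() < tol:
            break
    # Trajectories ending at (numerically) the same mode define the clusters.
    modes, labels = np.unique(np.round(Z, 3), axis=0, return_inverse=True)
    return labels, modes
```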

The theory extends to infinite-dimensional $L^2$ spaces for functional data, where clusters correspond to basins of the gradient flow of a pseudo-density functional. Practical and consistent procedures estimate significant modes and mark sample clusters in both $L^2$ and finite-dimensional projections (Ciollaro et al., 2016).

4. Gradient Clustering in Federated and Decentralized Learning

Federated and decentralized learning with heterogeneous (non-IID) data distributions requires client grouping for fast convergence. Joint gradient-and-loss clustering schemes empower devices to assign themselves to clusters by evaluating both:

  • Gradient similarity ($S_{i,k}^t$), e.g., cosine similarity with the cluster update direction.
  • Model fit (training loss $\mathcal{L}_{i,k}^t$) on received cluster models.

Cluster identity is chosen to maximize a weighted combination,

s_i^t = \arg\max_k\left[\lambda S_{i,k}^t - (1-\lambda)\mathcal{L}_{i,k}^t\right]

enabling rapid, fully distributed grouping and dramatically reducing cluster formation iterations compared to loss-only or pure gradient alignment (Lin et al., 2023).
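
A minimal sketch of this selection rule, assuming each client already holds its local gradient, the latest update direction of every cluster model, and the corresponding local training losses; variable names and the default $\lambda$ are illustrative.

```python
import numpy as np

def select_cluster(local_grad, cluster_updates, local_losses, lam=0.7):
    """local_grad: (d,) client gradient; cluster_updates: (K, d) recent update direction per cluster;
    local_losses: (K,) training loss of each cluster model on the client's data."""
    sims = cluster_updates @ local_grad / (
        np.linalg.norm(cluster_updates, axis=1) * np.linalg.norm(local_grad) + 1e-12
    )
    # s_i^t = argmax_k [ lambda * S_{i,k}^t - (1 - lambda) * L_{i,k}^t ]
    scores = lam * sims - (1.0 - lam) * np.asarray(local_losses)
    return int(np.argmax(scores))
```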

Neighborhood Gradient Clustering (NGC) further exploits local inter-agent gradient clusters (model-variant and data-variant cross-gradients) in decentralized optimization, provably correcting both model and data heterogeneity biases and yielding superior non-IID performance (Aketi et al., 2022).
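
The following is only a schematic, heavily simplified sketch of how an agent might mix its own gradient with received model-variant and data-variant cross-gradients; the uniform averaging and the mixing weight are assumptions, not the actual NGC weighting rule.

```python
import numpy as np

def ngc_mixed_gradient(self_grad, model_variant_grads, data_variant_grads, alpha=0.5):
    """self_grad: (d,); the two lists hold cross-gradients exchanged with each neighbour."""
    cross = np.mean(np.vstack(model_variant_grads + data_variant_grads), axis=0)
    # Mix the agent's own gradient with the neighbourhood cross-gradient average.
    return (1.0 - alpha) * self_grad + alpha * cross
```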

5. Optimization and Theoretical Underpinnings

Several gradient clustering methods are interpretable as optimization of explicit (possibly nonconvex) cost functionals, with:

  • Variants of ultrametric fitting for hierarchical clustering via min-max constrained gradient descent, supporting fidelity, balance, triplet supervision, and Dasgupta-style regularizers (Chierchia et al., 2019).
  • Implicit gradient descent (stochastic backward Euler) for $k$-means, formulated as the fixed-point equation $x^{k+1}=x^k - \gamma\nabla\phi(x^{k+1})$, with formal convergence guarantees and reduced sensitivity to initialization relative to classical EM (Yin et al., 2017); a minimal sketch appears below.
  • Robust subspace clustering as stochastic optimization constrained to the Grassmannian, enabling adaptive step-size and efficient geodesic updating even under high outlier fractions (He et al., 2014).

All such algorithms admit efficient pseudocode, straightforward per-iteration complexity, and, for appropriate cost choices, formal convergence guarantees.
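
As one example of such pseudocode, here is a hedged sketch of the implicit (backward-Euler) update for the $k$-means potential $\phi$, with the fixed-point equation solved by a simple inner iteration; this deterministic toy version is not the stochastic scheme of Yin et al. (2017).

```python
import numpy as np

def kmeans_grad(centers, X):
    """Gradient of phi(c) = sum_i min_k ||x_i - c_k||^2 with respect to the centers."""
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    grad = np.zeros_like(centers)
    for k in range(len(centers)):
        members = X[assign == k]
        if len(members):
            grad[k] = 2.0 * (centers[k] - members).sum(axis=0)
    return grad

def backward_euler_step(centers, X, gamma=1e-3, inner_iters=50):
    """Solve c_new = centers - gamma * grad phi(c_new) by fixed-point iteration."""
    z = centers.copy()
    for _ in range(inner_iters):
        z = centers - gamma * kmeans_grad(z, X)
    return z
```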

6. Gradient Stratification, Variance Reduction, and Practical Implications

Gradient clustering is also exploited for variance reduction in minibatch optimization: the variance of the averaged minibatch gradient is provably minimized when minibatch samples are drawn one per cluster from a weighted clustering in gradient space. The clustering assignment solves a weighted $K$-means objective over per-example gradients. Empirically, this stratification reduces variance beyond simply doubling the batch size in some regimes, but offers little benefit when the intrinsic gradient structure is weak. Normalized gradient variance proves a strong predictor of optimization progress and learning-rate sensitivity (Faghri et al., 2020).
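
A small sketch of gradient-stratified minibatch sampling is given below, using plain scikit-learn KMeans as a stand-in for the weighted objective described above; names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_minibatch(per_example_grads, batch_size, seed=0):
    """Cluster per-example gradients and draw one index per cluster (stratum)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit_predict(per_example_grads)
    # One sample per gradient cluster: the averaged minibatch gradient has lower variance
    # than uniform sampling when the gradient space has pronounced cluster structure.
    return np.array([rng.choice(np.where(labels == k)[0]) for k in range(batch_size)])
```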

7. Summary Table: Major Types of Gradient Clustering

| Approach/Domain | Central Mechanism | Key Reference |
| --- | --- | --- |
| Center-based cost minimization | Alternating assignment, gradient step | (Armacki et al., 2022, Armacki et al., 2 Feb 2024) |
| Robust group inference | Gradient-space clustering + DBSCAN | (Zeng et al., 2022) |
| Modal/density-based (Mean-Shift, LSLDG) | Gradient (log-density) flow | (Arias-Castro et al., 2021, Sasaki et al., 2014) |
| Federated/decentralized learning | Gradient similarity + loss, NGC | (Lin et al., 2023, Aketi et al., 2022) |
| Hierarchical clustering (ultrametric fit) | Gradient descent on min-max functional | (Chierchia et al., 2019) |
| SGD variance reduction | Weighted $K$-means on gradient space | (Faghri et al., 2020) |

Gradient clustering, in its various forms, is foundational in clustering theory and practice, providing a principled, versatile toolset for identifying structure in diverse data and parameter spaces, with direct implications for robust learning, efficient computation, and interpretability across statistical and machine learning domains.
