
Semi-Centroid Clustering Methods

Updated 8 January 2026
  • Semi-centroid clustering is a hybrid approach that integrates centroid-based and pairwise similarity measures, enabling both hard and fuzzy assignments.
  • It leverages a convex combination of centroid and intra-cluster losses to optimize clustering performance while ensuring fairness and robustness.
  • Algorithmic frameworks like fuzzy K-means and bridged clustering support semi-supervised, multi-modal representation learning with strong theoretical and empirical guarantees.

Semi-centroid clustering refers to a class of clustering and representation learning paradigms that interpolate between centroid-based clustering (where each cluster is summarized by a prototypical centroid) and non-centroid (centerless) clustering (where the organization and evaluation of clusters rely exclusively on intra-cluster relationships, especially pairwise similarities or distances). Unlike classical centroid-based methods such as $k$-means, semi-centroid techniques admit hybrid or entirely centroid-free characterizations, allow flexible loss definitions, and support both hard and fuzzy assignments. They provide substantial robustness, interpretability, and fairness guarantees across diverse scenarios, including unsupervised, semi-supervised, and multi-modal representation learning.

1. Formal Definitions and Paradigms

A semi-centroid clustering over a set of $n$ agents $N$ and candidate centers $M$ divides $N$ into $k$ clusters $C_1,\ldots,C_k$ and selects centers $x_1,\ldots,x_k\in M$. The individual loss for member $i$ in cluster $C_t$ with center $x_t$ is parameterized by a convex combination of centroid and non-centroid terms:

$$\ell_\alpha(i;C,x) = \alpha \cdot d^c(i,x) + (1-\alpha) \cdot \max_{j\in C} d^m(i,j)$$

for $\alpha\in[0,1]$, where $d^c$ and $d^m$ are fixed (pseudo)metrics for the centroid and maximum intra-cluster loss respectively (Cookson et al., 1 Jan 2026). Setting $\alpha=1$ recovers centroid-based clustering (e.g., $k$-means, $k$-medians), and $\alpha=0$ yields non-centroid clustering. Intermediate values of $\alpha$ yield blended loss functions that interpolate between the two regimes.
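As a concrete reading of this loss, the following is a minimal sketch assuming precomputed distance arrays `dc` and `dm`; the function and argument names are illustrative, not from the paper:

```python
def hybrid_loss(i, cluster, center, dc, dm, alpha=0.5):
    """Per-agent semi-centroid loss l_alpha(i; C, x).

    dc : (n, m) array, dc[i, x] = d^c(i, x) to candidate centers
    dm : (n, n) array, dm[i, j] = d^m(i, j) between agents
    """
    centroid_term = dc[i, center]                   # alpha = 1: pure centroid loss
    pairwise_term = max(dm[i, j] for j in cluster)  # alpha = 0: max intra-cluster loss
    return alpha * centroid_term + (1 - alpha) * pairwise_term
```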

Centroid-free fuzzy clustering, as realized by Lu et al. (2024), eliminates the need for explicit centroids by encoding partition structure entirely via a fuzzy assignment matrix $Y\in\mathbb{R}_+^{N\times K}$ and a fixed global distance matrix $D\in\mathbb{R}^{N\times N}$:

$$J(Y) = \mathrm{tr}\left(Y^T D Y P^{-1}\right) + \lambda \|Y\|_F^2$$

subject to $Y\mathbf{1}_K=\mathbf{1}_N$, with $P=\mathrm{diag}(p_{11},\ldots,p_{KK})$ and $p_{\ell\ell}=\sum_{i=1}^N y_{i\ell}$ (Bao et al., 2024). All geometric and cluster structure is transferred from explicit centers to distance-weighted membership statistics.
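A direct transcription of this objective as a sketch: `lam` stands in for $\lambda$, and the column-wise division implements right-multiplication by $P^{-1}$.

```python
import numpy as np

def fkmwc_objective(Y, D, lam):
    # J(Y) = tr(Y^T D Y P^{-1}) + lam * ||Y||_F^2
    p = Y.sum(axis=0)                       # diagonal of P: fuzzy cluster sizes
    return np.trace(Y.T @ D @ Y / p) + lam * np.sum(Y**2)
```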

2. Algorithmic Frameworks for Semi-Centroid Clustering

Centroid-Free Fuzzy K-Means (FKMWC)

Lu et al. introduce a multiplicative update algorithm without explicit centroid maintenance (Bao et al., 2024):

  • Initialization: Row-normalized $Y\in\mathbb{R}_+^{N\times K}$.
  • Main loop:
    • Compute $a_\ell = (Y^T D Y)_{\ell\ell}$.
    • Compute $p_{\ell\ell} = \sum_{i=1}^N y_{i\ell}$.
    • Form $G = (D + D^T)Y P^{-1} + 2\lambda Y$.
    • Update $y_{i\ell} \leftarrow y_{i\ell} \sqrt{a_\ell \, p_{\ell\ell}^{-2} / G_{i\ell}}$.
    • Renormalize rows of $Y$ such that $\sum_{\ell=1}^K y_{i\ell} = 1$.

This approach embeds centroid effects in the trace term $\mathrm{tr}(Y^T D Y P^{-1})$ and outputs only fuzzy memberships.
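A minimal sketch of one pass of this loop, following the update rule verbatim on a dense distance matrix `D`; the small `eps` guard against division by zero is my addition, not part of the stated algorithm:

```python
import numpy as np

def fkmwc_update(Y, D, lam=1e-3, eps=1e-12):
    p = Y.sum(axis=0)                          # p_ll = sum_i y_il
    a = np.einsum('il,ij,jl->l', Y, D, Y)      # a_l = (Y^T D Y)_ll
    G = (D + D.T) @ Y / p + 2 * lam * Y        # G = (D + D^T) Y P^{-1} + 2 lam Y
    Y = Y * np.sqrt((a / p**2) / (G + eps))    # multiplicative update of y_il
    return Y / Y.sum(axis=1, keepdims=True)    # renormalize rows to sum to 1
```

Iterating `fkmwc_update` until `fkmwc_objective` stops decreasing reproduces the loop above.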

Core-Approximate Semi-Centroid Clustering

Cookson, Shah, and Yu (1 Jan 2026) develop a polynomial-time 3-core-approximate algorithm based on:

  • Most-Cohesive Cluster (MCC) Extraction: Iteratively constructing tentative clusters by greedy minimization of the maximal hybrid loss (a schematic sketch follows this list).
  • Selective Switching: For each agent, opportunistic transfer between clusters based on potential reduction in loss, using constructed upper bounds on hybrid losses.
  • Complexity: The algorithm is polynomial in $n$, $k$, and $|M|$, and extensions operate in the dual-metric ($d^c$, $d^m$) regime (Cookson et al., 1 Jan 2026).
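The following schematic illustrates the MCC greedy principle only, under the simplifying assumption that each tentative cluster is grown to $\lceil n/k\rceil$ members by centroid distance; it is an illustration, not the authors' exact procedure, and selective switching is omitted:

```python
import numpy as np

def mcc_sketch(dc, dm, k, alpha=0.5):
    """Greedy Most-Cohesive-Cluster extraction (illustrative only).

    dc : (n, m) centroid distances d^c(i, y); dm : (n, n) pairwise d^m(i, j)
    """
    n, m = dc.shape
    size = -(-n // k)                  # ceil(n/k): the coalition-size threshold
    unassigned = list(range(n))
    clusters, centers = [], []
    for _ in range(k):
        if not unassigned:
            break
        best_loss, best = np.inf, None
        for y in range(m):             # tentative cluster around each candidate center
            idx = sorted(unassigned, key=lambda i: dc[i, y])[:size]
            # maximal hybrid loss over the tentative cluster's members
            loss = max(alpha * dc[i, y] + (1 - alpha) * dm[i, idx].max()
                       for i in idx)
            if loss < best_loss:
                best_loss, best = loss, (y, idx)
        y, idx = best
        centers.append(y)
        clusters.append(idx)
        unassigned = [i for i in unassigned if i not in idx]
    return clusters, centers
```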

Semi-Supervised Sparse Bridged Clustering

Bridged Clustering (Ye et al., 8 Oct 2025) demonstrates a semi-centroid methodology for sparse alignment across domains:

  • Step A: Cluster the input domain $X$ and output domain $Y$ independently, producing centroids $\{c_i^X\}$ and $\{c_j^Y\}$.
  • Step B: Learn a sparse bridge $B\in\mathbb{R}^{C \times C}$ via

$$\min_B \sum_{(x',y')\in S} \|B^T \phi_X(x') - \phi_Y(y')\|_2^2 + \lambda \|B\|_1,$$

given $k$ paired samples $S$ and cluster-indicator maps $\phi$.

  • Step C: Predict via $x\mapsto$ assigned input cluster $i^*$, select output cluster $j^* = \arg\max_j |B_{i^*,j}|$, and output $c_{j^*}^Y$ (Ye et al., 8 Oct 2025).
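A compact end-to-end sketch of these three steps using scikit-learn (assumed available); the cluster count `C`, penalty `lam`, and the Lasso surrogate for the $\ell_1$-penalized bridge objective are my choices for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def bridged_clustering(X, Y, Xp, Yp, C=5, lam=0.1):
    """Minimal sketch of Bridged Clustering (hypothetical names/parameters).

    X, Y   : large unpaired corpora from the two domains
    Xp, Yp : the k paired samples S (rows aligned)
    """
    km_x = KMeans(n_clusters=C, n_init=10).fit(X)   # Step A: cluster each domain
    km_y = KMeans(n_clusters=C, n_init=10).fit(Y)
    # one-hot cluster indicators phi_X, phi_Y evaluated on the paired samples
    phi_x = np.eye(C)[km_x.predict(Xp)]
    phi_y = np.eye(C)[km_y.predict(Yp)]
    # Step B: sparse bridge via an L1-penalized regression surrogate
    B = Lasso(alpha=lam, fit_intercept=False).fit(phi_x, phi_y).coef_.T
    def predict(x):
        i_star = km_x.predict(x.reshape(1, -1))[0]  # Step C: assigned input cluster
        j_star = np.abs(B[i_star]).argmax()         # bridged output cluster j*
        return km_y.cluster_centers_[j_star]        # output centroid c^Y_{j*}
    return predict
```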

3. Fairness Criteria and Lower Bounds

Proportional fairness in semi-centroid clustering is formalized via the $\alpha$-core and $\alpha$-Fully Justified Representation (FJR):

  • $\alpha$-core: No coalition $S$ with $|S|\ge n/k$ can collectively improve their losses by defecting to a new center $y\in M$, relative to their losses in the current clusters.
  • $\alpha$-FJR: A coalition $S$ with $|S|\ge n/k$ cannot simultaneously achieve strictly better loss than the minimum loss within $S$ in the given clustering.
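For concreteness, one common way to write the approximate core condition, reconstructed from the description above rather than quoted from the paper:

```latex
% rho-approximate alpha-core (hedged reconstruction, not verbatim):
% a clustering (C_1,...,C_k; x_1,...,x_k) is in the rho-approximate alpha-core
% if there exist no coalition S with |S| >= n/k and center y in M such that
\[
  \rho \cdot \ell_\alpha(i;\, S,\, y) \;<\; \ell_\alpha\bigl(i;\, C(i),\, x(i)\bigr)
  \quad \text{for all } i \in S,
\]
% where C(i) is agent i's assigned cluster and x(i) its center.
```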

Cookson et al. establish:

| Loss Function | Existential Bound ($\rho^*$) | Poly-Time Bound ($\rho_\lambda$) | Lower Bound |
|---|---|---|---|
| Dual-metric hybrid | 3 | $3 + 2\sqrt{3}$ | 2 (pure centroid) |
| Weighted single-metric ($\lambda$) | $\min\{2/\lambda, 3\}$ | $\min\{2/\lambda, f_\lambda\}$ | $\max\{g_\lambda, 2(1-\lambda)/(2\lambda+1)\}$ |

No finite simultaneous core-approximation is possible for arbitrary mixing of centroid/non-centroid or dual-metric losses (Cookson et al., 1 Jan 2026).

4. Theoretical and Empirical Guarantees

FKMWC achieves, on diverse real-world datasets (faces, images, texts), robust performance that matches or exceeds traditional baselines in accuracy (ACC), normalized mutual information (NMI), and purity, with limited sensitivity to initialization and regularization (Bao et al., 2024). For example, on the AR face dataset, ACC improved from $\sim$0.25 (K-Means++) to $\sim$0.39; on JAFFE, performance with KNN distance reaches $\sim$0.97.

Bridged Clustering exhibits high label efficiency: one or two paired samples per cluster suffice to map centroids across modalities with exponentially small mis-bridging error. Overall risk decomposes as

$$\mathbb{E}[\|Y-\hat{y}\|^2] \leq D_Y + (\varepsilon_X + \varepsilon_B + \varepsilon_Y)\cdot M$$

where $D_Y$ is the within-cluster variance in $Y$, $M$ is the maximum inter-centroid distance, and the $\varepsilon$ terms reflect mis-clustering and mis-bridging rates with explicit exponential bounds under sub-Gaussianity and separation conditions (Ye et al., 8 Oct 2025).
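Plugging illustrative numbers into the bound makes the trade-off concrete; the values below are invented for the example, not taken from the paper:

```python
# Toy evaluation of the risk bound (all values hypothetical).
D_Y = 0.10                                   # within-cluster variance of Y
M = 4.0                                      # maximum inter-centroid distance
eps_X, eps_B, eps_Y = 0.01, 0.005, 0.01      # mis-clustering / mis-bridging rates
bound = D_Y + (eps_X + eps_B + eps_Y) * M    # 0.10 + 0.025 * 4.0 = 0.20
print(f"risk bound: {bound:.2f}")
```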

5. Structural Properties, Interpretability, and Use Cases

Semi-centroid and centroid-free methods offer several structural and practical advantages:

  • Robustness: By eliminating explicit centroid recomputation, algorithms are less sensitive to noise and initialization (Bao et al., 2024).
  • Flexibility: Choice of distance metric $D$ allows seamless transition to kernel methods, graph-based clustering, and support for non-Euclidean data (Bao et al., 2024).
  • Fairness and representation: Algorithms enforce proportional representation and defend against coalition improvements, which are essential in societal or democratic allocation settings (Cookson et al., 1 Jan 2026).
  • Interpretability: Sparse bridge matrices $B$ and cluster-centric assignments facilitate transparent prediction pipelines, in contrast to dense transport-based approaches (Ye et al., 8 Oct 2025).
  • Applicability in semi-supervision: Techniques such as Bridged Clustering are particularly effective in low-supervision and semi-supervised learning contexts involving unpaired datasets and sparse ground-truth alignments (Ye et al., 8 Oct 2025).

Potential limitations include increased computational and storage costs for fully dense distance matrices ($\mathcal{O}(TKN^2)$ over $T$ iterations), which can be mitigated by sparsification or graph-based approximations (Bao et al., 2024); a common sparsification recipe is sketched below.
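One standard mitigation is a $k$-nearest-neighbour sparsification of $D$. The sketch below still forms the dense matrix once, so an index- or tree-based neighbour search would be needed to avoid even that; this recipe is a common default, not the paper's:

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_sparsify(X, k=10):
    D = cdist(X, X)                           # dense pairwise distances (one-off)
    keep = np.argsort(D, axis=1)[:, :k + 1]   # self plus k nearest neighbours per row
    S = np.zeros_like(D)
    rows = np.arange(len(X))[:, None]
    S[rows, keep] = D[rows, keep]             # zero out non-neighbour entries
    return np.maximum(S, S.T)                 # symmetrize: keep edge if either side kept it
```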

6. Connections and Extensions

Semi-centroid clustering generalizes and bridges classical approaches:

  • In fuzzy clustering, FKMWC extends FCM by encoding cluster prototypes implicitly, showing full equivalence for squared Euclidean distance (Bao et al., 2024).
  • Semi-centroid fairness algorithms synthesize the centroid and non-centroid paradigms, achieving bounded approximation and representation guarantees even under dual metrics (Cookson et al., 1 Jan 2026).
  • Sparse-bridged approaches relate to multi-view and cross-modal representation learning, with interpretability and label efficiency advantages (Ye et al., 8 Oct 2025).

This framework admits further generalization to kernelized, graph-based, and constraint-driven clustering domains, supporting the evolving demands for robust, fair, and interpretable unsupervised and semi-supervised data partitioning.
