Semi-Centroid Clustering Methods
- Semi-centroid clustering is a hybrid approach that integrates centroid-based and pairwise similarity measures, enabling both hard and fuzzy assignments.
- It leverages a convex combination of centroid and intra-cluster losses to optimize clustering performance while ensuring fairness and robustness.
- Algorithmic frameworks like fuzzy K-means and bridged clustering support semi-supervised, multi-modal representation learning with strong theoretical and empirical guarantees.
Semi-centroid clustering refers to a class of clustering and representation learning paradigms that interpolate between centroid-based clustering (where each cluster is summarized by a prototypical centroid) and non-centroid (centerless) clustering (where the organization and evaluation of clusters rely exclusively on intra-cluster relationships, especially pairwise similarities or distances). Unlike classical centroid-based methods such as $k$-means, semi-centroid techniques admit hybrid or entirely centroid-free characterizations, allow flexible loss definitions, and support both hard and fuzzy assignments. They provide substantial robustness, interpretability, and fairness guarantees across diverse scenarios, including unsupervised, semi-supervised, and multi-modal representation learning.
1. Formal Definitions and Paradigms
A semi-centroid clustering over a set of agents $N$ (with $|N| = n$) and candidate centers $M$ divides $N$ into $k$ clusters and selects a center for each. The individual loss for member $i$ in cluster $C$ with center $c$ is parameterized by a convex combination of centroid and non-centroid terms:

$$\ell_\lambda(i, C, c) = \lambda \, d(i, c) + (1 - \lambda) \max_{j \in C} d'(i, j)$$

for $\lambda \in [0, 1]$, where $d$ and $d'$ are fixed (pseudo)metrics defining the centroid and maximum intra-cluster loss respectively (Cookson et al., 1 Jan 2026). Setting $\lambda = 1$ recovers centroid-based clustering (e.g., $k$-means, $k$-medians), and $\lambda = 0$ yields non-centroid clustering. Intermediate values of $\lambda$ yield blended loss functions that interpolate between the two regimes.
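As a concrete illustration, here is a minimal sketch of the blended loss in the form above: $\lambda$ times the distance to the center plus $(1 - \lambda)$ times the maximum intra-cluster distance. The function name, the Euclidean metric, and the toy points are illustrative assumptions:

```python
import math

def hybrid_loss(i, cluster, center, points, lam):
    """Blended semi-centroid loss for agent i: lam * distance to the cluster's
    center plus (1 - lam) * maximum intra-cluster distance."""
    d = lambda a, b: math.dist(points[a], points[b])
    centroid_term = d(i, center)
    intra_term = max(d(i, j) for j in cluster if j != i) if len(cluster) > 1 else 0.0
    return lam * centroid_term + (1 - lam) * intra_term

points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0), 3: (5.0, 5.0)}
# lam = 1 reduces to the pure centroid loss; lam = 0 to the non-centroid loss.
print(hybrid_loss(0, [0, 1, 2], 1, points, 1.0))  # distance to center only: 1.0
print(hybrid_loss(0, [0, 1, 2], 1, points, 0.0))  # max intra-cluster distance: 2.0
```

Sweeping `lam` between 0 and 1 traces the interpolation between the two regimes for the same partition.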
Centroid-free fuzzy clustering, as realized by Lu et al. (2024), eliminates the need for explicit centroids by encoding partition structure entirely via a fuzzy assignment matrix $U \in \mathbb{R}^{n \times c}$ and a fixed global distance matrix $D \in \mathbb{R}^{n \times n}$:

$$\min_{U} \; \sum_{k=1}^{c} \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} u_{ik}\, u_{jk}\, D_{ij}}{2 \sum_{l=1}^{n} u_{lk}}$$

subject to $\sum_{k=1}^{c} u_{ik} = 1$ and $u_{ik} \ge 0$ (Bao et al., 2024). All geometric and cluster structure is transferred from explicit centers to distance-weighted membership statistics.
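The objective can be evaluated without ever forming centroids. The sketch below assumes one plausible normalization (pairwise distances weighted by joint membership, divided by twice the cluster mass); the function name and toy data are illustrative:

```python
def fuzzy_objective(U, D):
    """Centroid-free fuzzy objective: for each cluster, the membership-weighted
    pairwise distances divided by twice the cluster mass. Only the assignment
    matrix U (n x c) and the distance matrix D (n x n) are needed."""
    n, c = len(U), len(U[0])
    total = 0.0
    for k in range(c):
        mass = sum(U[i][k] for i in range(n))
        if mass == 0.0:
            continue
        pairwise = sum(U[i][k] * U[j][k] * D[i][j]
                       for i in range(n) for j in range(n))
        total += pairwise / (2.0 * mass)
    return total

# Two tight pairs {0, 1} and {2, 3}; cross-pair distance 10.
D = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
good = fuzzy_objective([[1, 0], [1, 0], [0, 1], [0, 1]], D)
bad = fuzzy_objective([[1, 0], [0, 1], [1, 0], [0, 1]], D)
# The partition matching the tight pairs attains the lower objective value.
```

Because only $D$ enters the objective, swapping in a kernel-induced or graph distance requires no other change, which is the flexibility noted in Section 5.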
2. Algorithmic Frameworks for Semi-Centroid Clustering
Centroid-Free Fuzzy K-Means (FKMWC)
Lu et al. introduce a multiplicative update algorithm without explicit centroid maintenance (Bao et al., 2024):
- Initialization: Row-normalized random membership matrix $U$ (each row sums to 1).
- Main loop:
  - Compute the cluster masses $s_k = \sum_l u_{lk}$.
  - Compute the membership-weighted distances $w_{ik} = \sum_j u_{jk} D_{ij}$.
  - Form the update weights from $w_{ik}$ and $s_k$.
  - Update $U$ multiplicatively, increasing memberships associated with small weighted distances.
  - Renormalize the rows of $U$ such that $\sum_k u_{ik} = 1$.
This approach embeds centroid effects in the trace term and outputs only fuzzy memberships.
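To make the loop concrete, the toy refinement below moves each point to the cluster with the smallest membership-weighted average distance, the quantity the trace term encodes. This is an illustrative crisp stand-in, not the exact FKMWC multiplicative update:

```python
def refine_memberships(U, D, iters=20):
    """Illustrative crisp refinement of a centroid-free objective (NOT the
    exact FKMWC multiplicative update): each point moves to the cluster
    minimizing its membership-weighted average distance to the members."""
    n, c = len(U), len(U[0])
    for _ in range(iters):
        changed = False
        for i in range(n):
            best_k, best_cost = 0, None
            for k in range(c):
                mass = sum(U[j][k] for j in range(n) if j != i)
                cost = (sum(U[j][k] * D[i][j] for j in range(n) if j != i) / mass
                        if mass > 0 else 0.0)
                if best_cost is None or cost < best_cost:
                    best_k, best_cost = k, cost
            row = [1.0 if k == best_k else 0.0 for k in range(c)]
            if row != U[i]:
                U[i], changed = row, True
        if not changed:
            break
    return U

# Two tight pairs {0, 1} and {2, 3}; start from a deliberately bad split.
D = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
U = refine_memberships([[1, 0], [0, 1], [1, 0], [0, 1]], D)
# The refinement recovers the two tight pairs as clusters.
```

As in the FKMWC loop, no centroid is ever recomputed; only memberships and the fixed distance matrix drive the updates.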
Core-Approximate Semi-Centroid Clustering
Cookson, Shah, and Yu (2024) develop a polynomial-time 3-core approximate algorithm based on:
- Most-Cohesive Cluster (MCC) Extraction: Iteratively constructing tentative clusters by greedy minimization of maximal hybrid loss.
- Selective Switching: For each agent, opportunistic transfer between clusters based on potential reduction in loss, using constructed upper bounds on hybrid losses.
- Complexity: The algorithm runs in time polynomial in the numbers of agents, candidate centers, and clusters, and extensions operate in the dual-metric ($d$, $d'$) regime (Cookson et al., 1 Jan 2026).
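A simplified sketch of the MCC-extraction idea, assuming a pure distance loss and Euclidean points (the actual algorithm operates on hybrid losses over candidate-center sets and adds the selective-switching phase):

```python
import math

def most_cohesive_clusters(points, k):
    """Greedy MCC-style sketch: repeatedly extract the group of about n/k
    remaining agents with the smallest maximal distance to some candidate
    center, remove it, and continue on the remainder."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining and len(clusters) < k:
        size = math.ceil(len(remaining) / (k - len(clusters)))
        best = None
        for c in remaining:
            # The `size` nearest remaining agents to candidate center c.
            near = sorted(remaining,
                          key=lambda j: math.dist(points[c], points[j]))[:size]
            radius = max(math.dist(points[c], points[j]) for j in near)
            if best is None or radius < best[0]:
                best = (radius, c, near)
        _, center, members = best
        clusters.append((center, sorted(members)))
        remaining = [j for j in remaining if j not in members]
    return clusters

pts = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
result = most_cohesive_clusters(pts, 2)
# Extracts the two tight pairs, most cohesive first.
```

Extracting the most cohesive group first is what limits how much any later coalition could gain by defecting, which underpins the core-approximation guarantee.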
Semi-Supervised Sparse Bridged Clustering
Bridged Clustering (Katz et al. 2025) demonstrates a semi-centroid methodology for sparse alignment across domains:
- Step A: Cluster the input and output domains independently, producing input-domain centroids and output-domain centroids.
- Step B: Learn a sparse bridge mapping input clusters to output clusters, estimated from the small set of paired samples via their cluster-indicator maps (each paired sample associates its input cluster with its output cluster).
- Step C: Predict for a new input via its assigned input cluster, select the bridged output cluster, and output the corresponding output-domain centroid (Ye et al., 8 Oct 2025).
3. Fairness Criteria and Lower Bounds
Proportional fairness in semi-centroid clustering is formalized via the $\alpha$-core and $\alpha$-Fully Justified Representation (FJR):
- $\alpha$-core: No coalition $S$ with $|S| \ge n/k$ can, by defecting to a new center, improve every member's loss by more than a factor $\alpha$ relative to their losses in the current clusters.
- $\alpha$-FJR: No coalition $S$ with $|S| \ge n/k$ can simultaneously achieve, by more than a factor $\alpha$, strictly better loss for every member than the minimum loss attained within $S$ in the given clustering.
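A brute-force audit of the core condition for the pure centroid loss can be sketched as follows; the setup, names, and Euclidean metric are hypothetical:

```python
import math

def core_violation(points, centers_pool, assignment, centers, alpha, k):
    """Sketch of an alpha-core audit for the pure centroid loss: search for
    a coalition of at least n/k agents and a candidate center that would
    improve every member's loss by more than a factor alpha."""
    n = len(points)
    threshold = math.ceil(n / k)
    for c in centers_pool:
        coalition = [i for i in range(n)
                     if alpha * math.dist(points[i], c)
                     < math.dist(points[i], centers[assignment[i]])]
        if len(coalition) >= threshold:
            return coalition, c
    return None

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
pool = [(0.5, 0.0), (10.5, 0.0)]
# One catch-all center: the left pair can defect to (0.5, 0) and object.
unfair = core_violation(pts, pool, [0, 0, 0, 0], [(5.0, 0.0)], 1.0, 2)
# Two well-placed centers: no coalition of size >= n/k can improve.
fair = core_violation(pts, pool, [0, 0, 1, 1], pool, 1.0, 2)
```

Raising `alpha` relaxes the test: a clustering that fails the exact core (`alpha = 1.0`) may still satisfy a 3-core audit, matching the bounds tabulated below.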
Cookson et al. establish:
| Loss Function | Existential Bound ($\alpha$) | Poly-Time Bound ($\alpha$) | Lower Bound |
|---|---|---|---|
| Dual-metric hybrid | 3 | 3 + 2√3 | 2 (pure centroid) |
| Weighted single-metric () | min | min | max |
No finite simultaneous core-approximation is possible for arbitrary mixing of centroid/non-centroid or dual-metric losses (Cookson et al., 1 Jan 2026).
4. Theoretical and Empirical Guarantees
FKMWC achieves, on diverse real-world datasets (faces, images, texts), robust performance that matches or exceeds traditional baselines in accuracy (ACC), normalized mutual information (NMI), and purity, with limited sensitivity to initialization and regularization (Bao et al., 2024). For example, on the AR face dataset, ACC improved from 0.25 (K-Means++) to 0.39; on JAFFE, performance with the KNN distance reaches 0.97.
Bridged Clustering exhibits high label efficiency: one or two paired samples per cluster suffice to map centroids across modalities with exponentially small mis-bridging error. The overall risk decomposes as

$$\mathcal{R} \;\lesssim\; \sigma_Y^2 + \Delta_Y^2 \left( \varepsilon_{\mathrm{cluster}} + \varepsilon_{\mathrm{bridge}} \right),$$

where $\sigma_Y^2$ is the within-cluster variance in $Y$, $\Delta_Y$ is the maximum inter-centroid distance, and the $\varepsilon$ terms reflect mis-clustering and mis-bridging rates with explicit exponential bounds under sub-Gaussianity and separation conditions (Ye et al., 8 Oct 2025).
5. Structural Properties, Interpretability, and Use Cases
Semi-centroid and centroid-free methods offer several structural and practical advantages:
- Robustness: By eliminating explicit centroid recomputation, algorithms are less sensitive to noise and initialization (Bao et al., 2024).
- Flexibility: Choice of distance metric allows seamless transition to kernel methods, graph-based clustering, and support for non-Euclidean data (Bao et al., 2024).
- Fairness and representation: Algorithms enforce proportional representation and defend against coalition improvements, which are essential in societal or democratic allocation settings (Cookson et al., 1 Jan 2026).
- Interpretability: Sparse bridge matrices and cluster-centric assignments facilitate transparent prediction pipelines, in contrast to dense transport-based approaches (Ye et al., 8 Oct 2025).
- Applicability in semi-supervision: Techniques such as Bridged Clustering are particularly effective in low-supervision and semi-supervised learning contexts involving unpaired datasets and sparse ground-truth alignments (Ye et al., 8 Oct 2025).
Potential limitations include increased computational and storage costs for fully dense distance matrices ($O(n^2)$ storage, with comparable per-iteration cost), which can be mitigated by sparsification or graph-based approximations (Bao et al., 2024).
6. Connections and Extensions
Semi-centroid clustering generalizes and bridges classical approaches:
- In fuzzy clustering, FKMWC extends FCM by encoding cluster prototypes implicitly, showing full equivalence for squared Euclidean distance (Bao et al., 2024).
- Semi-centroid fairness algorithms synthesize the centroid and non-centroid paradigms, achieving bounded approximation and representation guarantees even under dual metrics (Cookson et al., 1 Jan 2026).
- Sparse-bridged approaches relate to multi-view and cross-modal representation learning, with interpretability and label efficiency advantages (Ye et al., 8 Oct 2025).
This framework admits further generalization to kernelized, graph-based, and constraint-driven clustering domains, supporting the evolving demands for robust, fair, and interpretable unsupervised and semi-supervised data partitioning.