Proportional Fairness in Clustering
- Proportional fairness criteria are a set of conditions in clustering that ensure each sufficiently large group receives outcomes proportional to its size.
- They integrate centroid, non-centroid, and semi-centroid models through combined loss functions to balance representation and clustering efficiency.
- Approximation algorithms like Dual-Metric, GC, and SemiBall achieve constant-factor fairness while managing efficiency-fairness tradeoffs in practical settings.
Proportional fairness criteria are a set of rigorous conditions applied in clustering analysis to ensure equitable outcomes for agents or data points, inspired by principles of proportional representation in democratic systems. These criteria formalize the requirement that any sufficiently large group of agents, one whose size meets its "entitlement" of roughly $n/k$ (the number of agents divided by the number of clusters), should not be able to substantially improve its members' clustering outcomes by deviating. Their study spans centroid, non-centroid, and newly unified semi-centroid clustering settings, with algorithms for exact and approximate satisfaction, matching lower bounds, and efficiency tradeoffs (Cookson et al., 1 Jan 2026).
1. Clustering Paradigms and Loss Functions
Proportional fairness criteria can be instantiated in several clustering frameworks, each associated with distinct notions of agent loss:
- Centroid Clustering: Each agent's loss is its distance to a representative centroid chosen from its cluster.
- Non-Centroid (Diameter) Clustering: Each agent's loss is the maximal distance to any other point in its cluster.
- Semi-Centroid Clustering (Editor's term): The loss for an agent $i$ assigned to cluster $C$ with center $c$ combines centroid and diameter contributions, e.g., $\ell_i(C, c) = d_{\mathrm{cen}}(i, c) + \max_{j \in C} d_{\mathrm{clu}}(i, j)$ in the dual-metric model, possibly parameterized by a weight $\lambda \in [0, 1]$, as in $\ell_i^{\lambda}(C, c) = \lambda\, d(i, c) + (1 - \lambda) \max_{j \in C} d(i, j)$, in weighted single-metric scenarios.
Let $N$ denote the set of $n$ agents, $M$ the set of allowable centers, $\ell_i$ the loss function for agent $i$, and $k$ the number of clusters; a clustering consists of a partition $(C_1, \ldots, C_k)$ of $N$ together with chosen centers $c_1, \ldots, c_k \in M$ (Cookson et al., 1 Jan 2026).
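To make the loss models concrete, here is a minimal Python sketch of the weighted single-metric (semi-centroid) loss in the convex-combination form given above; the matrix names and the $\lambda$-parameterization are illustrative assumptions rather than the paper's own notation.

```python
import numpy as np

def semi_centroid_loss(i, members, center, d_ac, d_aa, lam):
    """Weighted single-metric (semi-centroid) loss sketch for agent i.

    d_ac : (n, m) agent-to-candidate-center distances.
    d_aa : (n, n) agent-to-agent distances (same underlying metric).
    lam  : weight in [0, 1]; lam = 1 recovers the centroid loss,
           lam = 0 the non-centroid (max-diameter) loss.
    """
    centroid_part = d_ac[i, center]                   # distance to the cluster's center
    diameter_part = max(d_aa[i, j] for j in members)  # farthest co-clustered agent
    return lam * centroid_part + (1 - lam) * diameter_part

# Toy usage: 3 agents, 2 candidate centers.
d_ac = np.array([[1.0, 4.0], [2.0, 3.0], [5.0, 1.0]])
d_aa = np.array([[0.0, 2.0, 6.0], [2.0, 0.0, 4.0], [6.0, 4.0, 0.0]])
print(semi_centroid_loss(i=0, members=[0, 1], center=0, d_ac=d_ac, d_aa=d_aa, lam=0.5))
```

The dual-metric model is recovered by replacing $\lambda d$ and $(1-\lambda) d$ with two unrelated metrics.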
2. Formal Proportional Fairness Criteria
Two central proportional fairness criteria generalize previous representations of fairness in clustering:
- $\alpha$-Core: For $\alpha \ge 1$, a clustering is in the $\alpha$-core if no group $S \subseteq N$ of size $|S| \ge \lceil n/k \rceil$ and center $c' \in M$ satisfies
$$\alpha \cdot \ell_i(S, c') < \ell_i(C(i), c(i)) \quad \text{for all } i \in S,$$
where $C(i)$ and $c(i)$ denote agent $i$'s cluster and center in the given clustering. That is, no coalition of entitled size can simultaneously and strictly improve all of its members' losses by a factor of $\alpha$ or more by seceding and selecting a joint center. Centroid and non-centroid clustering correspond to the extreme weights $\lambda = 1$ and $\lambda = 0$, respectively, of the semi-centroid loss.
- $\alpha$-Fully Justified Representation (FJR): For $\alpha \ge 1$, a clustering satisfies $\alpha$-FJR if there is no group $S \subseteq N$ with $|S| \ge \lceil n/k \rceil$ and no center $c' \in M$ such that
$$\alpha \cdot \max_{i \in S} \ell_i(S, c') < \min_{i \in S} \ell_i(C(i), c(i)).$$
In this relaxed definition, a deviation counts as a violation only if even the worst-off agent after deviating improves on the best-off agent's current loss by a factor of $\alpha$. It holds that the $\alpha$-core implies $\alpha$-FJR (Cookson et al., 1 Jan 2026).
These criteria interpolate between earlier studies of centroid clustering proportional fairness [CFLM19] and non-centroid (democratic) clustering [CMS24].
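The two criteria translate directly into checks for whether a proposed deviation witnesses a violation. The sketch below is loss-agnostic and assumes the deviating and current losses have already been evaluated (e.g., with a loss function like the one above); the function and argument names are illustrative.

```python
import math

def witnesses_core_violation(S, loss_dev, loss_now, alpha, n, k):
    """True iff coalition S (with losses loss_dev[i] under its proposed joint center)
    violates the alpha-core: S is entitled (|S| >= ceil(n/k)) and every member
    strictly improves by a factor of at least alpha over its current loss."""
    if len(S) < math.ceil(n / k):
        return False
    return all(alpha * loss_dev[i] < loss_now[i] for i in S)

def witnesses_fjr_violation(S, loss_dev, loss_now, alpha, n, k):
    """True iff S violates alpha-FJR: even the worst-off deviating member beats
    the best-off member's current loss by a factor of alpha."""
    if len(S) < math.ceil(n / k):
        return False
    return alpha * max(loss_dev[i] for i in S) < min(loss_now[i] for i in S)
```

Any FJR violation is also a core violation for the same coalition, which is exactly why the $\alpha$-core implies $\alpha$-FJR.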
3. Algorithms for Approximating Proportional Fairness
Achieving proportional fairness criteria exactly is computationally intractable in general, so research has focused on constant-factor approximation algorithms.
- Dual-Metric-Core-Approx Algorithm:
- Phase 1 (MCC Covering): Iteratively locate the most cohesive clusters (size $\lceil n/k \rceil$) minimizing maximum agent loss, yielding clusters $B_1, \ldots, B_k$.
- Phase 2 (Safe Switching): Permit points to switch clusters only when their “upper-bound” loss improves and other cluster members are not excessively worsened.
- Theorem: Using an exact MCC subroutine in Phase 1, the output is in the $3$-core for dual-metric loss; using a polynomial-time $4$-approximate MCC subroutine instead yields a constant-factor core approximation in polynomial time (Cookson et al., 1 Jan 2026).
- Specialized Algorithms:
- Greedy Capture + Greedy Centroid (GC): For the weighted single-metric loss, GC achieves a constant-factor core approximation in polynomial time (see the ball-growing sketch after this list).
- Semi-Ball-Growing (SemiBall): Attains a constant-factor core approximation under the weighted single-metric loss, generally outperforming GC in the achieved fairness factor over a range of weights $\lambda$ (Cookson et al., 1 Jan 2026).
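Both GC and SemiBall build on ball-growing dynamics. The following is a minimal sketch of the classic greedy-capture rule in the spirit of [CFLM19], not the paper's exact GC or SemiBall procedures; it assumes distances are supplied as an agent-by-center matrix.

```python
import numpy as np

def greedy_capture(dist, k):
    """Ball-growing (greedy capture) sketch.

    dist : (n, m) distances from n agents to m candidate centers.
    All balls grow at the same rate; a center opens once its ball covers
    ceil(n/k) still-unassigned agents, and an opened center absorbs any
    unassigned agent its ball reaches. Returns agent -> center indices.
    """
    n, m = dist.shape
    quota = int(np.ceil(n / k))
    assigned = np.full(n, -1)
    opened = np.zeros(m, dtype=bool)
    # Sweep (distance, agent, center) events in increasing order of distance.
    for r, i, c in sorted((dist[i, c], i, c) for i in range(n) for c in range(m)):
        if assigned[i] != -1:
            continue
        if opened[c]:
            assigned[i] = c          # an open ball absorbs the agent it reaches
            continue
        # Would opening c at radius r capture a full quota of unassigned agents?
        in_ball = [j for j in range(n) if assigned[j] == -1 and dist[j, c] <= r]
        if len(in_ball) >= quota:
            opened[c] = True
            assigned[in_ball] = c
    return assigned
```

SemiBall presumably adapts this growth rule to the semi-centroid loss; consult the paper for its exact procedure.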
The paper summarizes, for the weighted single-metric loss, both the best existential core factor and the best polynomial-time computable core approximation as functions of the loss parameter $\lambda$; core fairness can be guaranteed up to specific constant factors for any $\lambda \in [0, 1]$.
4. Lower Bounds and Hardness Results
Worst-case analysis demonstrates that universally better core guarantees are impossible:
- Dual-metric loss: Setting the non-centroid (cluster-distance) component to zero retrieves centroid-only models, for which a $2$-core lower bound is proven [CFLM19]; thus, factors below $2$ are impossible.
- Weighted single-metric loss: Lower bounds derived from explicit example families rule out core approximation factors below certain $\lambda$-dependent constants.
- FJR relaxation: The trivial lower bound is $1$, which cannot be achieved unless exact FJR solutions can be computed (Cookson et al., 1 Jan 2026).
This establishes fundamental limits to achievable proportional fairness in clustering.
5. Algorithms for Fully Justified Representation (FJR)
A simple and always-correct algorithm exists for (approximate) FJR:
- Iterative-MCC-for-FJR Algorithm:
- Initialize the agent set $R \leftarrow N$.
- While $R$ is nonempty: find any $\beta$-approximate most cohesive cluster $S$ of size $\lceil n/k \rceil$ (or all remaining agents, if fewer are left) within $R$, include $S$ in the partition, and remove $S$ from $R$.
- Output the clustering.
Theorem: With a $\beta$-approximate MCC subroutine, this procedure guarantees $\beta$-FJR for arbitrary losses. Using a polytime $4$-MCC subroutine results in a polynomial-time $4$-FJR algorithm for dual-metric loss. GC yields $2$-FJR for max-diameter loss and $5$-FJR for centroid loss (Cookson et al., 1 Jan 2026).
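In the centroid special case, the most cohesive cluster can be computed exactly by scanning candidate centers, which makes the iterative loop easy to sketch. The interface below is an illustrative assumption; for dual-metric or diameter losses, the MCC step would be replaced by the (approximate) subroutines discussed above.

```python
import numpy as np

def iterative_mcc_fjr(dist, k):
    """Iterative most-cohesive-cluster (MCC) loop, specialized to centroid loss.

    dist : (n, m) distances from agents to candidate centers.
    Repeatedly finds the center whose quota nearest remaining agents have the
    smallest maximum distance, adds that cluster, and removes its agents.
    Returns a list of (center_index, member_list) pairs.
    """
    n, m = dist.shape
    quota = int(np.ceil(n / k))
    remaining = np.arange(n)
    clusters = []
    while remaining.size > 0:
        size = min(quota, remaining.size)
        best = None  # (max member loss, center, members)
        for c in range(m):
            nearest = remaining[np.argsort(dist[remaining, c])[:size]]
            radius = dist[nearest, c].max()
            if best is None or radius < best[0]:
                best = (radius, c, nearest)
        _, c, members = best
        clusters.append((c, members.tolist()))
        remaining = np.setdiff1d(remaining, members)
    return clusters
```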
6. Experimental Assessment and Efficiency-Fairness Tradeoffs
Empirical evaluation on UCI datasets (Iris, Pima-Diabetes, Adult) with algorithms including GC, SemiBall, $k$-means++, and $k$-medoids revealed:
- GC and SemiBall consistently attain near-unit violations (approximation factors $\approx 1$) of both core and FJR, significantly outperforming $k$-means++ and $k$-medoids in fairness, especially for small $k$.
- The efficiency cost, measured by the clustering objective (sum of distances), is typically mild, often only a few percent worse than the objective values of standard clustering algorithms.
- SemiBall usually provides marginally superior clustering value compared to GC, while maintaining ideal proportional fairness (Cookson et al., 1 Jan 2026).
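Such violation measurements can be reproduced in the centroid special case, where a deviating coalition's losses depend only on the chosen center, so the search over coalitions reduces to a scan over candidate centers. The audit below is a minimal sketch under that assumption; the interface is illustrative.

```python
import numpy as np

def core_violation_factor(dist, labels, cluster_centers, k):
    """Empirical core-violation factor under the centroid loss.

    dist            : (n, m) agent-to-candidate-center distances.
    labels          : (n,) cluster index of each agent in the audited clustering.
    cluster_centers : candidate-center index used by each cluster.
    Returns the largest factor by which some coalition of size ceil(n/k) could
    uniformly improve by deviating to a single center; a value <= 1 certifies
    membership in the (exact) core.
    """
    n, m = dist.shape
    quota = int(np.ceil(n / k))
    cluster_centers = np.asarray(cluster_centers)
    current = dist[np.arange(n), cluster_centers[labels]]  # each agent's present loss
    worst = 0.0
    for c in range(m):
        ratios = current / np.maximum(dist[:, c], 1e-12)   # per-agent gain from joining c
        # The quota-th largest ratio is the improvement a full coalition can share.
        worst = max(worst, float(np.sort(ratios)[-quota]))
    return worst
```

A similar, slightly more involved audit applies to FJR.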
A plausible implication is that proportional fairness criteria can be practically realized with very limited compromise to clustering efficiency.
7. Connections to Semi-Centroid Bridged Clustering
Semi-centroid clustering, as generalized in the proportional fairness context, encompasses methods such as Bridged Clustering (Ye et al., 8 Oct 2025). In Bridged Clustering, unpaired input and output datasets are clustered independently and then sparsely bridged via a small set of matched pairs; prediction is centroid-based, and the approach is model-agnostic and label-efficient. The underlying mathematical structure aligns with the dual-metric and weighted single-metric loss constructions central to proportional fairness analysis. This suggests that proportional fairness algorithms and guarantees may serve as a theoretical foundation for fairness in semi-supervised and representation-learning frameworks such as Bridged Clustering.
References
- "Unifying Proportional Fairness in Centroid and Non-Centroid Clustering" (Cookson et al., 1 Jan 2026)
- "Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging" (Ye et al., 8 Oct 2025)
- Chen et al., "Proportionally Fair Clustering" [CFLM19]
- Caragiannis et al., "Democratic clustering" [CMS24]
Proportional fairness criteria thus unify fairness guarantees in clustering, admit robust constant-factor approximations, and extend to new mixed-loss and semi-supervised paradigms.