Fair k-Clustering with Multiple Colors
- Fair k-clustering is a constrained clustering problem that ensures every cluster has equal representation from multiple protected groups under objectives like k-median, k-means, and k-center.
- It employs a black-box reduction that transforms any α-approximation for standard clustering into an (α+2)-approximation for fair clustering, guaranteeing exact balance.
- Experimental results show that the approach scales to large datasets while maintaining near-baseline costs and strict fairness without allowing additive violations.
A fair k-clustering problem is a constrained clustering formulation in which the assignments to clusters must respect the relative balance or representation of multiple “colors” (protected groups), with cost measured under classic objectives such as k-median, k-means, or k-center. The multi-color fair k-clustering problem, as studied in "Fair Clustering with Multiple Colors" (Böhm et al., 2020), formalizes fairness as a requirement that every cluster contain an identical count of each color, supporting arbitrary numbers of colors, clusters, and points. This problem poses unique structural and computational challenges beyond the two-color case, and remained open for true constant-factor approximation until the introduction of a black-box reduction from vanilla clustering objectives. The following sections survey formal definitions, objective functions, the reduction methodology and its theoretical guarantees, key analytic tools, computational complexity, and experimental highlights.
1. Formal Model: Fair k-Clustering with Multiple Colors
The input consists of a point set $P$ of $n$ points in a metric space $(X, d)$ (Euclidean or a general finite metric), partitioned by a coloring function $c: P \to \{1, \dots, \ell\}$ into color classes $P_1, \dots, P_\ell$, with $|P_j| = n/\ell$ for all $j$. The clustering task is to select $k$ clusters, each assigned a center, and assign each point to a center.
Exact balance constraint: For all clusters $C_i$ and all pairs of colors $j, j'$,
$$|C_i \cap P_j| = |C_i \cap P_{j'}|.$$
Consequently, each cluster size $|C_i|$ must be a multiple of $\ell$, and all color classes must distribute evenly among the $k$ clusters.
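To make the constraint concrete, here is a minimal validity check in Python (a sketch; the function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def is_exactly_balanced(clusters, colors):
    """Check the exact-balance constraint: within every cluster, every
    color appears with one shared multiplicity. `clusters` maps point
    ids to cluster ids; `colors` maps point ids to colors."""
    per_cluster = {}
    for p, cl in clusters.items():
        per_cluster.setdefault(cl, Counter())[colors[p]] += 1
    all_colors = set(colors.values())
    return all(
        set(counts) == all_colors and len(set(counts.values())) == 1
        for counts in per_cluster.values()
    )
```

For example, two clusters each containing one red and one blue point pass the check, while a cluster containing two reds and no blues fails.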
2. Clustering Objectives under Fairness
The fair k-clustering problem admits several center-based objectives, unified as
$$\mathrm{cost}(S, \varphi) = \sum_{p \in P} d(p, \varphi(p))^z,$$
with $S$ the set of cluster centers and $\varphi: P \to S$ the assignment. Special cases include:
- k-median: $z = 1$ (sum of distances)
- k-means: $z = 2$ (sum of squared distances)
- k-center: $\max_{p \in P} d(p, \varphi(p))$, the limiting case of the normalized objective as $z \to \infty$
All are subject to the exact-balance constraint per cluster.
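The three objectives can be evaluated on an explicit assignment with a small helper; a sketch in Python (names are illustrative):

```python
import math

def clustering_cost(points, centers, assignment, objective="k-median"):
    """Evaluate a center-based objective on an explicit assignment.
    `assignment[i]` is the index of the center serving points[i]."""
    dists = [math.dist(p, centers[assignment[i]]) for i, p in enumerate(points)]
    if objective == "k-median":
        return sum(dists)                 # sum of distances
    if objective == "k-means":
        return sum(d * d for d in dists)  # sum of squared distances
    if objective == "k-center":
        return max(dists)                 # maximum distance
    raise ValueError(f"unknown objective: {objective}")
```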
3. Black-Box Reduction from Unconstrained to Fair Clustering
The key contribution of (Böhm et al., 2020) is an algorithmic reduction that transforms any $\alpha$-approximation algorithm for the unconstrained (vanilla) $k$-clustering problem into an $(\alpha+2)$-approximation for the fair $k$-clustering problem with $\ell$ colors. This is the first such reduction to achieve a true constant factor for all three objectives.
Algorithmic Framework (paraphrased from Algorithm 1):
For each color $j \in \{1, \dots, \ell\}$, treated as the pivot block:
- For all $j' \neq j$, compute a min-cost perfect matching $M_{j'}$ between $P_{j'}$ and $P_j$ under the ground distance.
- Run the given $\alpha$-approximation algorithm for unconstrained clustering on $P_j$; obtain centers $S_j$.
- For each color $j'$ and each $p \in P_{j'}$, assign $p$ to the center in $S_j$ serving its matched partner under $M_{j'}$.
- Compute the total clustering cost of the combined assignment.
Return the best solution over all pivots $j$.
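The framework above can be sketched end to end in Python. This is an illustrative toy implementation, not the paper's code: the brute-force matching stands in for a real min-cost matching solver, and `vanilla_cluster` is any user-supplied unconstrained clustering routine returning centers and an assignment:

```python
import itertools
import math

def min_cost_matching(X, Y):
    """Brute-force min-cost perfect matching between equal-size point
    lists X and Y (only viable for tiny instances; a real
    implementation would use e.g. the Hungarian algorithm)."""
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(Y))):
        cost = sum(math.dist(X[i], Y[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

def fair_reduction(blocks, vanilla_cluster):
    """Black-box reduction sketch: try every color block as the pivot,
    cluster the pivot with the supplied unconstrained algorithm, match
    every other block onto the pivot, and send each matched point to
    its partner's cluster. Returns the cheapest combined solution as
    (total_cost, centers, labels), where labels maps (block, index)
    pairs to a center index. Blocks must have equal sizes."""
    best = None
    for pivot, P in enumerate(blocks):
        centers, assign = vanilla_cluster(P)  # unconstrained alpha-approx
        total = sum(math.dist(p, centers[assign[i]]) for i, p in enumerate(P))
        labels = {(pivot, i): assign[i] for i in range(len(P))}
        for j, B in enumerate(blocks):
            if j == pivot:
                continue
            _, perm = min_cost_matching(B, P)
            for i, pi in enumerate(perm):
                labels[(j, i)] = assign[pi]  # follow the matching
                total += math.dist(B[i], centers[assign[pi]])
        if best is None or total < best[0]:
            best = (total, centers, labels)
    return best
```

Because matched points inherit their pivot partner's cluster, every cluster receives the same number of points from each color, so exact balance holds by construction.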
Guarantee (Theorem 2.1): If the unconstrained problem has an $\alpha$-approximation running in time $T(n)$, then fair $k$-clustering admits an $(\alpha+2)$-approximation in time $O(\ell \cdot T(n/\ell) + \ell^2 \cdot T_M(n/\ell))$, where $T_M(m)$ is the time for min-cost perfect matching on $m$-point classes under the ground distance.
4. Theoretical Foundations and Proof Technique
Cost and feasibility analysis depend on several key tools:
- Earth Mover's Distance (EMD): Used for matching between color blocks; $\mathrm{EMD}(P_j, P_{j'})$ is the minimum cost of a perfect matching between $P_j$ and $P_{j'}$ under the ground distance.
- Color-block averaging argument: By averaging costs over the candidate pivot blocks $P_1, \dots, P_\ell$, it follows that there exists a block $P_j$ whose aggregate matching cost to the other blocks is at most $2\,\mathrm{OPT}$, where $\mathrm{OPT}$ is the optimal cost for fair $k$-clustering; moreover, the optimal unconstrained clustering cost of $P_j$ is at most $\mathrm{OPT}$.
- Triangle inequality chain: For a point $p$ assigned via the best pivot and matchings, $d(p, \varphi(\mu(p))) \le d(p, \mu(p)) + d(\mu(p), \varphi(\mu(p)))$, where $\mu(p)$ is $p$'s matched partner in the pivot block; summing, the final cost is bounded by the total matching cost plus the approximate clustering cost of the pivot block.
- The reduction is fully constructive due to exhaustive trial over all color classes.
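Schematically, these tools combine into the claimed bound (with $P_j$ the best pivot block identified by the averaging argument):

```latex
\mathrm{cost}(\text{fair solution})
  \;\le\; \underbrace{\textstyle\sum_{j' \neq j} \mathrm{EMD}(P_{j'}, P_j)}_{\le\, 2\,\mathrm{OPT}}
  \;+\; \underbrace{\mathrm{cost}\bigl(\alpha\text{-approx.\ clustering of } P_j\bigr)}_{\le\, \alpha\,\mathrm{OPT}}
  \;\le\; (\alpha + 2)\,\mathrm{OPT}.
```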
5. Computational Complexity, Scope, and Generalization
The reduction involves $O(\ell^2)$ instances of min-cost perfect matching on $n/\ell$ points each, and $\ell$ runs of the unconstrained clustering algorithm on $n/\ell$ points. The overall running time is thus $O(\ell^2 \cdot T_M(n/\ell) + \ell \cdot T(n/\ell))$, with no restriction on $k$ or $\ell$.
This framework works for arbitrary finite metric spaces (using a metric embedding if required), all center-based objectives, and arbitrary cluster counts $k$ and numbers of colors $\ell$.
Special cases:
- For k-center, a simple farthest-first traversal yields a 3-approximate fair solution; the traversal itself runs in $O(nk)$ time.
- When the candidate center set coincides with the point set (i.e., every point is a center), a random-block-sampling routine gives a 2-approximate fair partition in near-linear time.
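For context, the farthest-first traversal underlying the k-center routine can be sketched as follows. This is the classical Gonzalez traversal for the unconstrained problem; the fair variant's block handling is omitted:

```python
import math

def farthest_first(points, k):
    """Gonzalez's farthest-first traversal: repeatedly add the point
    farthest from the current centers. A classical 2-approximation
    for unconstrained k-center, used here as the core subroutine."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        # Each point keeps its distance to the nearest chosen center.
        dist = [min(d, math.dist(p, points[i])) for p, d in zip(points, dist)]
    return centers
```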
Hardness: Even the fair $k$-median or $k$-center problem is APX-hard for $\ell \ge 2$ colors; no PTAS is possible unless P = NP. A constant-factor approximation is therefore the best achievable in general.
6. Experimental Results
Implementation on six real benchmark datasets (Adults, Athletes, Bank, Diabetes, Credit cards, Census1990), with multiple colors and up to hundreds of thousands of points, demonstrates:
- The $(\alpha+2)$-approximation algorithms are at least $10\times$ faster than previous bicriteria methods, while achieving exact fairness.
- Empirical costs lie only modestly (on the order of $5\%$) above the unconstrained k-median baseline, well within the theoretical bounds.
- Fast special-case routines scale to hundreds of thousands of points in tens of seconds.
Empirical takeaways:
- Fairness is achieved strictly (no violation), as opposed to prior algorithms permitting additive violation in cluster composition.
- Solutions maintain high computational scalability for moderate to large datasets.
7. Extensions, Limitations, and Outlook
This reduction establishes a template for fair clustering applicable to any approximation algorithm for unconstrained clustering objectives, offering simplicity, modularity, and constant-factor guarantees in arbitrary finite metric spaces and arbitrary numbers of colors and clusters.
Extensions may target:
- Generalizing the reduction to balance constraints beyond exact equality (e.g., range constraints, proportional representation, or disparate-impact notions).
- Adaptation to streaming or distributed environments via parallelization of the matching and unconstrained clustering phases.
- Empirical study of trade-offs between cluster balance and utility for varied clustering criteria.
This framework resolves the multi-color fair k-clustering approximation barrier: for the first time, true constant-factor guarantees are established for fair k-median, k-means, and k-center under a strict balance model, with broad applicability and strong empirical validation (Böhm et al., 2020).