Fair k-Clustering with Multiple Colors
- Fair k-clustering is a constrained clustering problem that ensures every cluster has equal representation from multiple protected groups under objectives like k-median, k-means, and k-center.
- It employs a black-box reduction that transforms any α-approximation for standard clustering into an (α+2)-approximation for fair clustering, guaranteeing exact balance.
- Experimental results show that the approach scales to large datasets while maintaining near-baseline costs and strict fairness without allowing additive violations.
A fair k-clustering problem is a constrained clustering formulation in which the assignments to clusters must respect the relative balance or representation of multiple “colors” (protected groups), with cost measured under classic objectives such as k-median, k-means, or k-center. The multi-color fair k-clustering problem, as studied in "Fair Clustering with Multiple Colors" (Böhm et al., 2020), formalizes fairness as a requirement that every cluster contain an identical count of each color, supporting arbitrary numbers of colors, clusters, and points. This problem poses unique structural and computational challenges beyond the two-color case, and remained open for true constant-factor approximation until the introduction of a black-box reduction from vanilla clustering objectives. The following sections survey formal definitions, objective functions, the reduction methodology and its theoretical guarantees, key analytic tools, computational complexity, and experimental highlights.
1. Formal Model: Fair k-Clustering with Multiple Colors
The input consists of a point set $P$ of $n$ points in a metric space $(X, d)$ (Euclidean or a general finite metric), partitioned by a coloring function $c: P \to \{1, \dots, \ell\}$ into color classes $P_1, \dots, P_\ell$, with $|P_j| = n/\ell$ for all $j$. The clustering task is to select $k$ clusters, each assigned a center, and assign each point to a center.
Exact balance constraint: For all clusters $C_i$ and all pairs of colors $j, j'$,
$$|C_i \cap P_j| = |C_i \cap P_{j'}|.$$
Consequently, each cluster size $|C_i|$ must be a multiple of $\ell$, and all color classes must distribute evenly among the $k$ clusters.
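To make the constraint concrete, here is a minimal validity check in Python (a sketch; the function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def is_exactly_balanced(clusters, colors):
    """Check the exact-balance constraint: within every cluster, every
    color appears with one shared multiplicity. `clusters` maps point
    ids to cluster ids; `colors` maps point ids to colors."""
    per_cluster = {}
    for p, cl in clusters.items():
        per_cluster.setdefault(cl, Counter())[colors[p]] += 1
    all_colors = set(colors.values())
    return all(
        set(counts) == all_colors and len(set(counts.values())) == 1
        for counts in per_cluster.values()
    )
```

For example, two clusters each containing one red and one blue point pass the check, while a cluster containing two reds and no blues fails.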
2. Clustering Objectives under Fairness
The fair k-clustering problem admits several center-based objectives, unified as
$$\mathrm{cost}(S, \varphi) = \sum_{p \in P} d(p, \varphi(p))^z,$$
with $S$ the set of cluster centers and $\varphi: P \to S$ the assignment. Special cases include:
- k-median: $z = 1$ (sum of distances)
- k-means: $z = 2$ (sum of squared distances)
- k-center: $\max_{p \in P} d(p, \varphi(p))$, the limiting case of the normalized objective as $z \to \infty$
All are subject to the exact-balance constraint per cluster.
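The three objectives can be evaluated on an explicit assignment with a small helper; a sketch in Python (names are illustrative):

```python
import math

def clustering_cost(points, centers, assignment, objective="k-median"):
    """Evaluate a center-based objective on an explicit assignment.
    `assignment[i]` is the index of the center serving points[i]."""
    dists = [math.dist(p, centers[assignment[i]]) for i, p in enumerate(points)]
    if objective == "k-median":
        return sum(dists)                 # sum of distances
    if objective == "k-means":
        return sum(d * d for d in dists)  # sum of squared distances
    if objective == "k-center":
        return max(dists)                 # maximum distance
    raise ValueError(f"unknown objective: {objective}")
```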
3. Black-Box Reduction from Unconstrained to Fair Clustering
The key contribution of (Böhm et al., 2020) is an algorithmic reduction that transforms any $\alpha$-approximation algorithm for the unconstrained (vanilla) $k$-clustering problem into an $(\alpha+2)$-approximation for the fair $k$-clustering problem with $\ell$ colors. This is the first such reduction to achieve a true constant factor for all three objectives.
Algorithmic Framework (paraphrased from Algorithm 1):
For each color $j \in \{1, \dots, \ell\}$, treated as the pivot block:
- For all $j' \neq j$, compute a min-cost perfect matching $M_{j'}$ between $P_{j'}$ and $P_j$ under the ground distance.
- Run the given $\alpha$-approximation algorithm for unconstrained clustering on $P_j$; obtain centers $S_j$.
- For each color $j'$ and each $p \in P_{j'}$, assign $p$ to the center in $S_j$ serving its matched partner under $M_{j'}$.
- Compute the total clustering cost of the combined assignment.
Return the best solution over all pivots $j$.
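The framework above can be sketched end to end in Python. This is an illustrative toy implementation, not the paper's code: the brute-force matching stands in for a real min-cost matching solver, and `vanilla_cluster` is any user-supplied unconstrained clustering routine returning centers and an assignment:

```python
import itertools
import math

def min_cost_matching(X, Y):
    """Brute-force min-cost perfect matching between equal-size point
    lists X and Y (only viable for tiny instances; a real
    implementation would use e.g. the Hungarian algorithm)."""
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(Y))):
        cost = sum(math.dist(X[i], Y[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

def fair_reduction(blocks, vanilla_cluster):
    """Black-box reduction sketch: try every color block as the pivot,
    cluster the pivot with the supplied unconstrained algorithm, match
    every other block onto the pivot, and send each matched point to
    its partner's cluster. Returns the cheapest combined solution as
    (total_cost, centers, labels), where labels maps (block, index)
    pairs to a center index. Blocks must have equal sizes."""
    best = None
    for pivot, P in enumerate(blocks):
        centers, assign = vanilla_cluster(P)  # unconstrained alpha-approx
        total = sum(math.dist(p, centers[assign[i]]) for i, p in enumerate(P))
        labels = {(pivot, i): assign[i] for i in range(len(P))}
        for j, B in enumerate(blocks):
            if j == pivot:
                continue
            _, perm = min_cost_matching(B, P)
            for i, pi in enumerate(perm):
                labels[(j, i)] = assign[pi]  # follow the matching
                total += math.dist(B[i], centers[assign[pi]])
        if best is None or total < best[0]:
            best = (total, centers, labels)
    return best
```

Because matched points inherit their pivot partner's cluster, every cluster receives the same number of points from each color, so exact balance holds by construction.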
Guarantee (Theorem 2.1): If the unconstrained problem has an $\alpha$-approximation running in time $T(n)$, then fair $k$-clustering admits an $(\alpha+2)$-approximation in time $O(\ell \cdot T(n/\ell) + \ell^2 \cdot T_M(n/\ell))$, where $T_M(m)$ is the time for min-cost perfect matching on $m$-point classes under the ground distance.
4. Theoretical Foundations and Proof Technique
Cost and feasibility analysis depend on several key tools:
- Earth Mover's Distance (EMD): Used for matching between color blocks; $\mathrm{EMD}(P_j, P_{j'})$ is the minimum cost of a perfect matching between $P_j$ and $P_{j'}$ under the ground distance.
- Color-block averaging argument: By averaging costs over the candidate pivot blocks $P_1, \dots, P_\ell$, it follows that there exists a block $P_j$ whose aggregate matching cost to the other blocks is at most $2\,\mathrm{OPT}$, where $\mathrm{OPT}$ is the optimal cost for fair $k$-clustering; moreover, the optimal unconstrained clustering cost of $P_j$ is at most $\mathrm{OPT}$.
- Triangle inequality chain: For a point $p$ assigned via the best pivot and matchings, $d(p, \varphi(\mu(p))) \le d(p, \mu(p)) + d(\mu(p), \varphi(\mu(p)))$, where $\mu(p)$ is $p$'s matched partner in the pivot block; summing, the final cost is bounded by the total matching cost plus the approximate clustering cost of the pivot block.
- The reduction is fully constructive due to exhaustive trial over all color classes.
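Schematically, these tools combine into the claimed bound (with $P_j$ the best pivot block identified by the averaging argument):

```latex
\mathrm{cost}(\text{fair solution})
  \;\le\; \underbrace{\textstyle\sum_{j' \neq j} \mathrm{EMD}(P_{j'}, P_j)}_{\le\, 2\,\mathrm{OPT}}
  \;+\; \underbrace{\mathrm{cost}\bigl(\alpha\text{-approx.\ clustering of } P_j\bigr)}_{\le\, \alpha\,\mathrm{OPT}}
  \;\le\; (\alpha + 2)\,\mathrm{OPT}.
```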
5. Computational Complexity, Scope, and Generalization
The reduction involves $O(\ell^2)$ instances of min-cost perfect matching on $n/\ell$ points each, and $\ell$ runs of the unconstrained clustering algorithm on $n/\ell$ points. The overall running time is thus $O(\ell^2 \cdot T_M(n/\ell) + \ell \cdot T(n/\ell))$, with no restriction on $k$ or $\ell$.
This framework works for arbitrary finite metric spaces (using a metric embedding if required), all center-based objectives, and arbitrary cluster counts $k$ and numbers of colors $\ell$.
Special cases:
- For k-center, a simple farthest-first traversal yields a 3-approximate fair solution; the traversal itself runs in $O(nk)$ time.
- When the candidate center set coincides with the point set (i.e., every point is a center), a random-block-sampling routine gives a 2-approximate fair partition in near-linear time.
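For context, the farthest-first traversal underlying the k-center routine can be sketched as follows. This is the classical Gonzalez traversal for the unconstrained problem; the fair variant's block handling is omitted:

```python
import math

def farthest_first(points, k):
    """Gonzalez's farthest-first traversal: repeatedly add the point
    farthest from the current centers. A classical 2-approximation
    for unconstrained k-center, used here as the core subroutine."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        # Each point keeps its distance to the nearest chosen center.
        dist = [min(d, math.dist(p, points[i])) for p, d in zip(points, dist)]
    return centers
```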
Hardness: Even the fair $k$-median or $k$-center problem is APX-hard for $\ell \ge 2$ colors; no PTAS is possible unless P = NP. A constant-factor approximation is therefore the best achievable in general.
6. Experimental Results
Implementation on six real benchmark datasets (Adults, Athletes, Bank, Diabetes, Credit cards, Census1990), with multiple colors and up to hundreds of thousands of points, demonstrates:
- The $(\alpha+2)$-approximation algorithms are at least $10\times$ faster than previous bicriteria methods, while achieving exact fairness.
- Empirical costs lie only modestly (on the order of $5\%$) above the unconstrained k-median baseline, well within the theoretical bounds.
- Fast special-case routines scale to hundreds of thousands of points in tens of seconds.
Empirical takeaways:
- Fairness is achieved strictly (no violation), as opposed to prior algorithms permitting additive violation in cluster composition.
- Solutions maintain high computational scalability for moderate to large datasets.
7. Extensions, Limitations, and Outlook
This reduction establishes a template for fair clustering applicable to any approximation algorithm for unconstrained clustering objectives, offering simplicity, modularity, and constant-factor guarantees in arbitrary finite metric spaces and arbitrary numbers of colors and clusters.
Extensions may target:
- Generalizing the reduction to balance constraints beyond exact equality (e.g., range constraints, proportional representation, or disparate-impact notions).
- Adaptation to streaming or distributed environments via parallelization of the matching and unconstrained clustering phases.
- Empirical study of trade-offs between cluster balance and utility for varied clustering criteria.
This framework resolves the multi-color fair k-clustering approximation barrier: for the first time, true constant-factor guarantees are established for fair k-median, k-means, and k-center under a strict balance model, with broad applicability and strong empirical validation (Böhm et al., 2020).