
Fair k-Clustering with Multiple Colors

Updated 17 November 2025
  • Fair k-clustering is a constrained clustering problem that ensures every cluster has equal representation from multiple protected groups under objectives like k-median, k-means, and k-center.
  • It employs a black-box reduction that transforms any α-approximation for standard clustering into an (α+2)-approximation for fair clustering, guaranteeing exact balance.
  • Experimental results show that the approach scales to large datasets while maintaining near-baseline costs and strict fairness without allowing additive violations.

A fair k-clustering problem is a constrained clustering formulation in which the assignments to clusters must respect the relative balance or representation of multiple “colors” (protected groups), with cost measured under classic objectives such as k-median, k-means, or k-center. The multi-color fair k-clustering problem, as studied in "Fair Clustering with Multiple Colors" (Böhm et al., 2020), formalizes fairness as a requirement that every cluster contain an identical count of each color, supporting arbitrary numbers of colors, clusters, and points. This problem poses unique structural and computational challenges beyond the two-color case, and remained open for true constant-factor approximation until the introduction of a black-box reduction from vanilla clustering objectives. The following sections survey formal definitions, objective functions, the reduction methodology and its theoretical guarantees, key analytic tools, computational complexity, and experimental highlights.

1. Formal Model: Fair k-Clustering with Multiple Colors

The input consists of a point set $A$ of size $nr$ in $\mathbb{R}^d$ (or a general metric space), partitioned by a coloring function $c: A \to [r]$ into $r$ color classes $A^{(1)},\ldots,A^{(r)}$, with $|A^{(i)}| = n$ for all $i$. The clustering task is to select $k$ clusters, each assigned a center, and assign each point to a center.

Exact balance constraint: For all clusters $C_j$ and colors $i, i' \in [r]$,

$$|C_j \cap A^{(i)}| = |C_j \cap A^{(i')}|\,.$$

Consequently, each $|C_j|$ must be a multiple of $r$, and each $A^{(i)}$ must distribute evenly among clusters.
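As a concrete check, the exact-balance constraint can be verified directly from the cluster and color labels. A minimal sketch (the helper name `is_exactly_balanced` and the label encoding are hypothetical, not from the paper):

```python
from collections import Counter

def is_exactly_balanced(assignment, colors):
    """Check the exact-balance constraint: every cluster must contain
    the same number of points of every color.

    assignment: list of cluster ids, one per point.
    colors:     list of color ids, one per point.
    """
    per_cluster = {}
    for cluster, color in zip(assignment, colors):
        per_cluster.setdefault(cluster, Counter())[color] += 1
    all_colors = set(colors)
    for counts in per_cluster.values():
        # every color must be present, all with an identical count
        if set(counts) != all_colors or len(set(counts.values())) != 1:
            return False
    return True

# A balanced 2-cluster, 2-color instance, then an unbalanced one:
print(is_exactly_balanced([0, 0, 1, 1], ["r", "b", "r", "b"]))  # True
print(is_exactly_balanced([0, 0, 1, 1], ["r", "r", "b", "b"]))  # False
```

Note that the check also rejects clusters missing a color entirely, which is implied by the constraint since every $A^{(i)}$ must distribute over all clusters.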

2. Clustering Objectives under Fairness

The fair k-clustering problem admits several center-based objectives, unified as:

$$\min_{Z,\,\phi}\;\|A - Z\|_{p,q}, \qquad \|A-Z\|_{p,q} = \Bigl(\sum_{x\in A}\|x-\phi(x)\|_q^p\Bigr)^{1/p}$$

with $Z$ the set of cluster centers and $\phi: A \to Z$ the assignment. Special cases include:

  • k-median: $p=1$, $q=1$
  • k-means: $p=2$, $q=2$
  • k-center: $p=\infty$, $q$ arbitrary

All subject to the exact-balance constraint per cluster.
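The unified $(p,q)$ objective above can be evaluated directly. A sketch (the helper name `clustering_cost` is hypothetical; NumPy arrays are assumed):

```python
import numpy as np

def clustering_cost(points, centers, assignment, p=1, q=1):
    """Evaluate ||A - Z||_{p,q}: the l_p norm, over points, of each
    point's l_q distance to its assigned center.

    p = q = 1 recovers k-median, p = q = 2 the (rooted) k-means form
    given in the text, and p = inf the k-center objective.
    """
    dists = np.array([
        np.linalg.norm(x - centers[j], ord=q)   # ground l_q distance
        for x, j in zip(points, assignment)
    ])
    if np.isinf(p):
        return float(dists.max())               # k-center: worst point
    return float((dists ** p).sum() ** (1.0 / p))

pts = np.array([[0.0, 0.0], [2.0, 0.0]])
ctrs = np.array([[0.0, 0.0], [1.0, 0.0]])
print(clustering_cost(pts, ctrs, [0, 1], p=1, q=1))       # k-median cost: 1.0
print(clustering_cost(pts, ctrs, [0, 1], p=np.inf, q=2))  # k-center cost: 1.0
```

Note that under this unified form the k-means cost is the square root of the usual sum of squared distances; minimizers are the same either way.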

3. Black-Box Reduction from Unconstrained to Fair Clustering

The key contribution of (Böhm et al., 2020) is an algorithmic reduction that transforms any $\alpha$-approximation algorithm for the unconstrained (vanilla) $(k,p,q)$-clustering problem into an $(\alpha+2)$-approximation for the fair $k$-clustering problem with $r$ colors. This is the first such reduction to achieve a true constant factor for all three objectives.

Algorithmic Framework (paraphrased from Algorithm 1):

For each color $i \in [r]$:

  1. For all $j \in [r]$, compute a min-cost perfect matching $\pi_{i,j}: A^{(i)} \to A^{(j)}$ under the $(p,q)$ cost metric.
  2. Run the given $\alpha$-approximation algorithm for unconstrained $(k,p,q)$-clustering on $A^{(i)}$; obtain centers $Z^{(i)}$.
  3. For each color $j \neq i$ and each $x \in A^{(j)}$, assign $x$ to the center in $Z^{(i)}$ that serves its matched point in $A^{(i)}$ under $\pi_{i,j}$.
  4. Compute the total clustering cost of the combined assignment.

Return the best solution over all $i \in [r]$.
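Assuming a pluggable unconstrained solver, the loop can be sketched as follows (the names `fair_reduction` and `vanilla` are hypothetical; Euclidean k-median costs, i.e. $p=1$, $q=2$, are used for concreteness):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def fair_reduction(blocks, vanilla, k):
    """Sketch of the black-box reduction.

    blocks:  list of r color classes, each an equal-size (n, d) array.
    vanilla: assumed unconstrained solver, vanilla(points, k) -> (k, d) centers.
    """
    best_cost, best_pivot = np.inf, None
    for i, pivot in enumerate(blocks):
        centers = vanilla(pivot, k)                       # step 2
        pivot_assign = cdist(pivot, centers).argmin(axis=1)
        cost = cdist(pivot, centers).min(axis=1).sum()
        for j, other in enumerate(blocks):
            if j == i:
                continue
            # step 1: min-cost perfect matching between pivot and block j
            rows, cols = linear_sum_assignment(cdist(pivot, other))
            # step 3: other[cols[t]] joins the cluster of its partner pivot[rows[t]]
            matched = centers[pivot_assign[rows]]
            cost += np.linalg.norm(other[cols] - matched, axis=1).sum()
        if cost < best_cost:                              # step 4 / final selection
            best_cost, best_pivot = cost, i
    return best_cost, best_pivot
```

Any approximate vanilla solver can be dropped in for `vanilla`. Because each non-pivot block is perfectly matched to the pivot, every pivot point carries exactly one partner of each other color into its cluster, which is what enforces exact balance.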

Guarantee (Theorem 2.1): If the unconstrained problem admits an $\alpha$-approximation running in time $T$, then fair $k$-clustering admits an $(\alpha+2)$-approximation in time $O(r^2 \cdot \mathrm{MCPM}(n) + r \cdot T)$, where $\mathrm{MCPM}(n)$ is the time for min-cost perfect matching under the ground $\ell_q$ distance.

4. Theoretical Foundations and Proof Technique

Cost and feasibility analysis depend on several key tools:

  • Earth Mover's Distance ($\mathrm{EMD}_{p,q}$): Used for matching between color blocks; $\mathrm{EMD}_{p,q}(X,Y)$ is the minimum cost of a perfect matching between $X$ and $Y$ under the $\ell_q$ ground distance.
  • Color-block averaging argument: By averaging costs over pivot blocks $A^{(i)}$, it follows that there exists a block for which the aggregate matching cost to the other blocks plus the unconstrained cost is at most $2 \cdot OPT_k$, where $OPT_k$ is the optimal cost of fair $k$-clustering.
  • Triangle inequality chain: The final cost of $A$ assigned via the best pivot and matchings is bounded by

$$\|A-Z\|_{p,q} \le 2\,OPT_k + \alpha\,OPT_k = (\alpha+2)\,OPT_k.$$

  • The reduction is fully constructive due to exhaustive trial over all color classes.

5. Computational Complexity, Scope, and Generalization

The reduction involves $O(r^2)$ instances of min-cost perfect matching on $2n$ points, and $r$ runs of the unconstrained clustering algorithm. The overall running time is thus $O(r^2\,\mathrm{MCPM}(n) + r\,T(n,d,k,p,q))$, with no restriction on $k$ or $r$.

This framework works for arbitrary finite metric spaces (using an $\ell_\infty$ embedding if required), all center-based objectives, and arbitrary numbers of clusters and colors.

Special cases:

  • For k-center, a simple farthest-first traversal yields a 3-approximate fair solution in $O(nkd)$ time.
  • When $k=n$ (i.e., every point is a center), a random-block-sampling routine gives a 2-approximate fair partition in $O(n\log(1/\delta))$ time.
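For intuition, the farthest-first traversal itself, the classical Gonzalez subroutine that the fair k-center special case builds on, can be sketched as below; the paper's fairness wrapper is omitted here:

```python
import numpy as np

def farthest_first(points, k, seed=0):
    """Classical farthest-first (Gonzalez) traversal: repeatedly pick the
    point farthest from all centers chosen so far. Returns k center
    indices in O(nkd) time; the fair wrapper from the paper is omitted."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(points)))]
    dists = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())          # farthest point from current centers
        centers.append(nxt)
        # keep, for each point, its distance to the nearest chosen center
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return centers
```

On two well-separated groups, the traversal always places its two centers in different groups, whichever point the random start picks.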

Hardness: Even the fair $n$-median or $n$-center problem is APX-hard for $r \ge 3$; no PTAS is possible. A constant-factor approximation is therefore the best possible in general.

6. Experimental Results

Implementation on six real benchmark datasets (Adults, Athletes, Bank, Diabetes, Credit cards, Census1990), with up to $r=8$ colors and $450{,}000$ points, demonstrates:

  • $(\alpha+2)$-approximation algorithms are 10–100× faster than previous bicriteria methods, while guaranteeing exact fairness.
  • Empirical costs lie 5–20% above the unconstrained k-median baseline, closely matching the theoretical bounds.
  • Fast special-case routines scale to hundreds of thousands of points in tens of seconds.

Empirical takeaways:

  • Fairness is achieved strictly (no violation), in contrast to prior algorithms that permit additive violations of cluster composition.
  • Solutions maintain high computational scalability for moderate to large datasets.

7. Extensions, Limitations, and Outlook

This reduction establishes a template for fair clustering applicable to any approximation algorithm for unconstrained clustering objectives, offering simplicity, modularity, and constant-factor guarantees in arbitrary finite metric spaces and arbitrary numbers of colors and clusters.

Extensions may target:

  • Generalizing the reduction to balance constraints beyond exact equality (e.g., range constraints, proportional, or disparate impact).
  • Adaptation to streaming or distributed environments via parallelization of the matching and unconstrained clustering phase.
  • Empirical study of trade-offs between cluster balance and utility for varied clustering criteria.

This framework resolves the multi-color fair k-clustering approximation barrier: for the first time, true constant-factor guarantees are established for fair k-median, k-means, and k-center under a strict balance model, with broad applicability and strong empirical validation (Böhm et al., 2020).
