
Epsilon-Biased Binning Problem

Updated 29 September 2025
  • Epsilon-biased binning is a discretization method that partitions data into near-equal-size bins while ensuring each bin's group ratios deviate from the dataset-wide ratios by at most a specified threshold ε.
  • The framework provides an exact dynamic programming algorithm with O(n²k) complexity, alongside scalable divide-and-conquer and local search heuristics that achieve near-optimal performance under realistic fairness constraints.
  • Practical applications include fair data sharing and machine learning pipelines, where controlled deviations mitigate bias amplification while balancing computational efficiency.

The epsilon-biased binning problem formalizes the principle that commonly used discretization procedures, such as equal-size (equal-frequency) binning, can cause substantial group imbalance across buckets, amplifying downstream unfairness. The ε-biased binning problem seeks to find a bucketization of a single attribute that is as close as possible to being equally sized, while guaranteeing that the group ratios in all bins deviate from full parity by at most a user-specified threshold ε. Solutions presented in the literature include exact dynamic programming, scalable divide-and-conquer algorithms, and local search refinements. This framework directly targets the practical need for fair discretizations in data sharing and machine learning pipelines, where perfect fairness may not be achievable but small controlled deviations are acceptable.

1. Mathematical Formulation and Fairness Criteria

Let $D$ be a dataset with $n$ tuples, each assigned to one of $r$ demographic groups $G_1, G_2, \ldots, G_r$. Consider a sorted attribute to be discretized into $k$ bins, denoted $\mathcal{B} = (B_1, \dots, B_k)$. For each group $G_\ell$, the group proportion in bin $B_j$ is $|B_j \cap G_\ell| / |B_j|$, and the overall group proportion in the dataset is $|G_\ell| / |D|$.

The bucket-wise group bias is defined as
$$\beta_D(B_j, G_\ell) = \left| \frac{|B_j \cap G_\ell|}{|B_j|} - \frac{|G_\ell|}{|D|} \right|.$$
The overall binning bias is the maximum group-wise, bin-wise deviation:
$$\beta_D(\mathcal{B}) = \max_\ell \max_j \beta_D(B_j, G_\ell).$$
An ε-biased binning is one where $\beta_D(\mathcal{B}) \leq \varepsilon$ for a user-specified $\varepsilon > 0$. The goal is to find a binning, subject to this constraint, that minimizes the price of fairness, typically expressed as the difference between the largest and smallest bin sizes.
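To make the definitions concrete, the following is a minimal Python sketch (function names are illustrative, not from the paper) that computes $\beta_D(\mathcal{B})$ for a binning represented as a list of buckets of group labels:

```python
from collections import Counter

def binning_bias(bins):
    """Overall binning bias beta_D(B): the maximum, over all groups and
    all bins, of |bin group ratio - dataset group ratio|.

    `bins` is a list of buckets; each bucket is a list of group labels,
    one per tuple that falls in that bucket.
    """
    data = [g for bucket in bins for g in bucket]
    n = len(data)
    overall = Counter(data)  # |G_l| for each group
    worst = 0.0
    for bucket in bins:
        counts = Counter(bucket)
        for g, count_g in overall.items():
            deviation = abs(counts.get(g, 0) / len(bucket) - count_g / n)
            worst = max(worst, deviation)
    return worst

def is_eps_biased(bins, eps):
    """Check whether a binning satisfies the eps-bias constraint."""
    return binning_bias(bins) <= eps
```

For example, two buckets each containing two members of group `a` and two of group `b` have zero bias, while a 3-to-1 split in a balanced dataset has bias 0.25.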

2. Dynamic Programming Approach

Dynamic programming (DP) gives an exact polynomial-time algorithm for the ε-biased binning problem but has quadratic complexity in dataset size. The approach consists of:

  • Feasibility Preprocessing: For every possible segment $[i+1 : j]$ in the sorted data, check whether the corresponding bucket is ε-biased:

T[i,j]={1if max{ti+1:j}GjiGnε 0otherwiseT[i, j] = \begin{cases} 1 & \text{if}~\max_\ell \left| \frac{| \{ t_{i+1:j} \} \cap G_\ell |}{j-i} - \frac{|G_\ell|}{n} \right| \leq \varepsilon \ 0 & \text{otherwise} \end{cases}

This step builds an $O(n^2)$ feasibility table.

  • Recursive DP Recurrence: Define $\text{OPT}(j, \kappa)$ as the best achievable (minimum width-range) partitioning of the first $j$ tuples into $\kappa$ bins. For each $i < j$ with $T[i, j] = 1$, consider:

$$w_i^\uparrow = \max(w_{\text{prev}}^\uparrow,\, j-i), \qquad w_i^\downarrow = \min(w_{\text{prev}}^\downarrow,\, j-i)$$

and select $i^\star = \operatorname{argmin}_{i:\, T[i,j]=1}\,(w_i^\uparrow - w_i^\downarrow)$. The final answer is $\text{OPT}(n, k)$.

  • Complexity: The overall running time is $O(n^2 k)$ with $O(n^2)$ space, making this approach intractable for large datasets.
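The steps above can be sketched in Python. This is an illustrative implementation of the recurrence under stated assumptions: the function name, the predecessor bookkeeping, and the inlined feasibility check are choices made for the example, not the paper's code.

```python
from collections import Counter

def eps_biased_dp(labels, k, eps):
    """Exact DP sketch for eps-biased binning over sorted tuples.

    labels: group label of each tuple, in sorted-attribute order.
    Returns (width_spread, bin_boundaries) or None if infeasible.
    State OPT[kappa][j] stores the best (spread, w_max, w_min, predecessor)
    for splitting the first j tuples into kappa eps-biased bins.
    """
    n = len(labels)
    total = Counter(labels)
    groups = list(total)
    # prefix[g][i] = count of group g among the first i tuples
    prefix = {g: [0] * (n + 1) for g in groups}
    for idx, lab in enumerate(labels):
        for g in groups:
            prefix[g][idx + 1] = prefix[g][idx] + (lab == g)

    def feasible(i, j):  # is the bucket of tuples i+1..j eps-biased?
        size = j - i
        return all(
            abs((prefix[g][j] - prefix[g][i]) / size - total[g] / n) <= eps
            for g in groups
        )

    INF = float("inf")
    OPT = [[None] * (n + 1) for _ in range(k + 1)]
    OPT[0][0] = (0, 0, INF, None)
    for kappa in range(1, k + 1):
        for j in range(1, n + 1):
            best = None
            for i in range(j):
                if OPT[kappa - 1][i] is None or not feasible(i, j):
                    continue
                _, w_up, w_dn, _ = OPT[kappa - 1][i]
                w_up2 = max(w_up, j - i)       # widest bin so far
                w_dn2 = min(w_dn, j - i)       # narrowest bin so far
                cand = (w_up2 - w_dn2, w_up2, w_dn2, i)
                if best is None or cand[0] < best[0]:
                    best = cand
            OPT[kappa][j] = best
    if OPT[k][n] is None:
        return None
    # Recover bin boundaries by walking predecessors back from (k, n).
    bounds, j = [], n
    for kappa in range(k, 0, -1):
        bounds.append(j)
        j = OPT[kappa][j][3]
    return OPT[k][n][0], bounds[::-1]
```

On a perfectly alternating two-group sequence of eight tuples with $k=2$ and $\varepsilon = 0$, this returns a spread of 0 with the split at the midpoint; on a fully segregated sequence it reports infeasibility.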

3. Efficient Divide-and-Conquer Strategy

The divide-and-conquer (DnC) algorithm provides a highly scalable, near-linear-time alternative, yielding feasible but not necessarily optimal ε-biased binnings.

  • Initial Boundaries: Begin at positions close to those of the equal-size bins: for dividing a segment $[l, h]$ into $\kappa$ buckets, the initial candidate split point is $i = l + \lceil (\kappa/2) \cdot (h-l)/\kappa \rceil$.
  • Feasibility Search: Around the ideal split point, perform a local search for valid boundaries that satisfy the ε-bias constraint for both halves. If none found within a local window, declare infeasibility.
  • Recursion: Recursively apply the same procedure to each subinterval, yielding overall recursion depth $O(\log k)$ and linear work per recursion level.
  • Complexity: $O(n \log k)$ worst-case, and the procedure always produces a solution if one exists.

Key advantages: Near-linear scalability, automatic “filling” of buckets closest to equal sizes while respecting group bias tolerance. Main limitation: does not guarantee the minimum possible difference between largest and smallest bin.
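A Python sketch of this strategy follows. The window size, the symmetric search order around the ideal split, and the split-then-recurse structure are illustrative assumptions; the paper's boundary-search details may differ.

```python
from collections import Counter

def eps_biased_dnc(labels, k, eps, window=None):
    """Divide-and-conquer sketch: cut near the equal-size boundary,
    searching a local window for a split such that both halves can be
    recursively binned. Returns bin boundaries or None on failure.
    """
    n = len(labels)
    total = Counter(labels)
    groups = list(total)
    prefix = {g: [0] * (n + 1) for g in groups}
    for idx, lab in enumerate(labels):
        for g in groups:
            prefix[g][idx + 1] = prefix[g][idx] + (lab == g)

    def feasible(i, j):  # is the bucket of tuples i+1..j eps-biased?
        size = j - i
        return size > 0 and all(
            abs((prefix[g][j] - prefix[g][i]) / size - total[g] / n) <= eps
            for g in groups
        )

    win = window if window is not None else max(1, n // (4 * k))

    def solve(l, h, kappa):
        if kappa == 1:
            return [h] if feasible(l, h) else None
        k_left = kappa // 2
        ideal = l + round(k_left * (h - l) / kappa)
        # Try cut points in increasing distance from the ideal split.
        for d in range(win + 1):
            for m in {ideal - d, ideal + d}:
                if l < m < h:
                    left = solve(l, m, k_left)
                    if left is None:
                        continue
                    right = solve(m, h, kappa - k_left)
                    if right is not None:
                        return left + right
        return None  # no valid boundary in the local window

    return solve(0, n, k)
```

Because each level only probes a small window around the ideal equal-size cut, the boundaries it finds stay close to equal-frequency positions while respecting the bias tolerance.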

4. Local Search Refinement

To address suboptimality of divide-and-conquer solutions, the local search (LS) heuristic is layered on top:

  • Neighborhood Search: Given the DnC solution with maximum width difference $w_{dnc}$, restrict attention to candidate boundaries inside windows of width $w_{dnc}$ centered at the equal-size locations.
  • Combinatorial Search: Evaluate the objective for various local perturbations of boundaries, iterating over Cartesian products of candidate indices for boundary locations.
  • Empirical Performance: LS typically converges rapidly, as $w_{dnc}$ is modest in most practical settings and the solution space is sharply constrained by feasibility.
  • Trade-off: Worst-case scaling is exponential in $k$ (the number of bins), but the method is observed to be efficient in common real-world tasks.
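The window-based search can be sketched as follows. The window construction around the equal-size positions and the `feasible` callback interface are assumptions made for the example, not the paper's exact procedure.

```python
from itertools import product

def local_search_refine(n, k, init_bounds, feasible):
    """Local-search sketch: enumerate interior-boundary combinations inside
    windows of width w_dnc centered at the equal-size positions, keeping the
    feasible combination with the smallest max-min bin spread.

    n: number of tuples; init_bounds: boundaries from an initial (e.g. DnC)
    solution; feasible(i, j): True iff the bucket of tuples i+1..j is
    eps-biased.
    """
    widths = [b - a for a, b in zip([0] + init_bounds[:-1], init_bounds)]
    w_dnc = max(widths) - min(widths)
    half = max(1, w_dnc)
    ideals = [round(j * n / k) for j in range(1, k)]  # equal-size positions
    windows = [
        range(max(1, c - half), min(n, c + half) + 1) for c in ideals
    ]
    best_spread, best = w_dnc, init_bounds
    for interior in product(*windows):  # Cartesian product of candidates
        bounds = list(interior) + [n]
        prevs = [0] + bounds[:-1]
        if any(b <= a for a, b in zip(prevs, bounds)):
            continue  # boundaries must be strictly increasing
        if not all(feasible(a, b) for a, b in zip(prevs, bounds)):
            continue
        sizes = [b - a for a, b in zip(prevs, bounds)]
        spread = max(sizes) - min(sizes)
        if spread < best_spread:
            best_spread, best = spread, bounds
    return best_spread, best
```

Starting from a suboptimal split such as `[2, 8]` on a balanced eight-tuple sequence, the search recovers the even split `[4, 8]` with zero spread, illustrating how a modest $w_{dnc}$ keeps the candidate space small.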

5. Comparison of Solution Techniques

| Method | Optimality | Time Complexity | Scalability |
|---|---|---|---|
| Dynamic Programming | Optimal | $O(n^2 k)$ | Low |
| Divide-and-Conquer | Near-optimal | $O(n \log k)$ | High |
| Local Search | Near-optimal / improved | Practically fast (depends on $k$ and $w_{dnc}$) | Medium–High |

The DP approach always returns the best solution (smallest fairness price), but its high time and space requirements preclude use on large datasets. Divide-and-conquer is scalable, always yields a feasible solution if one exists, and provides a high-quality upper bound (in terms of fairness price). Local search can further refine this solution, often achieving optimality in practice, and is orders of magnitude faster than DP except potentially for very large kk.

6. Practical Implications and Limitations

  • Adjustable Tolerance: By setting ε, practitioners control the maximum allowed disparity in group ratios per bucket, striking a balance between fairness constraints and granularity of the discretization.
  • Scalability: The DnC and LS procedures enable $\varepsilon$-biased binning for very large datasets, where quadratic dynamic programming would be infeasible.
  • Price of Fairness: Allowing small bias ε can dramatically reduce the “cost” in terms of deviation from equal bucket sizes compared to enforcing exact parity, making ε-biased binning attractive in settings where strict group equality is not attainable.
  • No Solution Cases: In practice, for sufficiently stringent ε (e.g., $\varepsilon \ll 1/n$), and especially when group value distributions differ significantly, it may be impossible to find any feasible binning; all methods degrade gracefully in this event, quickly reporting infeasibility.
  • Objective Function: All methods focus on minimizing $\max_j |B_j| - \min_j |B_j|$ among all ε-biased binnings, directly quantifying the fairness/granularity trade-off.

7. Summary and Broader Impact

The epsilon-biased binning problem (Asudeh et al., 26 Sep 2025) offers a rigorous, tunable mechanism for constructing fair attribute discretizations, directly addressing bias amplification introduced by naive bucketization schemes. The proposed exact and scalable algorithms enable practitioners to efficiently construct near-equal-sized bins while provably limiting disparities in group representation to a user-specified threshold. Empirical results indicate that with realistic tolerance levels, the local search approach yields solutions with near-zero price of fairness at massively reduced computational cost relative to prior exhaustive techniques.

This framework is relevant for any pipeline where feature binning is performed in the presence of sensitive attribute imbalance, including fair machine learning, data publication, and pre-processing for equitable downstream analytics. Future work may investigate extensions to multivariate and non-numeric binning, dynamic data environments, or integrating further fairness notions beyond simple group parity constraints.
