Overlap Number of Balls (ONB) Analysis
- ONB is a geometric and combinatorial metric that measures class overlap by counting the minimum number of non-overlapping, class-pure hyperspheres needed to cover each class.
- It employs a safe-radius approach and a greedy ball-covering algorithm to capture local point density, boundary complexity, and data separation nuances.
- ONB values correlate with classification difficulty: high overlap produces many small balls and signals greater challenges for instance-based learning.
The Overlap Number of Balls (ONB) quantifies the geometric and combinatorial overlap between distinct classes or groups within a set, whether in statistical learning, combinatorial allocation, or urn models. It measures the minimal number of non-overlapping, class-pure balls (hyperspheres) required to cover each class such that no ball contains points from more than one class. The ONB framework is used in data complexity analysis, overlap quantification, and applied probability, with variations tailored to classification geometry, the probability of collisions in random allocations, and partition thresholds in combinatorial settings (Pascual-Triana et al., 2024; Pascual-Triana et al., 2020; Gouet et al., 2019; Czabarka et al., 2012).
1. Mathematical Definitions and Algorithmic Construction
The core ONB construction begins with a labeled dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $C$ classes. For a given point $x_i$ of class $c$, the "safe-radius" is defined as the minimal distance to any point of a different class: $r_i = \min_{j : y_j \neq c} d(x_i, x_j)$. The corresponding ball $B(x_i, r_i)$ contains the points at distance strictly less than $r_i$ from $x_i$, so the nearest opposite-class point is excluded. The set of all points of class $c$ is $X_c = \{x_i : y_i = c\}$. To cover $X_c$, the algorithm greedily selects balls that maximize the number of yet-uncovered points in $X_c$, breaking ties by radius if needed, and removes newly covered points from consideration. The process iterates until all points in $X_c$ are covered. The number of selected balls, $b_c$, is the ONB for class $c$ (Pascual-Triana et al., 2024; Pascual-Triana et al., 2020).
Associated to each covering ball are three attributes:
- Radius ($r$): indicates local class-separation.
- Covered instances: quantifies local point density.
- Density: signals tightness of local packing, relevant for outlier and boundary detection.
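The greedy construction described above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function name `onb_cover`, the dictionary layout of each ball, the tie-break toward the larger radius, and the choice to report newly covered points as the coverage count are all assumptions. Balls use a strict inequality so the nearest opposite-class point stays outside, each center is always counted as covered by its own ball, and at least two classes are assumed to be present.

```python
from math import dist

def onb_cover(points, labels, metric=dist):
    """Greedy class-pure ball cover; returns {class: [ball, ...]} (illustrative sketch)."""
    n = len(points)
    # Safe radius: distance to the nearest point of any other class.
    radii = [
        min(metric(points[i], points[j]) for j in range(n) if labels[j] != labels[i])
        for i in range(n)
    ]
    cover = {}
    for c in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == c]
        # Same-class points strictly inside each candidate ball, plus the center itself
        # (assumption: keeps the loop finite when opposite-class points coincide).
        inside = {
            i: {i} | {j for j in members if metric(points[i], points[j]) < radii[i]}
            for i in members
        }
        uncovered = set(members)
        balls = []
        while uncovered:
            # Most newly covered points first; ties go to the larger radius (assumption).
            best = max(members, key=lambda i: (len(inside[i] & uncovered), radii[i]))
            newly = inside[best] & uncovered
            balls.append({"center": best, "radius": radii[best], "covered": len(newly)})
            uncovered -= newly
        cover[c] = balls
    return cover

# Two well-separated one-dimensional classes need one ball each.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels = [0, 0, 0, 1, 1, 1]
cover = onb_cover(points, labels)
print({c: len(b) for c, b in cover.items()})  # {0: 1, 1: 1}
```

Passing a different `metric` (for example a Manhattan distance function) changes the radii and therefore the cover; the brute-force distance loops are kept for clarity rather than speed.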
2. ONB as a Data Complexity Metric
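ONB metrics offer a tunable, geometry-aware quantification of class overlap and boundary complexity. Heavy class overlap leads to small radii and a large ball count, as many small balls are required to maintain class-purity. Well-separated classes yield large radii and a minimal ball count. Writing $b_c$ for the number of balls covering the $n_c$ points of class $c$, with $n = \sum_c n_c$ points and $C$ classes in total, the main variants summarized in Pascual-Triana et al. (2020) include:
| Metric | Formula | Typical Use |
|---|---|---|
| $ONB_{tot}$ | $\frac{1}{n}\sum_{c=1}^{C} b_c$ | Global overlap |
| $ONB_{avg}$ | $\frac{1}{C}\sum_{c=1}^{C} \frac{b_c}{n_c}$ | Class-level overlap |
| Distance choices | Euclidean ($L_2$) or Manhattan ($L_1$) | Data-dependent |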
Empirically, the Manhattan-distance class-averaged variant, $ONB_{avg}^{man}$, shows the strongest negative correlation with the 1NN geometric mean across both synthetic and real-world datasets, including balanced artificial data (Pascual-Triana et al., 2020).
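To make the global and class-averaged variants concrete, the sketch below assumes the normalized forms: the global metric divides the total ball count by the total number of points, and the class-averaged metric averages the per-class ratio of balls to points. The function names `onb_tot` and `onb_avg` and the example counts are hypothetical.

```python
def onb_tot(balls_per_class, sizes_per_class):
    """Global variant (assumed form): total balls divided by total points."""
    return sum(balls_per_class.values()) / sum(sizes_per_class.values())

def onb_avg(balls_per_class, sizes_per_class):
    """Class-averaged variant (assumed form): mean of per-class balls/points ratios."""
    ratios = [balls_per_class[c] / sizes_per_class[c] for c in balls_per_class]
    return sum(ratios) / len(ratios)

# Hypothetical imbalanced problem: the minority class "b" is heavily overlapped.
balls = {"a": 2, "b": 4}
sizes = {"a": 40, "b": 10}
print(onb_tot(balls, sizes))  # 6 balls over 50 points
print(onb_avg(balls, sizes))  # mean of 2/40 and 4/10, about 0.225
```

In this toy case the class-averaged value is much larger than the global one, because the small, overlapped class weighs as much as the large, clean one. That is one reason a class-averaged variant can be more informative under imbalance.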
3. Theoretical Properties and Interpretations
Several monotonicity and tradeoff properties hold:
- Monotonicity: As class overlap increases, ONB increases; as classes become more separable, ONB decreases.
- Radius–Overlap Trade-off: The average ball radius $\bar{r}_c$ for covering class $c$ is inversely related to overlap: more overlap means smaller $\bar{r}_c$.
- Bounds: $1 \le b_c \le n_c$, where $b_c$ is the number of balls covering the $n_c$ points of class $c$; $b_c = n_c$ under maximal overlap (each point requires its own ball), and $b_c = 1$ for fully disjoint classes (Pascual-Triana et al., 2024; Pascual-Triana et al., 2020).
- Boundary Complexity: ONB simultaneously captures local (microscopic) and global (macroscopic) structural complexity at class boundaries.
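The two extremes of the bounds can be checked by hand on a toy example. The following is a sketch under stated assumptions: the one-dimensional data are invented for illustration, $b_c$ and $n_c$ denote the ball count and size of class $c$, $X$ is the full point set, and each ball contains the points strictly closer to its center than the safe radius.

```latex
\begin{aligned}
&\text{Interleaved: } x = (0, 1, 2, 3),\ y = (A, B, A, B):
  && r_i = 1 \ \forall i,\quad B(x_i, r_i) \cap X = \{x_i\}
  \ \Rightarrow\ b_A = n_A = 2,\ b_B = n_B = 2. \\
&\text{Separated: } X_A = \{0, 1, 2\},\ X_B = \{10, 11, 12\}:
  && r(0) = 10,\ X_A \subset B(0, 10) \ \Rightarrow\ b_A = 1; \quad
     r(12) = 10,\ X_B \subset B(12, 10) \ \Rightarrow\ b_B = 1.
\end{aligned}
```

In the interleaved case every ball shrinks to its own center, so the count reaches the upper bound; in the separated case a single ball centered at the point farthest from the other class already covers the whole class.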
ONB values correlate strongly with classification difficulty. Instance-based methods (e.g., kNN) suffer most in high-ONB regimes, where boundaries are intricate or classes interpenetrate. This behavior is validated empirically, as ONB provides better prediction of classifier performance than alternatives such as MST- or nearest-neighbor-based complexity measures (Pascual-Triana et al., 2020).
4. Computational Complexity and Practical Implementations
The dominant complexity arises from distance computations and the covering procedure:
- Pairwise distances: $O(n^2 d)$ for $n$ points in $d$ dimensions.
- Cover construction: per class, each step may require $O(n)$ scans, possibly up to $n$ steps, yielding $O(n^2)$ complexity total in the worst case.
- Accelerations: For moderate $d$, practical implementations leverage spatial indices (e.g., kd-trees) or approximate nearest-neighbor techniques to expedite range queries and nearest-opposite computation (Pascual-Triana et al., 2024; Pascual-Triana et al., 2020).
- Parameterization: The metric is robust to distance choice and agnostic to scale, but boundary region identification may require percentile-based thresholding of radius, coverage, or density.
5. Extensions: Singular Models and Generalizations
The ONB paradigm generalizes naturally to:
- Multi-label: Restricting ball covers to label-overlap constraints (Pascual-Triana et al., 2020).
- Multi-instance: Treating each bag as a composite entity, with covering applied in bag space.
- Multi-view: Requiring that balls capture proximity in all feature spaces jointly.
- Singular problems: Ball coverage schemes can be adapted to account for more intricate or non-Euclidean relational structures.
In applied probability, analogous "ONB" statistics arise in urn models, where the overflow counts the assignments of balls to urns of finite capacity that result in overfilling. Exact asymptotic formulas and limit laws (Poisson or Gaussian) for these collision/overflow statistics are derived under varying scaling regimes for the numbers of balls and urns (Gouet et al., 2019).
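As a minimal sketch of the overflow statistic, the function below counts the balls that arrive at an urn that is already full. It assumes that an overflowing ball is simply discarded and that every urn has the same capacity; the name `overflow` and the example sequence are hypothetical and not taken from Gouet et al.

```python
def overflow(assignments, capacity):
    """Count balls that land in an urn already holding `capacity` balls."""
    load = {}
    spilled = 0
    for urn in assignments:
        if load.get(urn, 0) >= capacity:
            spilled += 1  # urn is full: this assignment overfills it
        else:
            load[urn] = load.get(urn, 0) + 1
    return spilled

# Five balls thrown in order into urns 0, 0, 0, 1, 0 with capacity 2:
# the third and fifth balls find urn 0 full.
print(overflow([0, 0, 0, 1, 0], capacity=2))  # 2
```

Feeding the function uniformly random urn indices and repeating the experiment would give an empirical overflow distribution to compare against the limit laws mentioned above.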
6. Combinatorial Thresholds and ONB in Allocation Problems
In combinatorics, the ONB concept maps to sharp thresholds for the emergence of overlapping box occupancies:
- Model-dependent thresholds: For $n$ balls and $k$ boxes (distinguishable/indistinguishable, surjective, etc.), the ONB represents the maximal box count $k$ where the probability of any two boxes coinciding in occupancy remains bounded away from zero.
- Sample results (Czabarka et al., 2012) give a separate threshold for each model: compositions, integer partitions, and surjections / set partitions.
Each model exhibits a sharp phase transition: as the box count $k$ crosses the threshold, the probability of occupancy overlap jumps from near 0 to near 1.
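For small instances the occupancy-overlap probability can be computed exactly by enumeration. The sketch below does this for the compositions model only, assuming a uniformly random split of the balls into non-empty, ordered boxes; the function names are hypothetical, and the tiny values of the ball and box counts only illustrate the trend, not the asymptotic threshold.

```python
from fractions import Fraction
from itertools import combinations

def compositions(n, k):
    """Yield every ordered way to split n into k positive parts (requires 1 <= k <= n)."""
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

def prob_repeated_occupancy(n, k):
    """Exact probability that two boxes share an occupancy under a uniform composition."""
    total = repeated = 0
    for parts in compositions(n, k):
        total += 1
        repeated += len(set(parts)) < k  # True when some part size repeats
    return Fraction(repeated, total)

# Six balls: the overlap probability rises as the number of boxes grows.
for k in (2, 3, 4):
    print(k, prob_repeated_occupancy(6, k))  # 2 1/5, then 3 2/5, then 4 1
```

With four boxes and six balls, four distinct positive occupancies would need at least ten balls, so a repeat is certain.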
7. Applications: Fairness, Bias Reduction, and Data Preprocessing
In the context of fair machine learning, the ONB has been adapted into the Fair-ONB method, which targets bias reduction by undersampling regions of greatest overlap—those closest to decision boundaries or with minimal class-purity—according to ball attributes (radius, coverage, and density) (Pascual-Triana et al., 2024). The procedure identifies high-overlap ("worst") regions by percentile filtering and removes or relabels associated instances, thus enhancing model fairness with minimal predictive performance degradation.
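The percentile filtering step can be sketched as follows. This is not the Fair-ONB implementation: the function name `worst_balls`, the dictionary representation of a ball, the nearest-rank percentile rule, and the assumption that low attribute values (for example small radii) mark the highest-overlap regions are all illustrative choices.

```python
import math

def worst_balls(balls, attribute="radius", percentile=25):
    """Return the balls whose attribute is at or below the given percentile."""
    if not balls:
        return []
    values = sorted(b[attribute] for b in balls)
    # Nearest-rank percentile: the value at position ceil(p/100 * N), at least 1.
    rank = max(1, math.ceil(percentile / 100 * len(values)))
    threshold = values[rank - 1]
    return [b for b in balls if b[attribute] <= threshold]

# Hypothetical balls described only by their radius.
balls = [{"radius": 0.2}, {"radius": 1.5}, {"radius": 0.4}, {"radius": 3.0}]
print(worst_balls(balls, "radius", 25))  # [{'radius': 0.2}]
```

The instances covered by the returned balls would then be candidates for removal or relabeling; filtering on coverage or density instead only requires passing a different `attribute` key.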
ONB and its filtered variants have been empirically validated to:
- Improve class-balanced representation across protected groups.
- Reduce bias in algorithmic decisions rooted in training set geometry.
- Offer instance selection strategies superior to random or naive undersampling in maximizing fairness while preserving classification utility (Pascual-Triana et al., 2024).
References:
- (Pascual-Triana et al., 2024) Fair Overlap Number of Balls (Fair-ONB): A Data-Morphology-based Undersampling Method for Bias Reduction
- (Pascual-Triana et al., 2020) Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect
- (Gouet et al., 2019) Asymptotics of the overflow in urn models
- (Czabarka et al., 2012) Threshold functions for distinct parts: revisiting Erdős-Lehner