Stratified Coreset Sampling
- Stratified coreset sampling is a method that partitions data into strata (layers) and samples within each, ensuring robust approximation and efficient computation.
- It leverages statistical concentration bounds and tailored weighting to maintain performance in clustering, regression, and classification tasks.
- Empirical results show that this approach improves data efficiency and stability while mitigating the effects of outliers and class imbalance.
Stratified coreset sampling refers to the construction of coresets, weighted subsets of the data that approximate the solution of a learning or optimization problem, by partitioning the dataset into meaningful strata (layers), sampling within these strata, and assigning appropriate weights. Such methods offer robustness against outliers, improved representation across data subpopulations, and statistical guarantees at reduced sample size and computational overhead.
1. Principles of Data Stratification and Layer Definition
Stratification is achieved by partitioning input data according to key statistics, model dynamics, or semantically meaningful scores. In robust clustering or regression, layered sampling (Ding et al., 2020) proceeds by defining nested regions around an initial solution:
For $k$-median/means with outliers, the data is partitioned into layers around an initial solution $\tilde{C}$:
- $H_0$ includes points within a small radius $r$ of the centers $\tilde{C}$,
- $H_i$ spans the dyadic ring of radii $(2^{i-1}r,\ 2^i r]$ around the centers, for $i \ge 1$,
- $H_{\text{out}}$ collects the $(1+1/\epsilon)z$ farthest points as a dedicated outlier stratum.
For linear regression, slabs about an initial hyperplane define layers by bounding residuals. In concept-bottleneck-based sampling (Mehra et al., 23 Feb 2025), data is partitioned into deciles or quantile bins of concept-driven margin scores (AUM), potentially excising the hardest-scoring examples. In class-difficulty-separable scenarios (Tsai et al., 15 Jul 2025), stratification occurs across classes, with further stratification within each class based on difficulty.
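To make the ring construction concrete, the sketch below assigns points to strata by distance to the nearest center. It is a minimal illustration: `assign_layers`, the base radius `r`, and the number of rings are illustrative choices, not fixed by the cited papers.

```python
import numpy as np

def assign_layers(points, centers, r, num_rings):
    """Assign each point to a stratum by distance to its nearest center:
    layer 0 for distance <= r, layer i for the dyadic ring (2^(i-1)*r, 2^i*r],
    and a final outlier layer for everything farther out."""
    # distance from each point to its nearest center
    dists = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    layers = np.zeros(len(points), dtype=int)
    for i in range(1, num_rings + 1):
        layers[(dists > 2 ** (i - 1) * r) & (dists <= 2 ** i * r)] = i
    layers[dists > 2 ** num_rings * r] = num_rings + 1  # dedicated outlier stratum
    return layers
```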
2. Stratified Coreset Construction Algorithms
Sampling within layers exploits statistical concentration inequalities (Hoeffding, Chernoff):
- For each stratum $H_i$, $m_i$ samples are drawn uniformly (without replacement) and assigned weight $w(p) = n_i / m_i$ (where $n_i$ is the stratum population).
- Outliers in $H_{\text{out}}$ are wholly included with unit weight to prevent under-sampling.
- Aggregated samples from all strata form the weighted coreset $S$.
The core procedure (as in (Ding et al., 2020) and (Malaviya et al., 2023)) is sketched below as a minimal runnable rendering of the paper pseudocode:

```python
import random

def stratified_coreset(strata, sample_sizes, outliers):
    """strata: list of point lists H_i; sample_sizes: per-stratum counts m_i <= n_i."""
    coreset = []
    for H_i, m_i in zip(strata, sample_sizes):
        for p in random.sample(H_i, m_i):        # uniform, without replacement
            coreset.append((p, len(H_i) / m_i))  # weight w(p) = n_i / m_i
    coreset += [(p, 1.0) for p in outliers]      # all of H_out, unit weight
    return coreset                               # weighted coreset S
```
In class-proportional selection (Tsai et al., 15 Jul 2025):
- The total coreset budget is allocated proportionally to class sizes ($m_c \propto n_c$ for class $c$),
- Sampling is performed using methods like Hardest-CP, Sliding-Window-CP, or CCS-CP, independently per class.
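A minimal sketch of the class-proportional quota step follows; `class_proportional_coreset` is an illustrative name, and uniform `random.sample` stands in for the difficulty-aware per-class samplers (Hardest-CP, Sliding-Window-CP, CCS-CP) named above.

```python
from collections import defaultdict
import random

def class_proportional_coreset(points, labels, budget):
    """Allocate the total budget proportionally to class sizes (m_c ~ n_c / n),
    then sample within each class; random.sample is a uniform stand-in for a
    difficulty-aware per-class sampler such as CCS-CP."""
    by_class = defaultdict(list)
    for p, y in zip(points, labels):
        by_class[y].append(p)
    n, coreset = len(points), []
    for y, pts in by_class.items():
        m_c = min(len(pts), max(1, round(budget * len(pts) / n)))  # per-class quota
        coreset += [(p, len(pts) / m_c) for p in random.sample(pts, m_c)]
    return coreset
```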
In concept-stratified selection (Mehra et al., 23 Feb 2025), bins are defined over concept-aligned AUM scores (e.g., decile or quantile bins), with iterative coverage to ensure each stratum is represented.
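A hedged sketch of score-stratified selection, assuming precomputed per-example difficulty scores (such as AUM) are given as an array; the quantile binning and round-robin fill below are one simple way to realize the iterative-coverage idea, not the paper's exact procedure.

```python
import random
import numpy as np

def stratified_by_score(scores, num_bins, budget):
    """Bin items into quantile strata of a difficulty score (e.g., concept AUM)
    and fill the budget round-robin so every stratum stays represented."""
    edges = np.quantile(scores, np.linspace(0, 1, num_bins + 1))
    bins = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, num_bins - 1)
    pools = [list(np.flatnonzero(bins == b)) for b in range(num_bins)]
    for pool in pools:
        random.shuffle(pool)
    selected = []
    while len(selected) < budget and any(pools):
        for pool in pools:                     # one pick per stratum per pass
            if pool and len(selected) < budget:
                selected.append(pool.pop())
    return selected                            # indices of the retained examples
```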
3. Theoretical Guarantees and Statistical Properties
Stratified sampling inherits concentration from per-stratum sampling bounds, with union bounds controlling global approximation error across all strata. For robust clustering, layered sampling yields (Ding et al., 2020):
- For $k$-median/means with outliers, the weighted coreset $S$ satisfies a guarantee of the form $|\mathrm{cost}_z(S, C) - \mathrm{cost}_z(X, C)| \le \epsilon \cdot \mathrm{cost}_z(X, \tilde{C})$ for any solution $C$ within a local neighborhood of the initial solution $\tilde{C}$.
- The coreset size is governed by $k$, $z$, $1/\epsilon$, and the number of layers, rather than by the dataset size $n$.
For non-decomposable objectives (F1, MCC), stratified coresets achieve weak $\epsilon$-guarantees on the metric for all classifiers above a performance threshold, with per-class sample complexities that match lower bounds and scale with the VC-dimension $d$ and the confidence parameter $\delta$ (Malaviya et al., 2023).
Uniform stratification by distance rings yields coresets whose size is independent of the dataset size $n$ for capacitated/fair $k$-median/means and for the geometric median (Braverman et al., 2022). Stratification by concept-difficulty yields balanced coverage and resilience to mislabeled outliers, with empirical robustness at high pruning rates (Mehra et al., 23 Feb 2025).
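To illustrate how per-stratum concentration plus a union bound translates into sample sizes, the following sketch instantiates the standard Hoeffding calculation; the loss ranges $B_i$ and the resulting formula are a generic textbook bound, not the exact constants of any cited paper.

```python
import math

def per_stratum_sample_sizes(loss_ranges, eps, delta):
    """Hoeffding per stratum plus a union bound over all K strata:
    m_i >= B_i^2 * ln(2K/delta) / (2*eps^2) keeps every stratum's mean-loss
    estimate within eps, with overall failure probability at most delta."""
    K = len(loss_ranges)
    return [math.ceil(B ** 2 * math.log(2 * K / delta) / (2 * eps ** 2))
            for B in loss_ranges]

# Strata nearer the centers have smaller loss ranges, so they need fewer samples.
print(per_stratum_sample_sizes([1.0, 0.5, 0.25], eps=0.05, delta=0.01))
```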
4. Stratification Schemes: Class, Concept, and Difficulty
Stratification schemes vary with problem context:
- Class-based: Each class receives a quota and is sampled independently, preventing majority class dominance and mitigating minority class underrepresentation (Tsai et al., 15 Jul 2025).
- Concept-difficulty: Samples are binned by interpretable model-agnostic difficulty scores (AUM via concept bottlenecks) (Mehra et al., 23 Feb 2025); this enables coverage across semantic complexity and works for both labeled and unlabeled data.
- Distance/residual-based: Layers are built with exponentially growing radii or residual slabs about an initial solution to absorb variance and isolate outliers (Ding et al., 2020).
| Stratification Principle | Layer Definition | Guarantee Type |
|---|---|---|
| Class-proportional | Per-class quotas, within-class difficulty strata | Preserves class proportions, robust to imbalance (Tsai et al., 15 Jul 2025) |
| Concept/difficulty | AUM quantile bins, outlier cutoff | Balanced semantic coverage, outlier control (Mehra et al., 23 Feb 2025) |
| Distance/residual | Dyadic metric rings/slabs, outlier bin | Additive error for robust objectives, trims outliers (Ding et al., 2020) |
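For the distance/residual row, a regression analogue of the ring construction can be sketched as follows, assuming an initial fit `beta0` and a base slab width `h` (both illustrative parameters, not fixed by the papers):

```python
import numpy as np

def residual_slabs(X, y, beta0, h, num_slabs):
    """Stratify regression points by |residual| w.r.t. an initial fit beta0:
    slab 0 for |r| <= h, slab i for the ring (2^(i-1)*h, 2^i*h], plus an
    outlier slab beyond, mirroring the distance-ring scheme for clustering."""
    residuals = np.abs(y - X @ beta0)
    slabs = np.zeros(len(y), dtype=int)
    for i in range(1, num_slabs + 1):
        slabs[(residuals > 2 ** (i - 1) * h) & (residuals <= 2 ** i * h)] = i
    slabs[residuals > 2 ** num_slabs * h] = num_slabs + 1  # outlier slab
    return slabs
```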
5. Empirical Results and Practical Implementation
Stratified coreset sampling demonstrates improved data efficiency and stability under aggressive pruning. Representative empirical results include:
- Class-proportional CCS-CP maintains >97% accuracy, precision, and recall even at 99% pruning, outperforming class-agnostic CCS (accuracy drop: 2.58% vs. 7.59%) on CTU-13 (Tsai et al., 15 Jul 2025).
- Concept-stratified coresets achieve 84.6% accuracy (90% pruning) on CIFAR-10, compared to 79.1% for random sampling under the same protocol (Mehra et al., 23 Feb 2025).
- Stratified coresets match or outperform leverage-score or k-means-based coresets in non-decomposable settings (F1, MCC), with 10–100× runtime advantage (Malaviya et al., 2023).
Key practical recommendations:
- Measure class-difficulty separability to guide stratification.
- Allocate sampling budgets to minority/rare strata for variance reduction.
- Use empirically determined stratification parameters (number of bins, outlier cutoff).
- Parallelize stratified sampling across strata for scalability to large datasets (see the sketch below).
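A minimal sketch of the parallelization recommendation, using a thread-based executor for illustration; process- or cluster-based execution would be the natural substitute at scale.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def sample_stratum(H_i, m_i):
    # uniform without replacement; weight n_i / m_i per sampled point
    return [(p, len(H_i) / m_i) for p in random.sample(H_i, m_i)]

def parallel_stratified_coreset(strata, sample_sizes, workers=4):
    """Strata are disjoint and sampled independently, so per-stratum sampling
    parallelizes trivially; swap in processes or a cluster for real scale."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(sample_stratum, strata, sample_sizes))
    return [wp for part in parts for wp in part]
```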
6. Relation to Classical Stratified Sampling
Stratified coreset sampling derives from classical stratified sampling but adapts to machine learning through target-aware strata boundaries, adaptive allocation, and coreset weight assignment:
Similarities:
- Data partitioned so that target statistics (e.g., loss, metric) are bounded within layers.
- Uniform subsampling and weighting within strata yield unbiased estimators.
- Per-stratum concentration ensures global approximation.
Differences:
- Strata are frequently adaptive and guided by empirical model characteristics, not fixed population features.
- Outlier layers are explicitly constructed and always fully represented.
- Guarantees target not just mean estimation but complex optimization objectives (robust clustering, non-decomposable metrics, assignment-preserving clustering).
This tailored stratification yields coresets suitable for robust optimization, structured datasets, and fairness/constrained settings, with sample sizes independent of data size for many tasks (Ding et al., 2020, Braverman et al., 2022).
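The unbiasedness point in the similarities above can be checked numerically. The following toy example, with arbitrary strata over synthetic data (all names and sizes illustrative), compares the weighted coreset estimate of the mean with the full-data mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
strata = np.split(data, [2_000, 7_000])   # three arbitrary, unequal strata

def coreset_mean(strata, m=50):
    """Weighted coreset estimate of the mean: each sampled point carries
    weight n_i / m, and the weighted sum is normalized by the population n."""
    n = sum(len(H) for H in strata)
    total = sum((len(H) / m) * rng.choice(H, m, replace=False).sum() for H in strata)
    return total / n

estimates = [coreset_mean(strata) for _ in range(1_000)]
print(np.mean(estimates), data.mean())    # nearly equal: the estimator is unbiased
```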
7. Applications and Extensions
Stratified coreset sampling underpins scalable algorithms for:
- Robust $k$-median/means clustering with/without outliers (Ding et al., 2020),
- Linear regression under outlier contamination (Ding et al., 2020),
- Non-decomposable supervised classification objectives (F1, MCC) (Malaviya et al., 2023),
- Model-agnostic image dataset pruning using concept bottlenecks (Mehra et al., 23 Feb 2025),
- Fair and capacitated clustering and Wasserstein barycenter computation (Braverman et al., 2022),
- High-stakes, imbalanced domains (network security, medical imaging) (Tsai et al., 15 Jul 2025).
A plausible implication is that stratified coreset sampling generalizes classical sampling theory to contemporary large-scale, high-dimensional, and structured data, providing an adaptive mechanism for efficient and effective data reduction across a spectrum of learning and optimization tasks.