LLP-Bench: Standardizing LLP Evaluation
- LLP-Bench is a comprehensive benchmark suite that standardizes the evaluation of LLP algorithms under group-level supervision.
- It defines four variants—Naive, Simple, Intermediate, and Hard—each with specific conditional independence constraints to challenge methodological assumptions.
- The framework covers diverse data modalities including tabular, image, and biological datasets, and incorporates detailed bag-level hardness metrics and model selection strategies.
Learning from Label Proportions (LLP) addresses supervised classification and regression when only group-level—rather than individual—labels are available: instances are partitioned into "bags," and supervision comes through aggregate label proportions per bag. LLP-Bench is a set of benchmarks and evaluation methodologies specifically designed to standardize, diversify, and rigorously challenge LLP algorithms. Rather than relying on simplistic or homogeneous data constructions, LLP-Bench—across its variants—programmatically generates datasets with a broad spectrum of bag structures and dependence regimes, and advances biological, tabular, and image-based evaluation. Key components include both an analysis framework for diverse dataset characteristics and a comprehensive empirical apparatus for fair and reproducible algorithm comparison (Franco et al., 2023, Brahmbhatt et al., 2023).
1. Problem Setting and Variants in LLP-Bench
An LLP instance comprises a feature space $\mathcal{X}$, class labels $\mathcal{Y} = \{1, \dots, C\}$, and a dataset $D = \{x_i\}_{i=1}^{n}$, where each $x_i$ is assigned to a bag $B_j \in \{B_1, \dots, B_m\}$. Supervision is limited to a bag-proportion matrix $P \in [0,1]^{m \times C}$ with entries

$$P_{jc} = \frac{1}{|B_j|} \sum_{i \in B_j} \mathbf{1}[y_i = c],$$

where $y_i \in \mathcal{Y}$ are the (unobserved) true labels.
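The bag-level supervision can be made concrete with a short sketch; `bag_proportion_matrix` is an illustrative helper, not part of LLP-Bench's API, and assumes class labels are coded `0..C-1`.

```python
import numpy as np

def bag_proportion_matrix(y, bag_ids, n_classes):
    """Compute the m x C bag-proportion matrix P from the (hidden) labels.

    P[j, c] is the fraction of items in bag j whose label is c.
    """
    bags = np.unique(bag_ids)
    P = np.zeros((len(bags), n_classes))
    for j, b in enumerate(bags):
        labels = y[bag_ids == b]
        for c in range(n_classes):
            P[j, c] = np.mean(labels == c)
    return P

# Toy example: 6 items, 2 bags, 2 classes.
y = np.array([0, 1, 1, 0, 0, 0])
bags = np.array([0, 0, 0, 1, 1, 1])
P = bag_proportion_matrix(y, bags, n_classes=2)
# Bag 0 contains labels [0, 1, 1] -> proportions [1/3, 2/3]
```

An LLP learner sees only `P` (and the bag memberships), never `y` itself.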
LLP-Bench formalizes four distinct variants by imposing specific conditional independence (CI) constraints among the item features $X$, labels $Y$, and bag assignments $B$:
| Variant | Distribution Factorization | Key CI Constraint(s) |
|---|---|---|
| Naive | $p(x, y, b) = p(b)\, p(x, y)$ | $B \perp (X, Y)$ |
| Simple | $p(x, y, b) = p(b)\, p(y \mid b)\, p(x \mid y)$ | $X \perp B \mid Y$ |
| Intermediate | $p(x, y, b) = p(b)\, p(x \mid b)\, p(y \mid x)$ (but $X \not\perp B$) | $Y \perp B \mid X$ |
| Hard | No non-trivial factorization (fully joint) | No CI structure imposed |
Each variant corresponds to a minimal graphical model. Conformance to CI constraints is empirically verified using established conditional independence tests (Franco et al., 2023).
2. Dataset Generation Strategies
Dataset construction in LLP-Bench aims to preserve the key statistical properties specified by the target variant. Starting with a labeled base dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, items are assigned to bags, each bag with a prescribed target size and target label distribution. The construction for each variant is as follows:
- Naive: Bags are assigned independently of both features and labels; random shuffling and block partitioning suffice.
- Simple: Bags are formed by randomly permuting items within each class, then allocating items to each bag in proportions matching its target label distribution.
- Intermediate: Employs auxiliary clustering of the feature space; optimization seeks a row-stochastic matrix assigning clusters to bags such that the deviation of the induced bag label proportions from their targets is minimized. Projected gradient descent is used for the matrix fitting.
- Hard: Constructs a 3D contingency table over clusters, labels, and bags. Iterative Proportional Fitting (IPF) ensures the empirical marginals match the targets for clusters, labels, and bags, as well as their pairwise joints.
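The Simple variant is the easiest to sketch end to end: because bag membership is drawn as a function of labels only, $X \perp B \mid Y$ holds by construction. The helper below is a minimal illustration (not the benchmark's code); it assumes class labels coded `0..C-1` and bag proportions that sum to 1.

```python
import numpy as np

def make_simple_bags(y, bag_sizes, target_props, rng=None):
    """Sketch of 'Simple' bag construction: bag membership depends on
    labels only. For each bag j, draw round(size_j * pi_jc) items of
    class c from a per-class shuffled pool."""
    rng = np.random.default_rng(rng)
    # One shuffled pool of item indices per class.
    pools = {c: rng.permutation(np.where(y == c)[0]).tolist()
             for c in np.unique(y)}
    bags = []
    for size, props in zip(bag_sizes, target_props):
        bag = []
        for c, p in enumerate(props):
            k = int(round(size * p))      # items of class c in this bag
            bag.extend(pools[c][:k])
            pools[c] = pools[c][k:]
        bags.append(bag)
    return bags
```

Within each class the items are exchangeable, so which features land in which bag carries no information beyond the label, matching the Simple variant's factorization.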
These procedures generalize over a variety of base datasets (e.g., tabular Adult, CIFAR-10 images), bag sizes (equal/varied), and proportion patterns (globally-matched, far-from-global, mixed) (Franco et al., 2023).
3. Benchmark Composition and Bag Construction Modalities
The tabular-specific LLP-Bench introduces two principal bag creation modalities:
- Random Bags: Fixed-size groups sampled iid, preserving feature distribution but randomizing label proportions.
- Feature Bags: Partitions data using one or two categorical features as grouping keys. All instances in a bag share the same feature value(s), mirroring real-world aggregation practices (e.g., advertising cohorts).
Stringent filtering is applied to ensure statistical validity: bags with too many or too few items are dropped, and groupings that retain less than 30% of the data are excluded. For the Criteo CTR and Criteo SSCL datasets, these rules yield 62 feature-bag datasets and 8 random-bag datasets, each with rich diversity in size, label distribution, and geometric clustering (Brahmbhatt et al., 2023).
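The feature-bag pipeline is essentially a group-by followed by filtering. The sketch below uses pandas to illustrate the idea; the size window and coverage threshold are illustrative placeholders, not the benchmark's exact values.

```python
import pandas as pd

def feature_bags(df, keys, min_size=2, max_size=2500, min_coverage=0.30):
    """Sketch of feature-bag construction: group rows by one or two
    categorical keys, drop bags outside a size window, and reject the
    grouping entirely if it retains too little of the data."""
    grouped = df.groupby(list(keys))
    # Keep only bags whose size falls inside the allowed window.
    kept = grouped.filter(lambda g: min_size <= len(g) <= max_size)
    if len(kept) < min_coverage * len(df):
        return None  # grouping too lossy; excluded from the benchmark
    return kept.groupby(list(keys))

df = pd.DataFrame({"f": ["a"] * 5 + ["b"] + ["c"] * 4,
                   "y": [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]})
out = feature_bags(df, ["f"], min_size=2, max_size=10)
# The singleton bag "b" is dropped; bags "a" and "c" survive.
```

All rows within a returned bag share the same grouping-key value(s), mirroring the aggregation cohorts described above.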
4. Dataset Hardness Metrics
To quantify the intrinsic difficulty of LLP datasets, LLP-Bench proposes four metrics applied over a collection of bags $\{B_j\}_{j=1}^{m}$:
- Label-Proportion Stdev: Standard deviation of the bag-level label proportions across bags. Higher values indicate richer variation and less trivial bag-level learning.
- Inter- vs. Intra-Bag Separation Ratio: The ratio of inter-bag separation (distances between bag centroids) to intra-bag spread, measuring geometric clustering. Values close to 1 indicate geometrically indistinguishable bags; higher ratios suggest stronger cues from bag identity.
- Mean Bag Size: Averages per-bag instance counts. Larger bags reduce resolution of supervision.
- Cumulative Bag-Size Distribution: Records selected percentiles (50th, 70th, 85th, 95th) of bag sizes, identifying long- or short-tailed distributions and thus differences in "per-instance" learning difficulty.
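The first two metrics can be sketched directly. Note the separation ratio below is one plausible formalization (mean inter-centroid distance over mean within-bag distance to centroid); the benchmark's exact definition may differ, and `label_proportion_stdev` assumes a binary task with `P[:, 1]` the positive-class proportions.

```python
import numpy as np

def label_proportion_stdev(P):
    """Std-dev of positive-class bag proportions (binary case)."""
    return float(np.std(P[:, 1]))

def separation_ratio(X, bag_ids):
    """Mean inter-bag centroid distance / mean intra-bag spread.
    Values near 1 mean bags are geometrically indistinguishable."""
    bags = np.unique(bag_ids)
    centroids = np.array([X[bag_ids == b].mean(axis=0) for b in bags])
    # Average distance of items to their own bag's centroid.
    intra = np.mean([np.linalg.norm(X[bag_ids == b] - c, axis=1).mean()
                     for b, c in zip(bags, centroids)])
    # Average pairwise distance between distinct bag centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    inter = pair[np.triu_indices(len(bags), k=1)].mean()
    return float(inter / intra)
```

Two tight, well-separated bags yield a ratio far above 1; overlapping bags push it toward 1.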
Empirical analysis shows that feature bags span a wide range on each metric (label-proportion stdevs up to $0.18$, separation ratios up to $1.6$, mean bag sizes 150–500) (Brahmbhatt et al., 2023). There is minimal correlation between these axes, validating the benchmark's coverage of LLP complexity.
5. Model Selection and Evaluation Protocol
Standard supervised hyperparameter selection is not feasible in LLP, as ground-truth labels are unavailable for held-out items. LLP-Bench implements four strategies tailored to the LLP setting:
- Full-Bag $k$-Fold: Entire bags are assigned to different folds; validation compares predicted vs. actual bag-level proportions.
- Split-Bag Bootstrap: Each bag is split into training/validation sub-bags, bootstrapped and averaged.
- Split-Bag $k$-Fold: Bag-respecting $k$-fold partitioning within each bag.
- Split-Bag Shuffle: Random balanced splits within each bag.
The surrogate validation loss compares the model's predicted bag-level label proportions against the observed ones, here written with an $\ell_1$ distance:

$$\mathcal{L}_{\mathrm{val}}(\theta) = \frac{1}{m} \sum_{j=1}^{m} \left\| \hat{P}_j(\theta) - P_j \right\|_1, \qquad \hat{P}_{jc}(\theta) = \frac{1}{|B_j|} \sum_{i \in B_j} p_\theta(y = c \mid x_i),$$

where $P_j$ is the observed proportion vector of bag $B_j$. Hyperparameters are chosen to minimize this criterion.
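A proportion-matching validation loss of this kind is a few lines of NumPy. This sketch uses an L1 distance averaged over bags; the benchmark may use a different divergence (e.g. KL or squared error), so treat the distance choice as an assumption.

```python
import numpy as np

def surrogate_val_loss(probs, bag_ids, P_true):
    """Average L1 distance between predicted and true bag proportions.

    probs: (n, C) per-item predicted class probabilities
    bag_ids: (n,) bag assignment of each item
    P_true: (m, C) observed bag-proportion matrix
    """
    bags = np.unique(bag_ids)
    loss = 0.0
    for j, b in enumerate(bags):
        p_hat = probs[bag_ids == b].mean(axis=0)  # predicted bag proportions
        loss += np.abs(p_hat - P_true[j]).sum()
    return loss / len(bags)
```

A model whose per-item probabilities average out to the true proportions in every validation bag attains zero loss, even though no item-level labels were used.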
Evaluation follows a meta-protocol: item-level 75%/25% train/test split, bag proportion recomputation, hyperparameter sweep and selection, full training, and assessment by item-level accuracy (classification) or MSE (regression). Algorithm comparisons aggregate 30 runs per setting with statistical testing (Franco et al., 2023).
6. Empirical Findings and Method Performance
Table: Top-performing LLP methods by dataset type
| Dataset | Dominant Methods | Typical Metrics |
|---|---|---|
| CTR Feature Bags | SIM-LLP, DLLP-BCE, DLLP-MSE, GenBags | AUC: 72%–78% |
| CTR Random Bags | DLLP-BCE/MSE, SIM-LLP | |
| SSCL Feature Bags | DLLP-MSE, SIM-LLP | MSE stable; Bag size ≈200–325 |
| SSCL Random Bags | GenBags (for q ≥ 256) | Lower MSE |
- SIM-LLP dominates in 41/52 Criteo CTR feature-bags, leveraging clustering induced by feature grouping.
- DLLP-BCE and DLLP-MSE consistently perform near the instance-supervised upper bound.
- Mean-Map and OT-based methods are less effective, especially as bag structure diverges from their assumptions.
- The choice of model selection strategy is critical. For Naive/Simple variants, Full-Bag $k$-Fold is effective; for Intermediate/Hard, Split-Bag methods yield up to 10% relative improvement (Franco et al., 2023, Brahmbhatt et al., 2023).
Correlations align with metric intuition: higher bag separation and label variation increase ease of recovery, while large bag sizes impede instance discrimination.
7. Strengths, Limitations, and Impact
LLP-Bench establishes the first large-scale, open tabular LLP benchmark, encompassing classification and regression with tens of millions of instances and unparalleled diversity in bag construction. Its metrics enable analysis of dataset and method hardness, and exhaustive benchmarking (over 3,000 experiments) provides a nuanced view of algorithm strengths and weaknesses, supplanting prior limited comparisons.
Limitations include restriction to at most two categorical grouping keys, reliance on Criteo datasets, and a fixed MLP architecture. Proposed extensions involve other tabular domains, deeper architectures, and broader metrics (e.g., label–feature coupling). Practical applicability includes principled design of privacy-preserving aggregation pipelines, especially in online advertising and federated learning: LLP-Bench provides design guidance on grouping key and bag sizing decisions to optimize utility-privacy tradeoffs.
LLP-Bench thus provides the reference standard for evaluating LLP algorithms, clarifying which techniques and model-selection approaches suit each dataset regime, and catalyzing advances in weak supervision under aggregate constraints (Brahmbhatt et al., 2023, Franco et al., 2023).