
The Sample Complexity of Multicalibration

Published 23 Apr 2026 in cs.LG, math.ST, and stat.ML | (2604.21923v1)

Abstract: We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G| \le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, in contrast to marginal calibration which is strictly more difficult in the online setting. In contrast we observe that for $\kappa = 0$, the sample complexity of multicalibration remains $\widetilde{\Theta}(\varepsilon^{-2})$ exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted $L_p$ multicalibration metric for all $1 \le p \le 2$, with optimal exponent $3/p$. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.

Summary

  • The paper establishes that multicalibration has sample complexity Θ̃(ε⁻³) for mean ECE and Θ̃(ε^(–3/p)) for weighted Lₚ metrics with 1 ≤ p ≤ 2, whenever the group family is polynomially large in 1/ε.
  • The methodology leverages coding-theoretic constructions for lower bounds and employs online-to-batch reduction to match these bounds with nearly optimal upper rates.
  • The findings reveal a threshold phenomenon where constant-size group families maintain marginal calibration rates, highlighting the increased statistical burden for larger group classes.

The Sample Complexity of Multicalibration: Formal Analysis and Minimax Rates

Introduction and Motivation

Multicalibration is a statistical fairness criterion ensuring calibrated predictions not only at the population level but simultaneously for multiple identifiable subpopulations (groups). Originally formulated to strengthen calibration beyond its marginal form, multicalibration has since become foundational for loss-agnostic prediction (omniprediction), complexity-theoretic constructions, and distributed information aggregation. The present work addresses a fundamental open question: what is the minimax sample complexity of multicalibration as a function of the target error $\varepsilon$ and the number of groups $|G|$?

This paper establishes tight minimax sample complexity rates for multicalibration across classical mean regression, weighted $L_p$ metrics, and a broad class of elicitable properties, including expectiles and quantiles. The authors separate multicalibration from marginal calibration, quantifying the additional statistical burden multicalibration imposes even in batch settings.

Problem Formulation

Given $n$ i.i.d. samples from an unknown distribution $P$ over $(X, Y)$, the learner is tasked with outputting a (possibly randomized) predictor $Q$ achieving population multicalibration error at most $\varepsilon$ with respect to a finite family of groups $G$. The central error measure is the Expected Calibration Error (ECE) for a group $g$, which in its standard form, with $V(Q)$ denoting the set of values $Q$ predicts, reads:

$$\mathrm{ECE}_g(Q) = \sum_{v \in V(Q)} \left| \mathbb{E}\left[ g(X)\, \mathbf{1}\{Q(X) = v\}\, (Y - v) \right] \right|$$

Multicalibration requires the ECE to be at most $\varepsilon$ for each $g \in G$. The sample complexity is then:

  • the minimal $n$ needed for $\varepsilon$-multicalibration over group families of a given size $|G|$.
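
As a rough illustration of the quantity being bounded, the sketch below computes a plug-in (sample) version of group-wise ECE and takes the worst group. It assumes the common form ECE_g(Q) = Σ_v |E[g(X)·1{Q(X)=v}·(Y − v)]| with binary group indicators and a deterministic, finitely-valued predictor. The function names are hypothetical, and the paper's exact definition (in particular for randomized predictors) may differ.

```python
from collections import defaultdict


def empirical_group_ece(preds, labels, in_group):
    """Plug-in estimate of sum_v |E[g(X) 1{Q(X)=v} (Y - v)]| from one sample.

    preds    : predicted values Q(x_i), assumed to lie on a finite grid
    labels   : observed outcomes y_i
    in_group : booleans, True when x_i belongs to the group g
    """
    n = len(preds)
    residual_by_value = defaultdict(float)
    for v, y, g in zip(preds, labels, in_group):
        if g:
            # Accumulate the signed residual inside each prediction bucket.
            residual_by_value[v] += y - v
    return sum(abs(r) for r in residual_by_value.values()) / n


def empirical_multicalibration_error(preds, labels, groups):
    """Worst-case empirical ECE over a family of group-membership vectors."""
    return max(empirical_group_ece(preds, labels, g) for g in groups)


preds = [0.5, 0.5, 0.5, 0.5]
labels = [1, 0, 1, 0]
everyone = [True, True, True, True]
subgroup = [True, False, True, False]
print(empirical_multicalibration_error(preds, labels, [everyone, subgroup]))
```

In this toy sample the predictor is perfectly calibrated on the whole population but not on the subgroup, which is the gap multicalibration is meant to close. A plug-in estimate like this is itself noisy, so it is a diagnostic rather than the population error the bounds refer to.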

Results: Sharp Minimax Rates

Main Theoretical Contributions

  1. Lower Bounds: For polynomial-size group families ($|G| \le \varepsilon^{-\kappa}$, $\kappa > 0$), achieving multicalibration with mean ECE error $\varepsilon$ requires, even for randomized predictors:

$$n = \widetilde{\Omega}(\varepsilon^{-3})$$

Generalized to weighted $L_p$ metrics ($1 \le p \le 2$) and regular elicitable properties, the minimax sample complexity is at least:

$$n = \widetilde{\Omega}(\varepsilon^{-3/p})$$

The lower bounds exploit coding-theoretic constructions of "hard" group families and distribution families, establishing that the rates already manifest in simple one-dimensional monotone Bernoulli regression settings.

  2. Upper Bounds: The lower bounds are matched (up to polylogarithmic terms) by batch predictors obtained via online-to-batch reduction from recent online multicalibration algorithms. For mean ECE, the number of samples that suffices is:

$$n = \widetilde{O}(\varepsilon^{-3})$$

For weighted $L_p$ metrics and general elicitable properties (under regularity), the sharp sample complexity rate is $\widetilde{\Theta}(\varepsilon^{-3/p})$.

  3. Threshold Phenomenon: For constant-size group families, the sample complexity remains at the marginal calibration rate ($\widetilde{\Theta}(\varepsilon^{-2})$), exhibiting a "sharp threshold" as $|G|$ increases.

Numerical and Algorithmic Implications

  • The lower bounds apply even to randomized predictors, so randomization cannot improve the sample complexity exponent.
  • For $p > 2$, the lower bounds do not match the upper bounds, leaving tightness in this range open.
  • The batch minimax exponent for mean ECE (namely 3) matches the adversarial online setting, in contrast to marginal calibration, where the online setting is strictly harder than the batch setting.

Technical Approach

Lower Bound Construction

  • Hard Instances via Coding Theory: The construction uses packing codes to define a family of regression problems parameterized by codewords, each encoding the regression function as a staircase map. The associated group family is constructed dyadically and compactly represents all threshold functions on the domain.
  • Decoding Argument: Any predictor achieving $\varepsilon$-multicalibration with respect to the compressed group family must, in effect, identify the underlying codeword; Fano's inequality then dictates the sample size necessary to decode among exponentially many distributions that are pairwise close in KL divergence.
  • Extension to Regular Properties and $L_p$ Metrics: Hölder's inequality lifts the lower bound from ECE ($p = 1$) to weighted $L_p$ metrics. Regular elicitable properties are handled under quantitative conditions on identification functions and distributional proximity.
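
To give a feel for why a dyadic group family can represent thresholds compactly, here is a minimal sketch of the generic dyadic-decomposition idea. It is not the paper's construction, whose details are not given here. On a domain {0, ..., m−1} with m a power of two (an assumption made for simplicity), every prefix {0, ..., t−1} splits into at most log₂(m) disjoint dyadic intervals, read off from the binary expansion of t. The function name `dyadic_prefix_cover` is hypothetical.

```python
def dyadic_prefix_cover(t, m):
    """Split the prefix {0, ..., t-1} of {0, ..., m-1} into disjoint dyadic intervals.

    m must be a power of two. Returns half-open intervals (start, end); each has
    length 2^j and starts at a multiple of its own length.
    """
    if m <= 0 or m & (m - 1):
        raise ValueError("m must be a positive power of two")
    if not 0 <= t <= m:
        raise ValueError("t must lie in [0, m]")
    pieces, start, size = [], 0, m
    while size >= 1:
        if t & size:
            # This bit of t is set: take the next aligned block of that size.
            pieces.append((start, start + size))
            start += size
        size //= 2
    return pieces


print(dyadic_prefix_cover(5, 8))  # [(0, 4), (4, 5)]
```

Because each threshold set is a union of only a few dyadic intervals, calibration on the dyadic intervals carries over to thresholds with a small multiplicative loss, which is one standard way such a family stays small while remaining expressive.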

Upper Bound Framework

  • Online-to-Batch Reduction: An online multicalibration algorithm processes the i.i.d. samples as a stream. The batch predictor averages the resulting online predictors, and martingale concentration inequalities (Azuma-Hoeffding and, for variance-adaptive control, Freedman) transfer the online guarantee to population calibration.
  • Algorithm Instantiation: Sharp online multicalibration algorithms (Noarov et al., 2025; Hu et al., 2025) yield batch procedures with matching error rates via rounding and concentration-based transfer.
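
The shape of an online-to-batch reduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: "averaging" is interpreted as a uniform mixture over the per-round predictors (giving a randomized batch predictor), and `RunningMeanLearner` is a hypothetical toy stand-in that is not a multicalibration algorithm. The `snapshot`/`update` interface is likewise invented for the example.

```python
import random


class RunningMeanLearner:
    """Toy online learner (hypothetical; NOT a multicalibration algorithm).

    It predicts the running mean of past labels, rounded to a grid of
    grid_size + 1 equally spaced values in [0, 1].
    """

    def __init__(self, grid_size):
        self.grid_size = grid_size
        self.total = 0.0
        self.count = 0

    def snapshot(self):
        """Return the predictor the learner would use in the current round."""
        mean = self.total / self.count if self.count else 0.5
        value = round(mean * self.grid_size) / self.grid_size
        return lambda x: value

    def update(self, x, y):
        self.total += y
        self.count += 1


def online_to_batch(learner, samples, rng=None):
    """Run the learner over the sample stream and return a randomized batch
    predictor that, on each call, applies one per-round predictor chosen
    uniformly at random."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = rng or random.Random(0)
    snapshots = []
    for x, y in samples:
        snapshots.append(learner.snapshot())  # predictor used in this round
        learner.update(x, y)                  # then reveal the label

    def batch_predictor(x):
        return rng.choice(snapshots)(x)

    return batch_predictor


samples = [(i, i % 2) for i in range(1000)]
T = len(samples)
K = max(1, round(T ** (1 / 3)))  # grid size on the order of T^(1/3)
predict = online_to_batch(RunningMeanLearner(K), samples)
print(K, predict(0))
```

Swapping the toy learner for an actual online multicalibration algorithm would keep the wrapper unchanged; the concentration argument is what justifies that the mixture inherits the online guarantee on i.i.d. data.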

Relationship to Prior Work

The results close gaps between previously known upper and lower bounds for mean ECE, which are now shown to be loose. The work separates multicalibration sample complexity from marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and establishes matching lower bounds for randomized predictors, previously absent in the literature.

Notably, the minimax exponents for mean-ECE multicalibration in the batch and adversarial online settings now coincide.

Implications and Future Directions

Practical Implications

  • Fairness-Driven Learning: Practitioners enforcing multicalibration over polynomially many groups must budget for more samples than ordinary calibration requires: on the order of $\varepsilon^{-3}$ for mean ECE, or $\varepsilon^{-3/p}$ for weighted $L_p$ metrics ($1 \le p \le 2$), versus $\varepsilon^{-2}$.
  • Algorithm Selection: Randomization cannot beat these rates, since the lower bounds hold even for randomized predictors; the matching upper bound is itself attained by a randomized predictor.
  • Scalability: The logarithmic dependence on $|G|$ is tight in the polynomial regime, but characterizing the exact dependence in subpolynomial regimes (e.g., $|G| = \mathrm{polylog}(1/\varepsilon)$) is left open.

Theoretical Directions

  • Optimality for Larger $p$: Tight sample complexity for $L_p$ metrics when $p > 2$ remains open.
  • Derandomization: Whether deterministic predictors can achieve the same minimax rates or a gap exists is not yet resolved.
  • Structural Complexity: Integrating group structure (VC-dimension, partition complexity) for sharper sample complexity bounds merits investigation.
  • Uniform Convergence vs. Multicalibration Learning: Uniform convergence results do not imply production of multicalibrated predictors—highlighting the significance of this minimax analysis.

Conclusion

This paper rigorously establishes the minimax sample complexity for multicalibration in statistical learning, revealing that the additional demands of group-level calibration provoke a super-quadratic scaling with error tolerance, fundamentally distinguishing multicalibration from marginal calibration. These findings underpin both theoretical advances and practical algorithm selection in fairness-aware machine learning and highlight important avenues for future investigation concerning structural and algorithmic nuances.

Citation: "The Sample Complexity of Multicalibration" (2604.21923).


Explain it Like I'm 14

What this paper is about (big picture)

The paper asks a simple-sounding question: If we want a prediction system to be fair and reliable for many different subgroups of people (not just “on average”), how much data do we need to learn such a system?

This idea is called multicalibration. It strengthens ordinary calibration (where, for example, among all cases where the model says “70% chance,” about 70% should be positive) by demanding that this be true not just overall, but also separately for many subgroups (like different ages, regions, or combinations of features).

The authors pinpoint, up to small extra factors, the exact number of data points needed to reach a target accuracy for multicalibration. They also extend their results to stronger error measures and to other kinds of statistics beyond simple averages.

What questions the paper answers

In plain terms, the paper asks:

  • If we want our predictions to be well-matched to outcomes across many subgroups (multicalibration), how many training examples do we need to guarantee a small error, say at most ε?
  • How does this data requirement change when:
    • We measure error in different ways (some ways punish big mistakes more)?
    • We care about other targets besides the mean (like quantiles or expectiles)?
    • We have more or fewer subgroups we want to be correct on?

They aim for the “worst-case” answer—how much data is enough no matter which data distribution and which family of subgroups we pick (as long as the subgroup family isn’t too huge).

How they approach it (methods in everyday terms)

Think of teaching a weather forecaster to give probabilities (“30% chance of rain”). Ordinary calibration checks that, overall, days with a 30% prediction do indeed rain about 30% of the time. Multicalibration says: this must also hold separately for many subgroups—say, in each neighborhood, on weekdays vs. weekends, in winter vs. summer, and so on.

To figure out “how much data is enough,” the authors do two complementary things:

  • Build hard examples (lower bounds):
    • They cleverly design a world where, to be multicalibrated, a learner must uncover a lot of hidden structure—like a secret pattern spread across many small steps (a “staircase”), so you can’t just blur the details and still be accurate for all subgroups.
    • They use ideas from coding theory (think of many different, well-separated “secret codes”) to ensure there are lots of different possibilities the learner must distinguish between.
    • They prove that any method that succeeds on these hard examples must see at least about 1/ε³ examples (ignoring minor log factors).
    • They formalize this with a classic information-theory tool (Fano’s inequality), which says that if many possibilities are hard to tell apart, you need a lot of samples to be confident.
  • Turn strong online learners into batch predictors (upper bounds):
    • They take recent algorithms that learn “on the fly” (one example at a time) and convert them into “batch” predictors (trained on a static dataset) by averaging their decisions.
    • Using careful probability tools (like martingale concentration and refined versions of Freedman’s inequality), they show that these converted predictors also achieve multicalibration with about 1/ε³ samples (again, up to small extra factors).

Together, these show the rate is tight: you both need and can achieve about 1/ε³ samples.

What they found and why it matters

Here are the main findings, stated simply:

  • For standard multicalibration measured by Expected Calibration Error (ECE):
    • You need about 1/ε³ samples, and that many samples are also enough (up to small log factors), as long as the number of subgroups grows at most polynomially with 1/ε.
    • This separates multicalibration from ordinary (overall) calibration, which only needs about 1/ε² samples. In other words, being fair and accurate across many subgroups truly requires more data.
  • There’s a sharp threshold:
    • If the number of subgroups is fixed (doesn’t grow as you ask for smaller error), then multicalibration has the same sample need as ordinary calibration: about 1/ε².
    • But once the family of subgroups grows (even moderately), the need jumps to about 1/ε³.
  • Stronger error measures (weighted Lp metrics for 1 ≤ p ≤ 2):
    • The sample need becomes about 1/ε^(3/p).
    • So if you measure errors in a way that punishes big mistakes more (larger p), the data requirement goes down accordingly within this range.
  • Beyond means: other targets like expectiles and bounded-density quantiles:
    • Using a general template, the authors show the same kind of matching lower and upper bounds: about 1/ε^(3/p) samples for 1 ≤ p ≤ 2.

Why this matters:

  • It gives a clear, trustworthy target: if you want subgroup-level reliability with ECE ≤ ε across many groups, plan on gathering around 1/ε³ data points.
  • It informs both practitioners and theorists about what is and isn’t possible with limited data.
  • It highlights that demanding fairness across many subgroups is statistically costlier than just being correct on average—which helps explain why this is hard in practice.

What this means going forward (implications)

  • Practical planning: Teams aiming for fair, reliable predictions across many subgroups can now budget data collection more realistically. If you halve the error you want (ε gets smaller), the needed data grows roughly eightfold for ECE (because of the ε³ in the denominator).
  • Algorithm design: The fact that the best-known online methods convert to optimal batch methods suggests that advances in online learning can directly benefit batch training for multicalibration.
  • Broader scope: The results aren’t limited to averages. They apply to other statistics used in risk and decision-making (like quantiles and expectiles), making the findings valuable for a wide range of applications (credit, medicine, resource allocation, etc.).
  • Research directions: The paper settles the case for p between 1 and 2. It points to future work for even stronger error measures (p > 2), where the picture is not yet complete.
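
The "eightfold" rule of thumb above is just the exponent at work, and a few lines of arithmetic make it concrete. This sketch assumes the data need scales like 1/ε^(3/p) with everything else (log factors, constants) held fixed; the function name is hypothetical.

```python
def data_multiplier(error_shrink_factor, p=1.0):
    """How many times more data is needed when the target error is divided by
    error_shrink_factor, assuming the sample need scales like eps^(-3/p)."""
    if not 1.0 <= p <= 2.0:
        raise ValueError("the stated rates cover 1 <= p <= 2 only")
    return error_shrink_factor ** (3.0 / p)


print(data_multiplier(2))         # ECE (p = 1): 8.0
print(data_multiplier(2, p=2.0))  # p = 2: about 2.83
print(2 ** 2)                     # ordinary calibration (1/eps^2): 4
```

So halving the error costs about 8 times the data for ECE multicalibration, compared with 4 times for ordinary calibration.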

Overall, the paper delivers a clear message: achieving trustworthy, subgroup-fair predictions is possible, but it requires substantially more data than getting things right “on average.” The authors quantify exactly how much more, and they show that this cost persists even when using advanced, randomized prediction strategies.

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a concise list of concrete gaps and open problems suggested by the paper that future work could address.

  • Exact dependence on group family size: Characterize the sharp joint dependence of sample complexity on both ε and |G| (beyond “polylog(1/ε)” factors), especially in intermediate regimes where |G| grows slowly with 1/ε (e.g., polylogarithmic or subpolynomial growth), and sharpen the threshold behavior around κ=0.
  • Eliminate polylogarithmic gaps: Remove the polylog(1/ε) slack in both the lower and upper bounds (including the online-to-batch transfer), yielding explicit, tight constants and the exact leading-order dependence.
  • Deterministic vs randomized learners: Determine whether deterministic batch learners can achieve the ε^{-3} rate for mean-ECE multicalibration (and ε^{-3/p} for L_p), or prove lower bounds that separate deterministic from randomized predictors in the batch setting.
  • Tight L_p bounds for p>2: Establish matching lower and upper bounds for weighted L_p multicalibration when p∈(2,∞] (including L_∞), without relying on the lossy Hölder-based reduction from L_1.
  • Online-to-batch without transfer loss: Develop an online-to-batch reduction for multicalibration that avoids the current additive T^{-1/3} transfer loss, enabling batch lower bounds to imply nontrivial online lower bounds (or vice versa).
  • Distribution-dependent rates: Go beyond worst-case minimax to identify distribution-dependent (or structure-dependent) sample complexity (e.g., under smoothness, Lipschitz, sparsity, monotonicity in higher dimensions, margin or noise conditions) and determine when ε^{-2} is achievable under structural assumptions.
  • High-dimensional contexts: Extend lower and upper bounds to high-dimensional context spaces, quantifying dimension dependence and determining whether the ε^{-3} exponent persists or worsens with dimension.
  • General group families: Analyze multicalibration with non-binary or real-valued groups, infinite group families controlled by capacity measures (e.g., VC, Rademacher), and provide sharp sample complexity in terms of these complexity measures rather than |G|.
  • Data-adaptive groups: Characterize sample complexity when the group family is selected adaptively from data (auditing-style or two-stage procedures), including any additional penalties from adaptivity.
  • Broader elicitable properties: Relax the “regular” elicitable property assumptions to include properties with non-smooth identification functions or non-regular behavior (e.g., CVaR, modes, heavy-tailed functionals), and establish whether the ε^{-3/p} exponent extends or changes.
  • Quantiles without density bounds: Remove the bounded-density assumption for quantiles and identify the correct ε-dependence when the conditional density can vanish or spike at the target quantile.
  • Beyond Bernoulli/[0,1] outcomes: Extend the lower and upper bounds to unbounded or heavy-tailed outcomes (e.g., sub-Gaussian, sub-exponential, or finite-variance settings), quantifying how tails affect the exponent and polylog factors.
  • Non-i.i.d. data: Establish rates under dependent data (e.g., mixing sequences, covariate shift, label shift), and determine the minimal conditions under which ε^{-3} remains necessary and sufficient.
  • Model class constraints: Determine sample complexity when the predictor is constrained to a hypothesis class H (e.g., linear, neural nets), linking multicalibration learnability to the capacity and approximation properties of H.
  • Computation and practicality: Provide algorithms with provable runtime and memory guarantees that scale in |G| and the discretization K, avoid large oracle requirements, and assess whether ε^{-3} is achievable with practical resource bounds.
  • Discretization vs. continuity: Replace grid-based discretization with continuous predictors to avoid rounding losses; quantify the minimal support size |V(Q)| needed to achieve ε error and the tradeoff between K and sample complexity.
  • High-probability guarantees: Strengthen from in-expectation guarantees (plus Markov) to sharp high-probability bounds matching the ε^{-3} rate with optimal logarithmic dependence and minimal constants.
  • Precise phase transition at κ=0: Provide a fine-grained characterization (including constants) of the ε^{-2} vs ε^{-3} transition as |G| moves from constant to slowly growing, and identify the smallest growth of |G| that provably forces ε^{-3}.
  • Robustness to group misspecification: Analyze sensitivity to errors in group definitions (e.g., noisy or partial group membership, overlapping or fuzzy groups), and derive robust multicalibration guarantees and lower bounds.
  • Privacy constraints: Establish sample complexity under differential privacy or other information constraints, and determine whether privacy induces a penalty on the ε^{-3} exponent or only on polylog factors.
  • Auditing complexity: Quantify the sample complexity for auditing (verifying) multicalibration error of a fixed predictor to tolerance ε over a given G, and how it compares to learning complexity.
  • Multiclass/vector-valued outputs: Extend bounds to multi-class or vector-valued targets and properties, clarifying whether analogous ε-exponents hold and how they scale with label dimension.

Practical Applications

Overview

This paper establishes tight, minimax sample-complexity bounds for multicalibration in the batch (i.i.d.) setting. For mean calibration measured by Expected Calibration Error (ECE), the optimal sample complexity is approximately ε⁻³ (up to polylog factors in 1/ε and |G|), and for weighted Lp multicalibration with p ∈ [1,2] it is approximately ε⁻³ᐟᵖ. The results apply even when predictors are randomized and extend to elicitable properties beyond means (e.g., expectiles, bounded-density quantiles). A sharp “threshold” is shown: with a constant number of groups (|G| = O(1)), the sample complexity remains near ε⁻², but for any fixed κ>0 allowing polynomially many groups |G| ≤ ε^(−κ), the exponent increases to 3 (or 3/p).

These findings translate into practical guidance for data collection, model governance, algorithm design, and fairness auditing across sectors. Below are actionable use cases grouped by timeframe.

Immediate Applications

The following applications can be deployed with existing tools and workflows, leveraging the paper’s bounds and online-to-batch construction.

  • Sample-size planning and feasibility checks for fairness-calibrated models
    • What to do: Use n ≈ C * ε⁻³ * log|G| (mean ECE) or n ≈ C * ε⁻³ᐟᵖ * log|G| (weighted Lp with p ∈ [1,2]) to plan data collection and assess achievability of multicalibration targets across a given group family.
    • Sectors: healthcare (calibrated risk across demographics), finance/insurance (fair lending and pricing), advertising (CTR calibration across segments), education (dropout-risk calibration), energy (regional demand quantile calibration).
    • Tools/products/workflows:
    • “Calibration sample-size calculators” embedded in MLOps dashboards.
    • Governance checklists that block deployment if n is insufficient for the stated ε and |G|.
    • Assumptions/dependencies:
    • i.i.d. samples; finite group family with |G| growing at most polynomially with 1/ε; chosen metric is ECE (or weighted Lp with p ∈ [1,2]); labels and groups must be available and reliable.
  • Scoping group definitions to match data budgets (threshold phenomenon)
    • What to do: If data is scarce, keep |G| constant or very small to stay near ε⁻² sample complexity; if many subgroups must be calibrated, expect ε⁻³ scaling and consider staged or prioritized deployment.
    • Sectors: any regulated domain with subgroup fairness mandates (e.g., lending, healthcare, hiring).
    • Tools/workflows:
    • Policy templates that differentiate “core groups” (constant |G|) from “extended groups” (|G| polynomial in 1/ε) with corresponding data requirements.
    • Assumptions/dependencies:
    • Accepts the trade-off that fewer groups may miss some fairness concerns; legal/compliance requirements may constrain group scoping choices.
  • Online-to-batch deployment recipe for multicalibration with guaranteed rates
    • What to do: Implement the online multicalibration algorithm (e.g., from Noarov et al., 2025; Hu et al., 2025) and convert to a batch predictor by averaging predictions over T rounds. Set prediction-bucket count K ≈ T^{1/3}; use a light/heavy bucket threshold τ ≈ Õ(1/T).
    • Sectors: software/ML platforms, fintech, health tech, ad-tech.
    • Tools/workflows:
    • Production-ready library that:
    • discretizes predictions onto a grid of size K ≈ T^{1/3},
    • streams i.i.d. examples,
    • outputs a randomized or deterministically rounded calibrated predictor.
    • Assumptions/dependencies:
    • Randomized outputs may be used or rounded; i.i.d. data stream; metric is ECE or weighted Lp (p ∈ [1,2]); careful bucketization and variance-adaptive concentration (Freedman-style) used in the reduction.
  • Calibrated quantiles and expectiles for risk management
    • What to do: Apply weighted Lp multicalibration to elicitable properties, e.g., quantiles (bounded-density) and expectiles, with the same ε⁻³ᐟᵖ sample scaling (p ∈ [1,2]). Use this for calibrated prediction intervals and tail-risk control.
    • Sectors: finance (VaR/ES proxies via quantiles/expectiles), healthcare (predictive intervals for readmissions), energy (peak demand quantiles), logistics (service-level targets).
    • Tools/workflows:
    • Property-calibration modules for quantiles and expectiles with heavy- vs light-bucket handling (τ ≈ Õ(1/T)) and K ≈ T^{1/3}.
    • Assumptions/dependencies:
    • Property must satisfy regularity and bounded-density assumptions; p ∈ [1,2]; i.i.d. sampling; careful handling of small-mass (“light”) prediction buckets.
  • Realistic auditing thresholds and “impossibility” warnings
    • What to do: Translate existing n and |G| into best-possible multicalibration error baselines (≈ Õ(n^{-1/3}) for mean ECE) to set pass/fail thresholds in audits; avoid over-promising ε that data cannot support.
    • Sectors: compliance functions across regulated industries.
    • Tools/workflows:
    • Auditor dashboards that compute minimal achievable ε given n and |G| and flag unattainable goals.
    • Assumptions/dependencies:
    • Assumes the learner and auditor use comparable metrics (ECE or weighted Lp) and that i.i.d. assumptions approximately hold.
  • Academic benchmarking and experimental design
    • What to do: Use ε⁻³ and ε⁻³ᐟᵖ baselines to size datasets for reproducible multicalibration experiments; select K ≈ T^{1/3} for online studies; report |G| and bucket mass thresholds explicitly.
    • Sectors: academia, industrial research labs.
    • Tools/workflows:
    • Benchmark protocols that standardize ε, p, |G|, K, and τ; leaderboards that report calibrated scores and sample budgets.
    • Assumptions/dependencies:
    • Comparable evaluation metrics; explicit reporting of group definitions and bucketing schemes.
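
The planning formulas quoted in this list can be wired into a small calculator along the following lines. This is a back-of-the-envelope sketch only: the constant C is not given by the summary and is exposed as a placeholder parameter (default 1), polylogarithmic factors in 1/ε are ignored, and all function names are hypothetical.

```python
import math


def required_samples(eps, num_groups, p=1.0, constant=1.0):
    """Heuristic n ≈ C * eps^(-3/p) * log|G|, with C an unknown placeholder."""
    if not (0.0 < eps < 1.0 and 1.0 <= p <= 2.0):
        raise ValueError("need 0 < eps < 1 and 1 <= p <= 2")
    log_groups = math.log(max(num_groups, 2))  # avoid log(1) = 0 for tiny families
    return math.ceil(constant * eps ** (-3.0 / p) * log_groups)


def best_achievable_eps(n, num_groups, p=1.0, constant=1.0):
    """Invert the heuristic above: eps ≈ (C * log|G| / n)^(p/3)."""
    log_groups = math.log(max(num_groups, 2))
    return (constant * log_groups / n) ** (p / 3.0)


def grid_size(num_rounds):
    """Prediction-bucket count K on the order of T^(1/3)."""
    return max(1, round(num_rounds ** (1.0 / 3.0)))


n = required_samples(eps=0.1, num_groups=100)
print(n)                                            # samples suggested for eps = 0.1
print(round(best_achievable_eps(n, 100), 3))        # recovers roughly 0.1
print(grid_size(1000))                              # 10
```

Because C is unknown, the outputs are best read as relative budgets (how n moves when ε, p, or |G| changes) rather than absolute sample counts; a governance check would need an empirically calibrated constant before using them as hard thresholds.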

Long-Term Applications

These applications require additional research, engineering, or standard-setting to scale reliably.

  • Adaptive group selection and data acquisition optimization
    • What to do: Use the trade-off revealed by the threshold phenomenon to dynamically select subgroup sets that maximize fairness coverage per sample. Employ dyadic or code-based constructions (as in the paper’s group-family ideas) to cover many threshold-like subpopulations with polylog(|G|) groups.
    • Sectors: any fairness-sensitive domain.
    • Tools/products/workflows:
    • “Group design” optimizers that propose subgroup taxonomies under data constraints; active-learning systems that oversample “light buckets” or underrepresented subgroups.
    • Assumptions/dependencies:
    • Requires robust governance over group definitions; careful attention to ethical and legal implications of dynamic subgrouping.
  • Standard-setting for fairness claims and sample budgets
    • What to do: Regulators and standards bodies adopt sample-size guidance aligned with ε⁻³ (ECE) and ε⁻³ᐟᵖ (Lp) for multicalibration claims; mandate transparency on ε, |G|, and n.
    • Sectors: public policy, financial regulation, healthcare oversight.
    • Tools/workflows:
    • Regulatory templates that tie permitted claims to minimum data and documented group scopes.
    • Assumptions/dependencies:
    • Consensus on metrics and acceptable p; agreement on i.i.d. validity or conservative adjustments for distribution shift.
  • Integration into AutoML/MLOps as a first-class “calibration budget”
    • What to do: Build AutoML pipelines that jointly tune ε, |G|, K, and τ and couple model selection with data collection planning to meet multicalibration targets.
    • Sectors: software platforms providing ML-as-a-service.
    • Tools/workflows:
    • End-to-end “calibration-aware” model training and validation; automated grid selection (K ≈ T^{1/3}) and robust rounding strategies.
    • Assumptions/dependencies:
    • Engineering work to make online-to-batch reductions robust under operational non-i.i.d. conditions and concept drift.
  • Cross-organization data consortia to meet multicalibration sample needs
    • What to do: Where fairness mandates require many groups and small ε, create privacy-preserving consortia to pool data and achieve n ≈ ε⁻³ log|G|.
    • Sectors: healthcare provider networks, financial alliances, public sector collaborations.
    • Tools/workflows:
    • Federated or secure aggregation pipelines that incorporate multicalibration targets; agreements on group definitions and metrics.
    • Assumptions/dependencies:
    • Privacy, security, and governance frameworks; fairness metrics alignment across partners; handling heterogeneity (non-i.i.d.) across sites.
  • Extending to p > 2, complex properties, and non-i.i.d./drifting environments
    • What to do: Develop theory and algorithms with tight sample complexity for Lp with p>2, multi-class/multi-output settings, time-varying distributions, and partial observability.
    • Sectors: high-frequency finance, real-time healthcare triage, autonomous systems.
    • Tools/workflows:
    • Drift-aware multicalibration; robust online-to-batch reductions; monitoring and alarms for bucket-mass shifts.
    • Assumptions/dependencies:
    • New concentration and stability tools; validated calibration metrics under shift; acceptance of randomized strategies or principled de-randomization.
  • Property calibration as a risk-control primitive in decision systems
    • What to do: Use quantile/expectile multicalibration to guarantee subgroup-valid service levels (e.g., fill rates, wait times) or tail-risks (e.g., VaR) with explicit data budgets.
    • Sectors: logistics, operations management, finance, healthcare operations.
    • Tools/workflows:
    • Decision policies that consume calibrated quantiles/expectiles; dashboards linking targeted service levels to required data and achieved calibration.
    • Assumptions/dependencies:
    • Regularity and bounded-density assumptions for properties; careful translation from calibrated properties to decision thresholds.
  • Education and workforce upskilling on calibration-aware fairness
    • What to do: Incorporate ε⁻³ (and ε⁻³ᐟᵖ) planning into ML curricula and practitioner training; standardize reporting of calibration metrics, group definitions, and data sufficiency.
    • Sectors: academia, professional training, industry certifications.
    • Tools/workflows:
    • Course modules, case studies, and certification criteria tied to sample-complexity-aware fairness practices.
    • Assumptions/dependencies:
    • Broad adoption and alignment with regulatory and industry standards.

Notes on Key Assumptions and Dependencies

  • i.i.d. sampling is assumed for the batch rates; significant distribution shift or dependence can degrade guarantees and may require drift-aware extensions.
  • Metrics: ECE (p=1) and weighted Lp with p ∈ [1,2] are covered; guarantees for p>2 are not established here.
  • Group families must be finite; allowing |G| to grow polynomially in 1/ε (|G| ≤ ε^(−κ) for any fixed κ > 0) raises the ECE sample exponent from 2 to 3 (3/p for weighted Lp). Constant-size |G| (κ = 0) keeps the ε⁻² rate but covers fewer subgroups.
  • Predictors may be randomized; some deployments may prefer deterministic outputs, requiring rounding that maintains guarantees up to small additive losses.
  • For properties beyond the mean (e.g., quantiles, expectiles), additional regularity and bounded-density assumptions are required for the stated rates and reductions.
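
For data-budget planning, the rates in these notes can be turned into order-of-magnitude estimates. The sketch below is illustrative only: `sample_budget_order` is a hypothetical helper, and it deliberately ignores the constants and polylogarithmic factors hidden in the Θ̃ notation, so its outputs show scaling rather than usable sample counts.

```python
def sample_budget_order(eps, p=1.0, growing_groups=True):
    """Leading-order sample count; constants and polylog factors are ignored."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if not 1.0 <= p <= 2.0:
        raise ValueError("rates are only stated for p in [1, 2]")
    if growing_groups:
        # |G| allowed to grow polynomially in 1/eps: exponent 3/p.
        return eps ** (-3.0 / p)
    if p != 1.0:
        # Assumption: the constant-size rate is only quoted here for ECE (p = 1).
        raise ValueError("constant-size rate is only stated for p = 1")
    return eps ** -2.0


# Compare the two regimes at a few target errors.
for eps in (0.1, 0.05, 0.01):
    print(eps,
          round(sample_budget_order(eps)),
          round(sample_budget_order(eps, growing_groups=False)))
```

Halving ε multiplies the leading term by 8 in the growing-group regime and by 4 in the constant-size regime, which is the gap between the two exponents.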

Glossary

  • adversarial online settings: An online learning scenario where the data sequence may be chosen adversarially against the learner. Example: "Together with the tight online lower bound of \cite{collina2026optimal}, our result shows that the minimax exponents for mean-ECE multicalibration agree in the batch and adversarial online settings."
  • Azuma--Hoeffding bound: A concentration inequality for martingales with bounded differences, used to control deviations of sums of dependent random variables. Example: "An Azuma--Hoeffding bound with a union over groups and sign patterns yields"
  • Bernoulli regression: A regression setting where the outcome is Bernoulli (0/1) with mean depending on the context. Example: "The hard instance is a one-dimensional Bernoulli regression with a monotone mean function on $[m]$, showing that the optimal rate already arises on extremely simple instances."
  • bounded-density quantiles: Quantiles under a regularity assumption that the conditional density is bounded near the target quantile, aiding calibration results. Example: "expectiles and bounded-density quantiles."
  • bucketed multicalibration: A form of multicalibration evaluated over discretized prediction “buckets” rather than the full continuum of values. Example: "bucketed multicalibration in $L_\infty$"
  • coding theory: The study of codes for efficient and robust data representation; here used to construct hard instances and group families via packing/low-correlation properties. Example: "borrowing from a coding theory perspective"
  • discrepancy theory: A field analyzing how uniformly elements can be distributed across set systems; used here to justify small, well-behaved group bases. Example: "a coding theory and a discrepancy theory insight enables both of these properties to hold at once."
  • dyadic intervals: Intervals whose lengths are powers of two, often used in multiscale decompositions. Example: "dyadic intervals at different scales"
  • dyadic peeling: A technique that partitions events or quantities into dyadic (power-of-two) scales to apply concentration inequalities more finely. Example: "with a dyadic peeling component"
  • elicitable properties: Statistical functionals that are uniquely characterized as minimizers of an expected loss; admit identification functions and calibration. Example: "a regular class of elicitable properties"
  • Expected Calibration Error (ECE): A metric that aggregates the magnitudes of prediction-conditional biases across prediction values, weighting each value by how often it is predicted. Example: "The standard quantitative measure of mis-calibration is the Expected Calibration Error (ECE),"
  • Expected Multicalibration Error: The worst-case (over groups) expected calibration error, measuring multicalibration quality. Example: "The Expected Multicalibration Error is the maximum, over groups, of the group-weighted ECE."
  • expectiles: Risk measures/generalized means defined via asymmetric squared loss; an elicitable property used in calibration. Example: "expectiles and bounded-density quantiles."
  • Fano's inequality: An information-theoretic lower bound relating probability of error to mutual information, used to derive sample complexity lower bounds. Example: "via an application of Fano's inequality"
  • finitely supported randomized predictor: A predictor that outputs random predictions supported on a finite set of values. Example: "A finitely supported randomized predictor on $\mathcal{X}$ is a rule"
  • Freedman's inequality: A martingale concentration inequality controlling deviations using predictable quadratic variation. Example: "standard Freedman inequality controls a martingale sum in terms of its predictable quadratic variation"
  • Gilbert-Varshamov type: Refers to a packing bound from coding theory ensuring many well-separated codewords. Example: "the packing argument is of Gilbert-Varshamov type"
  • Hamming distance: The number of positions at which two codewords differ; used to measure separation in coded constructions. Example: "via a direct connection via Hamming distance."
  • Holder's inequality: A fundamental inequality relating Lp norms, used here to pass from L1 to Lp lower bounds. Example: "For this range of $p$, simply applying Holder's inequality"
  • identification function: A function whose zero expectation characterizes the target elicitable property under correct predictions. Example: "the expected identification function $M_\Gamma(v,t)$"
  • Johnson-Lindenstrauss lemma: A result that random projections approximately preserve distances in high dimensions; used to derive low-rank approximations. Example: "via results based on Johnson-Lindenstrauss lemma"
  • KL divergence: Kullback–Leibler divergence, measuring the discrepancy between probability distributions. Example: "are $O(1/m^2)$-close in KL divergence"
  • low-correlation code: A code whose codewords have small pairwise correlations, aiding construction of nearly orthogonal group functions. Example: "obtained from a low-correlation code."
  • low-rank factorizations: Matrix decompositions with small rank; used here to approximate identity matrices for interval bases. Example: "low-rank factorizations of the identity matrix"
  • martingale difference: The increment in a martingale sequence with zero conditional mean given the past. Example: "differs from the corresponding empirical bias by a martingale difference:"
  • minimax sample complexity: The smallest number of samples needed by any algorithm to guarantee a target error against the worst-case instance. Example: "We resolve the minimax sample-complexity of multicalibration"
  • multicalibration: A strengthening of calibration requiring calibration to hold simultaneously across many groups/subpopulations. Example: "Multicalibration asks for substantially more than plain (marginal) calibration."
  • omniprediction: Learning predictors that are simultaneously near-optimal for many loss functions. Example: " --- omniprediction \citep{gopalan2021omnipredictors,gopalan2023loss,okoroafor2025near} ---"
  • omnipredictors: Predictors achieving near-optimal performance across a family of losses, often enabled by multicalibration. Example: "omnipredictors for broad convex loss families"
  • online-to-batch reduction: A technique converting online learning guarantees into batch statistical guarantees. Example: "online-to-batch reduction"
  • predictable quadratic variation: The cumulative conditional variance term in martingale analysis, central to Freedman-type inequalities. Example: "in terms of its predictable quadratic variation"
  • property multicalibration: The extension of multicalibration from means to general elicitable properties. Example: "introduced the general framework of property multicalibration"
  • Rademacher complexity: A measure of function class richness via average correlation with random signs, used in uniform convergence bounds. Example: "Rademacher-complexity analyses"
  • staircase maps: Monotone, stepwise functions used to encode parameters in hard-instance constructions for calibration. Example: "which is formalized via what we call staircase maps"
  • swap multicalibration: A stronger variant of multicalibration in which the worst-case group may be chosen separately for each prediction value rather than once for all values; the guarantees cited here are for the online setting. Example: "targets online swap multicalibration guarantees"
  • uniform convergence: The property that empirical metrics converge uniformly to their expectations over a function class. Example: "give uniform convergence bounds"
  • variance-adaptive Freedman strategy: A concentration approach adapting to variance levels (bucket masses) in martingale analyses. Example: "we follow a variance-adaptive Freedman strategy (with a dyadic peeling component)"
  • Walsh basis: An orthogonal basis of square-wave functions; subsampling it yields compact group families for threshold approximation. Example: "subsampled Walsh basis technique"
  • weighted L_p multicalibration: A family of multicalibration metrics measuring p-th moments of bucketwise bias, weighted by bucket mass. Example: "weighted $L_p$ multicalibration metric"
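
To make the ECE and Expected Multicalibration Error entries concrete, here is a minimal empirical sketch. It assumes a predictor with finitely many output values, binary group membership, and normalization by the total sample size; the paper's exact weighting convention may differ, and the function names `group_ece` and `multicalibration_error` are hypothetical.

```python
from collections import defaultdict


def group_ece(preds, outcomes, members):
    """Group-weighted empirical ECE: sum over prediction values v of
    |sum of (y - v) over group members predicted v|, divided by n."""
    n = len(preds)
    bias = defaultdict(float)  # prediction value -> accumulated signed bias
    for v, y, g in zip(preds, outcomes, members):
        if g:
            bias[v] += y - v
    return sum(abs(b) for b in bias.values()) / n


def multicalibration_error(preds, outcomes, groups):
    """Maximum over groups of the group-weighted empirical ECE."""
    return max(group_ece(preds, outcomes, m) for m in groups.values())


# Toy data: calibrated on the whole population, miscalibrated on a subgroup.
preds = [0.5, 0.5, 0.5, 0.5]
ys = [1, 0, 1, 0]
groups = {"all": [1, 1, 1, 1], "first": [1, 0, 1, 0]}
print(group_ece(preds, ys, groups["all"]))
print(multicalibration_error(preds, ys, groups))
```

In this toy case the marginal ECE is zero while the multicalibration error is positive, which is the sense in which multicalibration asks for more than marginal calibration.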
