Bayes-Dependent Universal Rates
- Bayes-dependent universal rates are distribution-adaptive learning rates that define the fastest excess risk decay achievable by ERM in agnostic binary classification.
- They are characterized by a trichotomy—exponential, super-root, and arbitrarily slow rates—based on the presence of centered eluder and VC-eluder sequences near the Bayes-optimal classifier.
- This framework highlights that local combinatorial complexity, rather than global VC-dimension, determines practical learning curves, informing better model selection and regularization strategies.
Bayes-dependent universal rates are precise, distribution-adaptive learning rates that describe the fastest possible decay of excess error achieved by empirical risk minimization (ERM) in the agnostic binary classification setting, contingent on the combinatorial structure of the hypothesis class relative to a given Bayes-optimal classifier. These rates sharpen the conventional uniform (worst-case) rates by linking attainable sample complexity to the local complexity of the class at the Bayes rule, offering a fine-grained, instance-specific description of learning curves beyond classical PAC learning.
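To make the object of study concrete, here is a minimal sketch of ERM for binary classification on a toy finite class of threshold functions. All names and the noise model are hypothetical illustrations, not constructions from the paper; the point is only that ERM returns the hypothesis with the lowest empirical risk, and its excess risk over the Bayes-optimal classifier is the quantity whose decay rate this framework characterizes.

```python
import random

def erm(hypotheses, sample):
    """Return the hypothesis with the fewest mistakes on the sample
    (empirical risk minimization over a finite class)."""
    def empirical_risk(h):
        return sum(h(x) != y for x, y in sample) / len(sample)
    return min(hypotheses, key=empirical_risk)

# Toy class: thresholds on [0, 1]. Labels are flipped with probability 0.1,
# so the Bayes-optimal classifier is the threshold at 0.5 (an agnostic setup:
# no hypothesis achieves zero risk).
thresholds = [i / 10 for i in range(11)]
hypotheses = [lambda x, t=t: x >= t for t in thresholds]

random.seed(0)
sample = []
for _ in range(500):
    x = random.random()
    y = (x >= 0.5) if random.random() > 0.1 else not (x >= 0.5)
    sample.append((x, y))

best = erm(hypotheses, sample)
# With 500 noisy samples, ERM recovers (or nearly recovers) the Bayes threshold.
print(best(0.7), best(0.3))
```

The excess risk of `best` relative to the 0.5-threshold, as a function of the sample size, is exactly the learning curve whose possible decay rates the trichotomy below classifies.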
1. Trichotomy of Agnostic Universal Rates
In the agnostic setting, where the labeling function may not belong to the class and the Bayes-optimal classifier minimizes risk under the arbitrary true distribution, the paper establishes that the universal learning curve of ERM—measured as excess risk—must take one of only three possible forms for any fixed distribution and target (Bayes-optimal) classifier ([Theorem 3]):
- Exponential rate ($e^{-n}$): Rapid, essentially parametric convergence.
- Super-root rate ($o(1/\sqrt{n})$): Any rate strictly faster than $1/\sqrt{n}$, though not necessarily as fast as $1/n$.
- Arbitrarily slow rates: The excess risk can decay more slowly than any prescribed function tending to zero.
This trichotomy is both necessary and sufficient: for every concept class $\mathcal{H}$ and every Bayes-optimal classifier $h^\star$, there is a uniquely determined rate from this list that characterizes the fastest possible decay of ERM’s excess risk over distributions for which $h^\star$ is Bayes-optimal.
2. Combinatorial Characterization via Centered Sequences
Which rate occurs in a given scenario is determined by two relative combinatorial dimensions, both defined centered at the Bayes-optimal classifier $h^\star$:
- Eluder sequence centered at $h^\star$: An infinite sequence of domain points $x_1, x_2, \ldots$, labeled by $h^\star$, such that for each $i$ there is a function in the class that matches $h^\star$ on the previous points $x_1, \ldots, x_{i-1}$ but disagrees at $x_i$.
- VC-eluder sequence centered at $h^\star$: An infinite star-eluder sequence with additional VC-type shattering properties in the version space after seeing the previous data.
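The eluder-sequence condition can be checked mechanically for small finite classes. The sketch below is a hypothetical illustration (the function name and the toy singletons class are mine, not the paper's): a finite prefix qualifies if each point admits a witness hypothesis agreeing with $h^\star$ on all earlier points but disagreeing at the current one.

```python
def is_eluder_sequence(points, h_star, hypotheses):
    """Check whether `points` is a (finite prefix of an) eluder sequence
    centered at h_star: for each x_i there must exist some h in the class
    that agrees with h_star on x_1, ..., x_{i-1} but disagrees at x_i."""
    for i, x in enumerate(points):
        prefix = points[:i]
        witnesses = [
            h for h in hypotheses
            if all(h(p) == h_star(p) for p in prefix) and h(x) != h_star(x)
        ]
        if not witnesses:
            return False
    return True

# Toy example: singletons over {0, ..., 9}, with h_star the all-zeros function.
# Any sequence of distinct points is an eluder sequence centered at h_star,
# since the singleton at x disagrees with h_star only at x.
domain = range(10)
h_star = lambda x: 0
singletons = [lambda x, a=a: int(x == a) for a in domain]

print(is_eluder_sequence([3, 7, 1], h_star, singletons + [h_star]))  # → True
```

Note the contrast: the class containing only `h_star` itself admits no witness at any point, so no eluder sequence exists and the check fails immediately.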
The agnostic Bayes-dependent ERM rate falls into one of the following cases ([Theorem 3]):
| Rate | Condition on $\mathcal{H}$ w.r.t. $h^\star$ (centered at the Bayes-optimal classifier) |
|---|---|
| $e^{-n}$ | No infinite eluder sequence centered at $h^\star$ |
| $o(1/\sqrt{n})$ | Infinite eluder sequence but no infinite VC-eluder sequence at $h^\star$ |
| Arbitrarily slow | Infinite VC-eluder sequence centered at $h^\star$ |
This granular, target-centered complexity distinguishes Bayes-dependent rates from classical ones, which depend only on the global structure of $\mathcal{H}$.
3. Relationship to Uniform and Universal Learning Rates
Uniform PAC learning theory yields general sample complexity bounds based on the global VC dimension, with rates at best $1/\sqrt{n}$ in the agnostic case for general infinite classes. Bayes-dependent universal rates can be much faster for specific distributions: whenever the class is “simple” locally around the Bayes-optimal classifier, ERM can achieve exponential rates, irrespective of high global complexity.
The table below highlights the relationship:
| Category | Uniform Rate (PAC) | Bayes-Dependent Universal Rate |
|---|---|---|
| Finite classes | $\sqrt{\log\lvert\mathcal{H}\rvert/n}$ | $e^{-n}$ |
| Infinite VC | $\sqrt{d/n}$ | $e^{-n}$, $o(1/\sqrt{n})$, or arbitrarily slow (depends on $h^\star$) |
| Localized structure | $\sqrt{d/n}$ | $e^{-n}$ (if no centered eluder seq.) |
A plausible implication is that conventional learning theory may be substantially pessimistic for “well-behaved” distributions or target functions.
4. Examples and Applicability
Concrete classes illustrate each scenario:
- $e^{-n}$: Finite class, or infinite class where, at the Bayes-optimal $h^\star$, no infinite eluder sequence exists (e.g., finite-valued thresholds where the Bayes-optimal classifier does not admit such a sequence locally).
- $o(1/\sqrt{n})$: Infinite class where the local structure at the Bayes-optimal classifier allows eluder sequences but no VC-eluder sequence; e.g., certain countably infinite singleton classes.
- Arbitrarily slow: High local complexity (an infinite VC-eluder sequence centered at $h^\star$), as can occur in certain set-theoretically complex or fully nonparametric problems.
Notably, the precise rate is determined not by the global VC dimension, but by the local combinatorial structure at the Bayes-optimal classifier.
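The singletons scenario can be illustrated empirically. The simulation below is a hedged sketch under assumptions of my own choosing (domain size, noise level, encoding of the all-zeros hypothesis as `None`): it runs ERM over a singletons-plus-all-zeros class under label noise, where the Bayes-optimal classifier is all-zeros, and shows that even with many competing singletons, ERM locks onto the Bayes rule at moderate sample sizes, far better than a worst-case $1/\sqrt{n}$ picture would suggest.

```python
import random

random.seed(1)
DOMAIN = list(range(20))
NOISE = 0.1  # each label is 1 with probability 0.1; the Bayes rule is all-zeros

def draw_sample(n):
    return [(random.choice(DOMAIN), 1 if random.random() < NOISE else 0)
            for _ in range(n)]

def erm(sample):
    """ERM over the singletons class plus the all-zeros function.
    `None` encodes the all-zeros hypothesis; an integer a encodes 1{x == a}."""
    candidates = [None] + DOMAIN   # ties resolve in favor of all-zeros
    def mistakes(a):
        return sum((0 if a is None else int(x == a)) != y for x, y in sample)
    return min(candidates, key=mistakes)

# Count, across repeated trials, how often ERM with 200 samples returns the
# Bayes-optimal (all-zeros) hypothesis, i.e., achieves zero excess risk.
hits = sum(erm(draw_sample(200)) is None for _ in range(50))
print(hits, "of 50 trials recovered the Bayes rule exactly")
```

A singleton at $a$ can only beat the all-zeros rule if, among the roughly ten samples landing at $a$, more than half have their (probability-0.1) positive label, which is very unlikely; this is the mechanism by which the locally simple structure yields fast convergence.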
5. Implications for Machine Learning Theory and Practice
This theory demonstrates that, in the agnostic setting, the achievable ERM learning curve is dictated by the data distribution through the identity of the Bayes-optimal classifier and the local complexity of the function class at that classifier. This perspective has significant consequences:
- In practice, ERM can show much faster convergence than uniform PAC bounds would predict on favorable distributions, but may be arbitrarily slow where the local class is combinatorially rich.
- Model or hypothesis class selection should consider not just global VC complexity, but the combinatorial structure around plausible Bayes-optimal hypotheses.
- Practitioners and theorists can use these results to “diagnose” and predict ERM excess risk learning curves for specific data and model choices.
A plausible implication is that adoption of more refined/matched hypothesis classes or data-dependent regularization may yield substantial improvements on real datasets where the local structure is simple (i.e., avoids centered infinite eluder or VC-eluder sequences).
6. Limitations and Nuances
These results pertain to ERM for binary classification in the agnostic setting. The trichotomy and structural characterizations rely on the version space and labeling structure at the Bayes-optimal classifier, which is typically unobserved in practice; thus the rates are fundamentally theoretical—they reveal what is possible when the target is known, but in practice the statistical learner remains agnostic.
Improper learners are excluded; the trichotomy may be further subdivided for procedures beyond ERM that are not restricted to output hypotheses in the class $\mathcal{H}$.
7. Table: Summary of Agnostic Bayes-Dependent ERM Rates
| Rate | Centered Combinatorial Structure at Bayes-optimal $h^\star$ | Example |
|---|---|---|
| $e^{-n}$ | No infinite eluder sequence at $h^\star$ | Finite class; isolated $h^\star$ |
| $o(1/\sqrt{n})$ | Infinite eluder but no infinite VC-eluder at $h^\star$ | Certain infinite singleton classes |
| Arbitrarily slow | Infinite VC-eluder sequence at $h^\star$ | Class of full VC-type complexity at $h^\star$ |
References to Key Statements
- Trichotomy: Theorem 3 (“Bayes-dependent Agnostic Universal Rates”)
- Centered combinatorial definitions: Section 5, definitions of eluder/VC-eluder sequences
- Implications, limitations, and practical significance: Sections 5 and Discussion
These findings establish a foundational taxonomy for Bayes-dependent universal rates in agnostic learning with ERM, precisely classifying when and how much faster than classical rates distribution-specific learning curves can decay.