
PAC Indistinguishability: Theory & Algorithms

Updated 26 December 2025
  • PAC Indistinguishability is a framework that generalizes standard PAC learning by focusing on making a predictor indistinguishable from a target using a class of outcome-based distinguishers.
  • It leverages metric entropy and dual Minkowski norms to tightly characterize sample complexity, linking classical PAC and agnostic L₁-learning regimes.
  • Practical algorithms like Distinguisher-Covering and Multiaccuracy Boost demonstrate how theoretical bounds translate into efficient predictor selection and update procedures.

PAC indistinguishability, also termed no-access Outcome Indistinguishability (OI), generalizes the standard PAC (Probably Approximately Correct) learning paradigm to a setting where the goal is to output a predictor $p$ that a class $D$ of distinguishers cannot tell apart from a target predictor $p^*$, based on the observable outcomes derived from the predictions. The distinguishing power of $D$, the interplay between metric entropy and sample complexity, and the duality connections to convex geometry are central to the theory, yielding a framework that interpolates between classical PAC learning and agnostic $L_1$-learning depending on the choice of $D$ (Hu et al., 2022).

1. Formal Definition and Framework

Let $X$ denote an instance space and $\mu \in \Delta_X$ a probability distribution over $X$. A predictor $p: X \to [0,1]$ induces a joint law $\mu_p$ on $X \times \{0,1\}$: first sample $x \sim \mu$, then generate $o \sim \mathrm{Bernoulli}(p(x))$. Fix a distinguisher class $D \subseteq \{d: X \times \{0,1\} \to \{0,1\}\}$, possibly randomized.

The distinguishing advantage for $d \in D$ and predictors $p, p'$ is given by

$$\mathrm{Adv}_{\mu, d}(p, p') = \left|\Pr_{(x, o) \sim \mu_p}\left[d(x, o) = 1\right] - \Pr_{(x, o) \sim \mu_{p'}}\left[d(x, o) = 1\right]\right|.$$

The maximized distinguishing advantage over $D$ is

$$\mathrm{Adv}_{\mu, D}(p, p') = \sup_{d \in D} \mathrm{Adv}_{\mu, d}(p, p').$$

A predictor $p$ is $(D,\epsilon)$-OI to $p'$ under $\mu$ if $\mathrm{Adv}_{\mu, D}(p,p') \le \epsilon$.

Expressing the distinguisher action as a function $f_d: X \to [-1,1]$, with $f_d(x) = \Pr[d(x,1)=1] - \Pr[d(x,0)=1]$, the distinguishing advantage becomes

$$\mathrm{Adv}_{\mu, d}(p, p') = \left|\mathbb{E}_{x \sim \mu}[f_d(x)(p(x) - p'(x))]\right|.$$

Thus $\mathrm{Adv}_{\mu, D}(p,p')$ corresponds to the dual Minkowski semi-norm (identifying each $d \in D$ with its function $f_d$):

$$\|p - p'\|^*_{D, \mu} := \sup_{f \in D} \left|\mathbb{E}_{x \sim \mu}[f(x)(p(x) - p'(x))]\right|.$$
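For intuition, these quantities can be computed exactly on a small finite domain. The Python sketch below is purely illustrative (the domain, distribution $\mu$, predictors, and distinguisher class are made up); it checks that the outcome-based and $f_d$-based forms of the advantage agree and evaluates the dual semi-norm over a finite class.

```python
import numpy as np

# Toy 4-point instance space with a uniform distribution mu (made-up data).
mu = np.full(4, 0.25)

# Two predictors p, p': X -> [0, 1], represented as arrays over the domain.
p       = np.array([0.2, 0.8, 0.5, 0.9])
p_prime = np.array([0.3, 0.6, 0.5, 0.7])

# A distinguisher d is given by acceptance probabilities d[x, o] = Pr[d(x, o) = 1];
# its action reduces to f_d(x) = Pr[d(x,1)=1] - Pr[d(x,0)=1] in [-1, 1].
def f_of(d):
    return d[:, 1] - d[:, 0]

def accept_prob(d, p, mu):
    """Pr_{(x,o) ~ mu_p}[d(x, o) = 1]: sample x ~ mu, o ~ Bernoulli(p(x)), run d."""
    return np.sum(mu * (p * d[:, 1] + (1 - p) * d[:, 0]))

def advantage(f, p, q, mu):
    """|E_{x ~ mu}[f(x) (p(x) - q(x))]| -- the rewritten distinguishing advantage."""
    return abs(np.sum(mu * f * (p - q)))

# A small made-up distinguisher class D (acceptance-probability tables).
D = [np.array([[0.0, 1.0]] * 4),              # accepts iff the outcome is 1
     np.array([[1.0, 0.0], [0.0, 1.0]] * 2)]  # accepts iff o equals x mod 2

# The two forms of the advantage agree, as in the f_d rewriting above.
d = D[0]
assert np.isclose(abs(accept_prob(d, p, mu) - accept_prob(d, p_prime, mu)),
                  advantage(f_of(d), p, p_prime, mu))

# Dual Minkowski semi-norm ||p - p'||*_{D, mu}: supremum over the finite class.
print(max(advantage(f_of(d), p, p_prime, mu) for d in D))
```

Here $p$ is $(D,\epsilon)$-OI to $p'$ exactly when the printed value is at most $\epsilon$.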

Variants correspond to realizable vs. agnostic learning (the ground truth $p^*$ lies in a known class $P$, or not) and to distribution-specific vs. distribution-free learning (the learner is given $\mu$, or must succeed for every $\mu$).

2. Metric Entropy Characterization in the Distribution-Specific Realizable Setting

In the realizable, distribution-specific case, assume $p^* \in P \subseteq [0,1]^X$ is the unknown target, $D \subseteq [-1,1]^X$ is the distinguisher class, and $\mu$ is fixed.

The central sample complexity measure is the covering number (metric entropy) of $P$ with respect to the dual Minkowski norm:

$$N_{\mu, D}(P, \epsilon) = \min\left\{N:\, \exists p_1, \ldots, p_N \in P,\ \forall p \in P\ \exists i: \|p - p_i\|^*_{D, \mu} \le \epsilon\right\}.$$
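For finite toy instances, a greedy procedure yields an upper bound on this covering number directly from the definition. The sketch below uses made-up data and represents each distinguisher by its function $f_d$; it is an illustration, not the paper's construction.

```python
import numpy as np

def dual_norm(h, D, mu):
    """||h||*_{D, mu} = sup_{f in D} |E_{x ~ mu}[f(x) h(x)]| for a finite class D."""
    return max(abs(np.sum(mu * f * h)) for f in D)

def greedy_cover_size(P, D, mu, eps):
    """Greedy upper bound on N_{mu,D}(P, eps): repeatedly pick an uncovered
    predictor as a new center until every element of P lies within eps of one."""
    centers, uncovered = [], list(P)
    while uncovered:
        c = uncovered[0]
        centers.append(c)
        uncovered = [q for q in uncovered if dual_norm(q - c, D, mu) > eps]
    return len(centers)

# Made-up example: three predictors on a 3-point domain, two distinguishers
# given directly by their f-functions.
mu = np.array([0.5, 0.25, 0.25])
P  = [np.array(v) for v in [(0.1, 0.2, 0.9), (0.15, 0.25, 0.85), (0.9, 0.9, 0.1)]]
D  = [np.array(v) for v in [(1.0, -1.0, 0.0), (0.0, 1.0, -1.0)]]
print(greedy_cover_size(P, D, mu, eps=0.1))
```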

Lower Bound: Packing arguments show that the number $n$ of i.i.d. samples from $\mu_{p^*}$ required by any (possibly improper, randomized) learner satisfies:

$$n_{\mathrm{real}}(P, D, \epsilon, \delta, \mu) \geq \log\left((1-\delta)\, N_{\mu, D}(P, 2\epsilon)\right).$$

Upper Bound: The "Distinguisher-Covering" algorithm computes an approximate $\epsilon/2$-cover of $D$ under the dual norm $\|\cdot\|^*_{P, \mu}$, empirically estimates $\mathbb{E}_{x \sim \mu}[p^*(x) f(x)]$ from the observed outcomes for each $f$ in this cover, and selects the $p \in P$ minimizing the maximum estimation error. It achieves:

$$n_{\mathrm{real}}(P, D, \epsilon, \delta, \mu) \leq O\!\left(\epsilon^{-2}\left[\log N_{\mu, P}(D, \epsilon/2) + \log(1/\delta)\right]\right).$$
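A minimal sketch of this selection step on a finite domain is given below. It assumes a finite cover of $D$ has already been computed and uses made-up toy data, so it illustrates the idea rather than reproducing the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def distinguisher_covering(samples, P, F_cover, mu):
    """Given i.i.d. samples (x_i, o_i) ~ mu_{p*}, estimate E[f(x) o] = E[f(x) p*(x)]
    for each f in a finite cover of D, then return the p in P whose values of
    E_{x ~ mu}[f(x) p(x)] best match these estimates, worst case over the cover."""
    xs = np.array([x for x, _ in samples])
    os = np.array([o for _, o in samples])
    estimates = [np.mean(f[xs] * os) for f in F_cover]

    def worst_error(p):
        return max(abs(np.sum(mu * f * p) - est) for f, est in zip(F_cover, estimates))

    return min(P, key=worst_error)

# Made-up instantiation: uniform mu on 3 points, two candidate predictors.
mu = np.full(3, 1 / 3)
p_star = np.array([0.9, 0.1, 0.5])
P = [np.array([0.9, 0.1, 0.5]), np.array([0.1, 0.9, 0.5])]
F_cover = [np.array([1.0, -1.0, 0.0]), np.array([0.0, 1.0, -1.0])]  # stands in for a cover of D

xs = rng.choice(3, size=2000, p=mu)
os = rng.binomial(1, p_star[xs])
print(distinguisher_covering(list(zip(xs, os)), P, F_cover, mu))
```

The sample requirement comes from estimating all of the cover's expectations simultaneously, which is where the $\epsilon^{-2}[\log N_{\mu,P}(D,\epsilon/2) + \log(1/\delta)]$ dependence arises.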

3. Metric-Entropy Duality: Tight Characterizations

Leveraging the symmetry between covering $P$ by $D$ and covering $D$ by $P$, a metric-entropy duality theorem holds: for any bounded, nonempty $K_1 \subseteq [-M_1, M_1]^X$, $K_2 \subseteq [-M_2, M_2]^X$, and $\epsilon > 0$,

$$\log N_{\mu, K_2}(K_1, \epsilon) \leq c\,(M_1 M_2 / \epsilon)^2 \left(1 + \log N_{\mu, K_1}(K_2, \epsilon/8)\right)$$

for an absolute constant $c > 0$. In particular, plugging in $K_1 = P - P$ and $K_2 = D$ yields nearly tight two-sided bounds:

$$\Omega\!\left(\epsilon^{2} \log N_{\mu, D}(P, 16\epsilon) - 1 + \log(1-\delta)\right) \;\leq\; n_{\mathrm{real}}(P, D, \epsilon, \delta, \mu) \;\leq\; O\!\left(\epsilon^{-4}\log N_{\mu, D}(P, \epsilon/32) + \epsilon^{-2} \log(1/\delta)\right).$$

This duality connects the sample complexity of PAC indistinguishability to metric-entropy duality phenomena in convex geometry. The $\epsilon^{-2}$ term is essential unless convexity further simplifies the setting (Hu et al., 2022).

4. Distribution-Free Characterization via Fat-Shattering Dimension

In the distribution-free agnostic and realizable settings (typically with $P = [0,1]^X$), the sample complexity is governed by the fat-shattering dimension of $D$, denoted $\mathrm{fat}_D(\gamma)$. For any $D$ and $\epsilon, \delta \in (0,1)$:

$$n_{\mathrm{df}}([0,1]^X, D, \epsilon, \delta) = \Theta\left(\epsilon^{-4}\left[\mathrm{fat}_D(\epsilon/25)\,(\log(1/\epsilon))^2 + \log(1/\delta)\right]\right).$$

This result leverages uniform convergence (via the fat-shattering dimension) together with a multiaccuracy boosting algorithm that performs iterative updates: in each round, if there exists $d \in D$ with sufficiently large average discrepancy, $p$ is updated in the direction of $f_d$. Each round uses $O(\epsilon^{-2}\,\mathrm{fat}_D(\cdot) + \log(1/\delta))$ fresh samples and decreases the squared $L_2$ distance to $p^*$ by $\Omega(\epsilon^2)$. Packing arguments establish the matching lower bound.
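For small finite classes, $\mathrm{fat}_D(\gamma)$ can be computed by brute force directly from its definition (there exist witnesses $r_i$ such that every sign pattern is realized with margin $\gamma$). The sketch below is only a definition check on made-up data, with witnesses restricted to a grid; it is not part of the learning algorithm.

```python
import itertools
import numpy as np

def fat_shatters(F, points, gamma, witness_grid):
    """True if some witness assignment r (from the grid) lets F realize every
    sign pattern on `points` with margin gamma: f(x_i) >= r_i + gamma for +1,
    f(x_i) <= r_i - gamma for -1."""
    k = len(points)
    for r in itertools.product(witness_grid, repeat=k):
        realized_all = True
        for signs in itertools.product([-1, +1], repeat=k):
            if not any(all((f[x] >= r[i] + gamma) if s == +1 else (f[x] <= r[i] - gamma)
                           for i, (x, s) in enumerate(zip(points, signs)))
                       for f in F):
                realized_all = False
                break
        if realized_all:
            return True
    return False

def fat_dim(F, domain, gamma, witness_grid=np.linspace(-1, 1, 9)):
    """Largest k such that some k-subset of the domain is gamma-fat-shattered by F."""
    for k in range(len(domain), 0, -1):
        if any(fat_shatters(F, pts, gamma, witness_grid)
               for pts in itertools.combinations(domain, k)):
            return k
    return 0

# Made-up class of distinguisher functions f_d: X -> [-1, 1] on a 3-point domain.
F = [np.array(v) for v in [(1.0, 1.0, -1.0), (1.0, -1.0, 1.0),
                           (-1.0, 1.0, 1.0), (-1.0, -1.0, -1.0)]]
print(fat_dim(F, domain=range(3), gamma=0.5))  # prints 2 for this toy class
```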

5. Separation Between Realizable and Agnostic Regimes

A critical departure from classical PAC theory is the potential for an unbounded separation between realizable and agnostic PAC indistinguishability sample complexity:

| Setting | Realizable sample complexity | Agnostic sample complexity |
| --- | --- | --- |
| $P$ finite or $p^* \in \mathrm{conv}(P)$ | $O(\epsilon^{-2} \log(1/\delta))$ | $\Omega(\lvert P\rvert)$ or $\Omega(\mathrm{fat}_D(\cdot))$ |
| $P = [0,1]^X$, $D$ arbitrary | $O(\epsilon^{-2} \log(1/\delta))$ | $\Theta(\epsilon^{-4}\, \mathrm{fat}_D(\cdot))$ |

Concretely, for $P = \{p_1, p_2\}$ differing at a single point, realizable OI learning is trivial, but agnostic, distribution-free OI requires $\Omega(1/\epsilon^2)$ samples. Under restrictions such as $P$ being symmetric convex, or $D$ containing all $\{\pm 1\}$-valued functions (with $p^*$ and $P$ binary), the rates collapse to $O(\epsilon^{-2}\,\mathrm{VC})$ or to the metric-entropy rate.

6. Algorithms for PAC Indistinguishability

Two principal algorithmic approaches realize the aforementioned sample complexity bounds:

  • Distinguisher-Covering (Realizable, Distribution-Specific):
    • Compute an $\epsilon/2$-cover of $D$ under the dual norm $\|\cdot\|^*_{P, \mu}$.
    • On $n$ samples $(x_i, o_i) \sim \mu_{p^*}$, estimate $\mathbb{E}_{x \sim \mu}[p^*(x) f(x)]$ for each $f$ in the cover.
    • Select the $p \in P$ minimizing the worst empirical error across the covering set.
  • Multiaccuracy Boost (Distribution-Free), sketched in code below:
    • Initialize $p \equiv \tfrac{1}{2}$.
    • Repeat for $T = \Theta(\epsilon^{-2})$ rounds:
      • Draw a fresh batch of $m = \Theta(\epsilon^{-2}\,\mathrm{fat}_D(\cdot) + \log(1/\delta))$ examples.
      • If some $d \in D$ has empirical discrepancy $\geq c\epsilon$, update $p \leftarrow p + c'\epsilon\, f_d$, clipped to $[0,1]$.
      • Otherwise, terminate.

Both algorithms yield nearly tight rates matching the theoretical characterizations given by metric entropy and fat-shattering dimension, respectively.
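For concreteness, the following Python sketch implements the Multiaccuracy Boost loop on a finite domain. The constants `c` and `c_prime`, the batch size, and the toy target and distinguisher class are illustrative assumptions, not the exact parameters from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(1)

def multiaccuracy_boost(sample_batch, D, domain_size, eps, c=0.5, c_prime=0.5):
    """Iteratively remove large average discrepancies against distinguishers in D,
    given as a list of functions f_d: X -> [-1, 1] (arrays over the domain)."""
    p = np.full(domain_size, 0.5)                 # initialize p = 1/2
    for _ in range(int(np.ceil(1 / eps ** 2))):   # T = Theta(eps^-2) rounds
        xs, os = sample_batch()                   # fresh batch each round
        # Empirical discrepancy E[f(x)(o - p(x))], i.e. the signed advantage, per f.
        discrepancies = [np.mean(f[xs] * (os - p[xs])) for f in D]
        i = int(np.argmax(np.abs(discrepancies)))
        if abs(discrepancies[i]) < c * eps:
            break                                 # no distinguisher succeeds: done
        # Update in the direction of f_d and clip back to [0, 1].
        p = np.clip(p + c_prime * eps * np.sign(discrepancies[i]) * D[i], 0.0, 1.0)
    return p

# Made-up usage: uniform mu over 4 points, a hidden target p_star.
mu, p_star = np.full(4, 0.25), np.array([0.9, 0.2, 0.7, 0.4])
D = [np.array(v, dtype=float) for v in [(1, -1, 0, 0), (0, 0, 1, -1), (1, 1, -1, -1)]]

def sample_batch(m=4000):
    xs = rng.choice(4, size=m, p=mu)
    return xs, rng.binomial(1, p_star[xs])

print(multiaccuracy_boost(sample_batch, D, domain_size=4, eps=0.05))
```

The returned predictor is close to $p^*$ only in the directions that $D$ can test, which is the OI guarantee rather than full $L_2$ recovery.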

7. Mathematical Constructs and Significance

The theory centers on two geometric-combinatorial constructs:

  • Dual Minkowski Norm: For $h \in [-1,1]^X$ and $D \subseteq [-1,1]^X$,

$$\|h\|^*_{D, \mu} := \sup_{f \in D}\left|\mathbb{E}_{x \sim \mu}[h(x)f(x)]\right|.$$

  • Metric Entropy (Covering Number):

$$\mathcal{N}(\epsilon, P, \|\cdot\|^*) := \min\left\{N: \exists\, p_1, \ldots, p_N \in P,\ P \subseteq \bigcup_{i=1}^N \{p: \|p-p_i\|^* \le \epsilon\} \right\}.$$

These constructs underlie both the upper/lower bounds and the duality results. The theory provides the first tight, general characterization of the number of samples needed to ensure $(D, \epsilon)$-indistinguishability, yielding a continuum of learning-theoretic settings from classical PAC to fully agnostic $L_1$-learning as $D$ varies (Hu et al., 2022).
