
B-PAC Reasoning: Safety & Efficiency

Updated 6 February 2026
  • B-PAC reasoning is an online decision-making framework that combines inverse propensity scoring, betting martingales, and anytime-valid PAC guarantees to ensure safety and efficiency.
  • It dynamically adjusts the use of high-cost models based on a statistical betting process that monitors uncertainty via thresholding to control risk.
  • Empirical results demonstrate that B-PAC meets stringent performance guarantees and substantially reduces expensive model invocations in non-stationary environments.

Betting Probably Approximately Correct (B-PAC) reasoning refers to a class of online decision-making frameworks that certify performance to within a user-specified tolerance $\epsilon$ with high probability $1-\alpha$, while minimizing invocations of an expensive, high-accuracy “thinking model.” The B-PAC paradigm integrates sequential importance sampling, statistical betting martingales, and anytime-valid PAC (Probably Approximately Correct) guarantees. It addresses scenarios where only partial feedback on errors is available and the system must remain robust to non-stationary query streams, yielding practical and theoretically sound online reasoning mechanisms for large reasoning models and boundedly rational agents (Yu et al., 30 Jan 2026, Oesterheld et al., 2023).

1. Problem Setting: Online Reasoning under Partial Feedback

B-PAC reasoning is motivated by the deployment of two black-box models: a high-accuracy, high-cost “thinking model” $f:\mathcal{X}\to\mathcal{Y}$, and a low-cost, lower-accuracy “non-thinking model” $f':\mathcal{X}\to\mathcal{Y}$. At each time $t$, a query $X_t$ is drawn from a possibly non-stationary stream. The output $f(X_t)$ is available only when the expensive model is invoked. The decision $f_t(X_t)\in\{f'(X_t),f(X_t)\}$ is determined online by a composite policy.

Crucially, the true response $Y_t$ is never directly observed; performance loss is measured relative to $f(X_t)$ as $\ell(f_t(X),f(X)) \in [0,1]$, where $\ell$ is a loss function. The population risk at time $t$ under a thresholding policy is defined as

$$R_t(f_t) = \mathbb{E}_X[\ell(f_t(X), f(X))].$$

The key objective is to guarantee, with high probability $\geq 1-\alpha$, that $R_t(f_t) < \epsilon$ for all $t$ while minimizing invocations of the expensive model.
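As a concrete illustration of this setting, the sketch below (all models and names are illustrative stand-ins, not from the paper) measures the empirical risk of the always-delegate policy under a 0/1 disagreement loss relative to the expensive model's output:

```python
def f(x):
    """Stand-in for the expensive, high-accuracy "thinking" model."""
    return x % 7

def f_prime(x):
    """Stand-in for the cheap model: disagrees with f on 1 in 5 queries."""
    return x % 7 if x % 5 else (x % 7) + 1

def loss(y_policy, y_expensive):
    """0/1 disagreement loss l(f_t(X), f(X)) in [0, 1]."""
    return 0.0 if y_policy == y_expensive else 1.0

# Empirical risk of always delegating to f' on a query stream; such a
# policy is only safe if f' disagrees with f rarely enough.
stream = range(1000)
risk = sum(loss(f_prime(x), f(x)) for x in stream) / len(stream)
print(risk)  # 0.2 here, so always delegating would violate eps = 0.08
```

In this toy stream the cheap model errs on 20% of queries, so a B-PAC policy would have to route the high-uncertainty queries to $f$ to keep risk under the tolerance.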

2. B-PAC Objective: Simultaneous Safety and Efficiency

The dual mandate of B-PAC reasoning is:

  • Safety: Enforce the uniform bound $R_t(f_t) \leq \epsilon$ at all times $t$ with probability $\geq 1-\alpha$ (the “anytime $(\epsilon,\alpha)$-PAC efficiency guarantee”).
  • Efficiency: Minimize the cumulative fraction of queries routed to $f$, measured as

$$\mathrm{ECP}_t = \frac{1}{t} \sum_{i=1}^t \mathbb{I}\{\text{call } f \text{ at } i\},$$

or the equivalent token-level savings.

To achieve both objectives, B-PAC dynamically adapts a threshold $\hat{u}_t$ on a scalar uncertainty score $U_t$ for $f'$. If $U_t < \hat{u}_t$, $f'$ is used; otherwise, $f$ is invoked. Raising $\hat{u}_t$ expands the usage of $f'$, but only when statistical evidence, quantified via a betting process, suffices to certify that the performance loss remains within $\epsilon$ (Yu et al., 30 Jan 2026).
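A minimal sketch of this routing rule together with the $\mathrm{ECP}_t$ metric (function and variable names are illustrative; exploration, covered in Section 4, is folded in as a fixed probability of calling $f$ even below threshold):

```python
import random

def route(u_t, u_hat, p_explore, rng):
    """One routing decision. Returns (call_expensive, t_prob), where
    t_prob is the sampling probability T_t kept for IPS reweighting."""
    if u_t >= u_hat:
        return True, 1.0                    # at/above threshold: call f
    return rng.random() < p_explore, p_explore  # below: explore w.p. p_t

rng = random.Random(0)
calls = [route(u, u_hat=0.5, p_explore=0.05, rng=rng)[0]
         for u in [0.9, 0.1, 0.7, 0.2]]
ecp = sum(calls) / len(calls)  # ECP_t: cumulative expensive-call fraction
```

Raising `u_hat` shrinks the set of queries that hit the first branch, which is exactly the efficiency lever that the betting process must certify before it is pulled.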

3. Core Methodology: IPS Risk Estimation and Betting Martingales

The system must estimate the risk of delegating to $f'$ under selective feedback. Since $\ell(f'(X_t),f(X_t))$ is observed only when $f$ is called, B-PAC employs an inverse propensity scoring (IPS) estimator to correct selection bias. For any threshold $u$, the estimator is:

$$Z_t(u) = (1 - p_{\min})\sum_{i=1}^t \frac{\mathbb{I}\{U_i < u\}\,\mathbb{I}\{\theta_i = 1\}\,\ell(f'(X_i), f(X_i))}{T_i},$$

where $T_i$ is the sampling probability (which depends on $U_i$ and previous thresholds), $\theta_i \sim \mathrm{Bernoulli}(T_i)$ indicates whether $f$ was called, and $p_{\min} = \inf_t p_t$ is the minimal exploration rate.
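The estimator can be sketched as follows (a hypothetical helper, assuming the observed loss $\ell(f'(X_i), f(X_i))$ enters the numerator whenever $f$ was called, so that dividing by $T_i$ unbiases the selectively observed losses):

```python
def ips_risk(records, u, p_min):
    """IPS estimate Z_t(u). Each record is a tuple
    (u_i, theta_i, t_i, loss_i): uncertainty score, whether f was called,
    the sampling probability T_i, and the loss (observed only if theta_i)."""
    total = 0.0
    for u_i, theta_i, t_i, loss_i in records:
        if u_i < u and theta_i:
            total += loss_i / t_i  # reweight observed losses by 1/T_i
    return (1.0 - p_min) * total

# Three rounds; the loss in round 2 was never observed (theta_i = 0).
records = [(0.2, 1, 0.5, 1.0), (0.3, 0, 0.5, None), (0.1, 1, 0.5, 0.0)]
z = ips_risk(records, u=0.4, p_min=0.05)  # (1 - 0.05) * (1.0 / 0.5)
```

Because each observed loss is divided by its own sampling probability $T_i$, rounds where $f$ happened not to be called contribute nothing yet introduce no bias in expectation.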

A “test supermartingale” is constructed as a wealth process $K_t(u)$. At each round, a nonnegative, predictable bet $A_t(u)$ is placed, and

$$K_t(u) = K_{t-1}(u)\bigl(1 + A_t(u) D_t(u)\bigr)$$

with $D_t(u) = \epsilon - Z_t(u)$. Under the null hypothesis that the true risk at threshold $u$ exceeds $\epsilon$, $K_t(u)$ forms a supermartingale, ensuring anytime-valid control via Ville’s inequality (Yu et al., 30 Jan 2026).
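The wealth recursion can be sketched per round (a simplified sketch treating the payoff as $\epsilon$ minus a per-round risk-estimate increment; variable names are illustrative):

```python
def wealth_update(k_prev, bet, d_t):
    """One betting round: K_t = K_{t-1} * (1 + A_t * D_t). The bet A_t
    must be predictable (chosen before D_t is revealed) and small enough
    to keep the wealth nonnegative."""
    return k_prev * (1.0 + bet * d_t)

eps, alpha = 0.08, 0.1
k = 1.0
for z_t in [0.02, 0.01, 0.03, 0.00]:  # per-round risk-estimate increments
    k = wealth_update(k, bet=0.5, d_t=eps - z_t)

# By Ville's inequality, wealth crossing 1/alpha certifies risk <= eps
certified = k >= 1.0 / alpha
```

Wealth grows whenever the estimated risk stays below $\epsilon$, and the $1/\alpha$ crossing level is exactly what makes the certificate anytime-valid: under the null, the chance the supermartingale ever reaches $1/\alpha$ is at most $\alpha$.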

4. Thresholding Algorithm and Theoretical Guarantees

A finite grid of candidate thresholds $U = \{u^{(1)},\ldots,u^{(N)}\}$ is maintained. At each round $t$:

  1. Compute $f'(X_t)$ and $U_t$.
  2. With probability $T_t$ (determined by comparison to $\hat{u}_{t-1}$ and the exploration parameter $p_t$), decide whether to call $f$ (i.e., $\theta_t \sim \mathrm{Bernoulli}(T_t)$).
  3. Update the IPS estimator $Z_t(u)$, betting fraction $A_t(u)$, and wealth $K_t(u)$ for each $u$.
  4. Set $\hat{u}_t$ to the largest $u^{(i)}$ for which $K_t(u^{(j)}) \geq 1/\alpha$ for all $j \leq i$; otherwise, set $\hat{u}_t = 0$.

The principal safety guarantee is: with probability $\geq 1-\alpha$, for all $t$, $R_t(\hat{u}_t) \leq \epsilon$ under i.i.d. queries or certain non-stationary regimes. The adaptive betting strategy, based on online convex optimization (using a projected fraction $A_t(u)$ incorporating a second-order Taylor surrogate), ensures logarithmic regret $O(\log T)$ relative to the best fixed threshold, enabling rapid convergence to optimal efficiency (Yu et al., 30 Jan 2026).
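Step 4 of the loop above, the prefix-certified threshold selection, can be sketched as (names illustrative):

```python
def select_threshold(grid, wealth, alpha):
    """Largest grid point whose entire prefix of wealth processes has
    crossed the 1/alpha evidence level; 0.0 if even the smallest has not."""
    u_hat = 0.0
    for u in sorted(grid):
        if wealth[u] < 1.0 / alpha:
            break          # prefix condition fails at u and beyond
        u_hat = u
    return u_hat

wealth = {0.2: 15.0, 0.4: 12.0, 0.6: 3.0}  # current K_t(u) per candidate
u_hat = select_threshold([0.2, 0.4, 0.6], wealth, alpha=0.1)  # -> 0.4
```

Requiring the whole prefix $j \leq i$ to be certified (rather than a single grid point) prevents the policy from jumping to a large threshold whose smaller neighbors have not yet accumulated evidence.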

5. Connections to Bounded Inductive Rationality and Hypothesis Betting

B-PAC style reasoning generalizes to settings explored in the theory of bounded inductive rationality (Oesterheld et al., 2023). In this broader context, boundedly rational inductive agents (BRIAs) sequentially choose among a finite set of options $\mathcal{DP}_t$ each round, using an internal “betting” process over a countable family of hypotheses $\mathcal{H}$ with computable recommendations and reward promises. BRIAs maintain a policy that never overestimates rewards and test each hypothesis infinitely often. The agent’s “wealth” for each hypothesis is updated via allowance schedules and observed payoffs. Coverage of hypotheses via statistical tests ensures that if any hypothesis promises consistently higher rewards, it is followed sufficiently often.

The general PAC-like guarantee in this setting asserts that, when a valid, high-reward hypothesis exists, the agent’s empirical mean reward converges with high probability to within $\epsilon$ of the best expected reward, provided payoffs along the hypothesis’ trajectory are sufficiently random (e.g., von Mises–Wald–Church or Schnorr randomness). In repeated games, pairs of BRIAs can implement any strictly individually rational correlated equilibrium, with empirical plays converging accordingly (Oesterheld et al., 2023).

6. Implementation and Empirical Results

B-PAC reasoning incurs negligible computational overhead compared to large-model inference: $O(N)$ per-round updates for a threshold grid of $N \approx 1000$ are typical. Exploration probabilities $p_t$ are managed by a two-stage schedule: a warm-up phase with $p_{\text{warm}} \approx 0.7$ for the initial $T_{\text{warm}} \approx 200$ rounds, then $p_{\text{deploy}} \approx 0.05$ for steady-state operation. Asynchronous or sharded implementations of the betting martingale allow for high-throughput deployment (Yu et al., 30 Jan 2026).
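The two-stage exploration schedule reported above is straightforward to implement (function name is illustrative):

```python
def exploration_prob(t, t_warm=200, p_warm=0.7, p_deploy=0.05):
    """p_t for round t: high exploration during warm-up so the IPS
    estimator sees enough expensive-model labels, then a low steady rate."""
    return p_warm if t <= t_warm else p_deploy

probs = [exploration_prob(t) for t in (1, 200, 201, 10_000)]
```

Under this schedule $p_{\min} = \inf_t p_t = 0.05$, the quantity that appears in the $(1 - p_{\min})$ correction factor of the IPS estimator.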

Empirical benchmarks on MATH, MMLU-Pro, BIG-Bench Hard (BBH), and Magpie demonstrate substantial computational savings with stringent safety. For example, on Magpie with parameters $\epsilon = 0.08$, $\alpha = 0.1$, B-PAC maintains empirical loss $\mathrm{ER} \approx 0.03 < 0.08$ throughout, invokes the expensive model on only $19\%$ of queries, and reserves $59\%$ of tokens for it, outperforming offline PAC thresholds and ablation baselines that either violate safety or are overly conservative. Under non-stationary drifts, B-PAC adaptively tightens the threshold to preserve the safety guarantee, in contrast to fixed offline PAC (Yu et al., 30 Jan 2026).

7. Comparative Analysis, Limitations, and Significance

B-PAC’s integration of IPS estimation, betting supermartingales, and dynamic thresholding provides anytime-valid, model-agnostic, online control of reasoning risks, substantially generalizing classical PAC approaches and outperforming heuristic routing or simple union-bound-based methods. Naive estimators tend to violate safety or incur prohibitive expert-call rates ($\approx 100\%$). Classical Hoeffding-style methods are conservative, achieving safety at the expense of efficiency. Heuristic policies such as Chain-of-Draft or NoThinking do not reliably control loss.

A plausible implication is that B-PAC offers a broadly applicable, robust strategy for AI reasoning architectures under uncertainty and limited feedback, and connects deeply to theories of inductive rationality in both single-agent and multi-agent settings. The folk theorem for BRIAs reinforces that B-PAC-type “betting” approaches are not artifacts of black-box model selection, but arise naturally in settings where optimality and bounded rationality must be simultaneously addressed (Oesterheld et al., 2023).
